I've got defective input coming in that looks like this...
foo<p>bar</p>
And I want to normalize it to wrap the leading text in a p tag:
<p>foo</p><p>bar</p>
This is easy enough with a regex replace of /^([^<]+)/ with <p>$1</p>. The problem is that sometimes the leading chunk contains tags other than p, like so:
foo <b>bold</b><p>bar</p>
This should wrap the whole chunk in a new p:
<p>foo <b>bold</b></p><p>bar</p>
But since the simple regex looks only for <, it stops at <b> and spits out:
<p>foo </p><b>bold</b><p>bar</p> <!-- oops -->
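For reference, here is a minimal reproduction of both cases. The variable names are only illustrative:

```javascript
// Simple version: wrap everything up to the first "<" in a <p> tag.
const simple = /^([^<]+)/;
const wrap = '<p>$1</p>';

// Works when the leading text is followed directly by <p>.
const okCase = 'foo<p>bar</p>'.replace(simple, wrap);

// Breaks when an inline tag such as <b> comes before the first <p>.
const badCase = 'foo <b>bold</b><p>bar</p>'.replace(simple, wrap);

console.log(okCase);  // <p>foo</p><p>bar</p>
console.log(badCase); // <p>foo </p><b>bold</b><p>bar</p>
```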
So how do I rewrite the regex so that it stops only at <p, rather than at any <? Apparently the answer involves negative lookahead, but this is a bit too deep for me.
(And before the inevitable "you can't parse HTML with regexes!" comment: the input is not random HTML, but plain text annotated with only the tags <p>, <a>, <b> and <i>, and a/b/i may not be nested.)
I think you actually want positive lookahead. It's really not bad:
/^([^<]+)(?=<p)/
You just want to make sure that whatever comes after < is p, but you don't want to actually consume <p, so you use a lookahead.
Examples:
> var re = /^([^<]+)(?=<p)/g;
> 'foo<p>bar</p>'.replace(re, '<p>$1</p>');
"<p>foo</p><p>bar</p>"
> 'foo <b>bold</b><p>bar</p>'.replace(re, '<p>$1</p>')
"foo <b>bold</b><p>bar</p>"
Sorry, I wasn't clear enough in my original post: my expectation was that the "foo bold" bit would also get wrapped in a new p tag, and that's not happening. Also, every now and then there's input with no p tags at all (just plain foo), and that should also map to <p>foo</p>.
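The easiest way I found to get this working is to use two separate regexps, /^(.+?(?=<p))/ and /^([^<]+)/. The first lazily matches everything up to the first <p, inline tags included; the second covers input that contains no p tag at all.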
> var re1 = /^(.+?(?=<p))/g,
      re2 = /^([^<]+)/g,
      s = '<p>$1</p>';
> 'foo<p>bar</p>'.replace(re1, s).replace(re2, s);
"<p>foo</p><p>bar</p>"
> 'foo'.replace(re1, s).replace(re2, s);
"<p>foo</p>"
> 'foo <b>bold</b><p>bar</p>'.replace(re1, s).replace(re2, s);
"<p>foo <b>bold</b></p><p>bar</p>"
It's possible to write a single, equivalent regexp by combining re1 and re2:
/^(.+?(?=<p)|[^<]+)/
> var re3 = /^(.+?(?=<p)|[^<]+)/g,
      s = '<p>$1</p>';
> 'foo<p>bar</p>'.replace(re3, s)
"<p>foo</p><p>bar</p>"
> 'foo'.replace(re3, s)
"<p>foo</p>"
> 'foo <b>bold</b><p>bar</p>'.replace(re3, s)
"<p>foo <b>bold</b></p><p>bar</p>"
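If it helps, the combined pattern can be wrapped in a small function. This is only a sketch: the name wrapLeadingText and the guard for input that already starts with a p tag are my own assumptions, not something the question asked for. The g flag is dropped because the pattern is anchored with ^ and can match at most once.

```javascript
// Hypothetical helper; the name and the early-return guard are assumptions.
function wrapLeadingText(input) {
  // If the text already starts with a <p> tag, leave it alone. Without this
  // guard, '<p>a</p><p>b</p>' would have its first paragraph wrapped again.
  if (/^<p[\s>]/.test(input)) return input;

  // First alternative: lazily match up to the first "<p" (the lookahead
  // checks for it without consuming it).
  // Second alternative: if there is no "<p" at all, match the leading run
  // of characters that are not "<".
  return input.replace(/^(.+?(?=<p)|[^<]+)/, '<p>$1</p>');
}

console.log(wrapLeadingText('foo<p>bar</p>'));
console.log(wrapLeadingText('foo'));
console.log(wrapLeadingText('foo <b>bold</b><p>bar</p>'));
console.log(wrapLeadingText('<p>foo</p><p>bar</p>'));
```

Two caveats, as far as I can tell: . does not match line breaks, so leading text that spans several lines would need [\s\S] or the s flag instead; and input with inline tags but no p tag at all (for example foo <b>bold</b>) falls through to the second alternative, so only the text before the first tag gets wrapped.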