
Skip preceding html entities in javascript regex

Developer https://www.devze.com 2023-03-12 10:35 Source: web

I'm using regex snippets to parse smileys into images, and I'm running into problems with the semicolon. For example, a smiley like ;) turns into a WINK icon by matching against

/;-?\)/g

and it works in most cases. But text like ") is also turned into "WINK, because the quotation mark is actually an HTML entity (&quot;) => &quotWINK).
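The failure is easy to reproduce. The following is a minimal sketch, assuming the text is processed in its HTML-escaped form; the sample string and the variable names are made up for illustration.

```javascript
// The quotation mark arrives as the entity &quot; so the entity's own
// trailing semicolon sits directly in front of the closing parenthesis.
const escaped = '&quot;) said the parser';

// The pattern from the question: a semicolon, an optional dash, a paren.
const naive = /;-?\)/g;

const naiveResult = escaped.replace(naive, 'WINK');
console.log(naiveResult); // "&quotWINK said the parser"
```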

I tried prefixing the regex with a greedy non-capturing match to discard semicolons in entities:

(?:&quot;|&amp;|&lt;|&gt;|&apos;|&#039;)?

But the resulting pattern still matches against the semicolon in &quot; that should be skipped, because it backtracks to satisfy the non-optional latter portion. I also realized there'd still be problems with other legitimate matches, such as EVIL: >:) => &gt;:).
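Both problems with an optional prefix can be seen in a short sketch. It assumes the prefix lists the entities with their trailing semicolons; the variable names are illustrative only.

```javascript
// Optional, non-capturing entity prefix in front of the smiley pattern.
const optionalPrefix = /(?:&quot;|&amp;|&lt;|&gt;|&apos;|&#039;)?;-?\)/g;

// Problem 1: the prefix is optional, so the engine simply matches it as
// empty at the entity's semicolon and the unwanted match still happens.
const stillMatches = '&quot;)'.replace(optionalPrefix, 'WINK');
console.log(stillMatches); // "&quotWINK"

// Problem 2: when the prefix does match, it becomes part of the match
// and is replaced along with the smiley, so the entity is lost.
const swallowsEntity = '&quot;;)'.replace(optionalPrefix, 'WINK');
console.log(swallowsEntity); // "WINK"
```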

So what it appears I really need is the negation of preceding html entities missing a semicolon:

(?!&quot|&amp|&lt|&gt|&apos|&#039)

But it is still matching and I'm not sure why.
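A small sketch suggests the reason. A lookahead is tested against the text that starts at the current position. At the point where the smiley could match, that text begins with the semicolon, never with an entity name, so the negative lookahead always passes. The variable names here are illustrative.

```javascript
// Negative lookahead placed in front of the smiley pattern.
const lookahead = /(?!&quot|&amp|&lt|&gt|&apos|&#039);-?\)/g;

// At the semicolon, the text ahead is ";)", which is not "&quot" or any
// other alternative, so the assertion succeeds and the match proceeds.
const lookaheadResult = '&quot;)'.replace(lookahead, 'WINK');
console.log(lookaheadResult); // "&quotWINK"
```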

It would be ideal to still get returned matches that can be replaced wholesale without further inspection, but I'm open to suggestions. What is not suitable is first parsing out the html entities, because sometimes they're necessary and/or part of a legitimate smiley (as with EVIL).


EDIT (some Google food):

I discovered (and Bryan also noted below) that zero-width negative lookbehind, (?<!regex), would work as desired (not zero-width negative lookahead, (?!regex)).

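As per regular-expressions.info, the latter "will only succeed if the regex inside the lookahead fails to match", which sounds right but doesn't help here: the lookahead inspects the text starting at the current position, which is the semicolon itself, so it never sees the entity name in front of it and always succeeds.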

In contrast, negative lookbehind succeeds at a position only if the pattern inside it cannot be matched ending at that position, which does the trick. Because the check looks behind the semicolon, there's no chance of backtracking to satisfy the latter portion of the regex.

So a full regex looks like:

/(?<!&quot|&amp|&lt|&gt|&apos|&#039);-?\)/g

and that matches these:

;) => WINK
blah;) => blahWINK
&quot;;) => &quot;WINK

while failing this:

&quot;)

It does, however, still match &amp;quot;) => &amp;quotWINK, so more tweaking would be ideal (such as additionally matching semicolons in place of the ampersands, if that doesn't cause other smileys with entities in them to break). People typing out HTML entities in chat isn't likely to come up much anyway.

Either way would be "good enough", except that JavaScript didn't support negative lookbehind when this was written (lookbehind assertions only arrived with ECMAScript 2018). But it's worth explaining for the sake of other regex implementations.
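In an engine that does implement lookbehind assertions (ECMAScript 2018 or later, such as current Node.js), the full regex can be checked against the cases listed above. This is a sketch under that assumption; on older engines the regex literal itself is a syntax error. The helper name is illustrative.

```javascript
// Negative lookbehind: reject a semicolon that closes a known entity.
const lookbehind = /(?<!&quot|&amp|&lt|&gt|&apos|&#039);-?\)/g;
const wink = (s) => s.replace(lookbehind, 'WINK');

console.log(wink(';)'));           // "WINK"
console.log(wink('blah;)'));       // "blahWINK"
console.log(wink('&quot;;)'));     // "&quot;WINK"
console.log(wink('&quot;)'));      // "&quot;)" (left alone)
console.log(wink('&amp;quot;)'));  // "&amp;quotWINK" (the remaining edge case)
```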


First off, you could just check for whitespace characters preceding the ;-), so that's a simpler option. But if you do want to emulate negative lookbehind in JavaScript, which exists in several flavors of regex but was not part of the JS one before ECMAScript 2018, you could do something like this:

var text = "& &amp; &amp;-) ;-) test;-)";
var ENTITIES_REGEX = /(&quot|&amp|&lt|&gt|&apos|&#039)?;-\)/g;

var result = text.replace(ENTITIES_REGEX, function(fullMatch, backref1) {
  // Ignore if there is a backreference by returning the unaltered
  // match, otherwise return WINK
  return (backref1 ? fullMatch : 'WINK');
});

// result equals "& &amp; &amp;-) WINK testWINK"

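The simpler whitespace option mentioned at the start of the answer could look roughly like the sketch below. It assumes a smiley only counts at the start of the text or right after whitespace; the names are illustrative. The leading whitespace is captured and put back with $1 so that it is not lost in the replacement.

```javascript
// Match a smiley only at the start of the text or after whitespace.
const afterSpace = /(^|\s);-?\)/g;
const winkAfterSpace = (s) => s.replace(afterSpace, '$1WINK');

console.log(winkAfterSpace(';-) hi ;)')); // "WINK hi WINK"
console.log(winkAfterSpace('&quot;)'));   // "&quot;)" (entity left alone)
console.log(winkAfterSpace('test;-)'));   // "test;-)" (no longer converted)
```

The trade-off is visible in the last line: unlike the callback version, a smiley glued to the preceding word is left untouched.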
