Matching a string only if it is not in <script> or <a> tags

I'm working on a browser plugin that replaces all instances of "someString" (as defined by a complicated regex) with <a href="http://domain.com/$1">$1</a>. This generally works ok just doing a global replace on the body's innerHTML. However it breaks the page when it finds (and replaces) the "someString" inside <script> tags (i.e. as a JS variable or other JS reference). It also breaks if "someString" is already part of an anchor.

So basically I want to do a global replace on all instances of "someString" unless it falls inside a <script></script> or <a></a> tag set.

Essentially what I have now is:

var body = document.getElementsByTagName('body')[0].innerHTML;
body = body.replace(/(someString)/gi, '<a href="http://domain.com/$1">$1</a>');
document.getElementsByTagName('body')[0].innerHTML = body;

But obviously that's not good enough. I've been struggling for a couple hours now and reading all of the answers here (including the many adamant ones that insist regex should not be used with HTML), so I'm open to suggestions on how to do this. I'd prefer using straight JS, but can use jQuery if necessary.

Edit - Sample HTML:

<body>
  someString
  <script type="text/javascript">
  var someString = 'blah';
  console.log(someString);
  </script>
  <a href="someString.html">someString</a>
</body>

In that case, only the very first instance of "someString" should be replaced.

Try this and see if it meets your needs (tested in IE 8 and Chrome).

<script src="jquery-1.4.4.js" type="text/javascript"></script>
<script>
  var pattern = /(someString)/gi;
  var replacement = "<a href=\"http://domain.com/$1\">$1</a>";

  $(function() {
    $("body :not(a,script)")
      .contents()
      .filter(function() { 
        return this.nodeType == 3 && this.nodeValue.search(pattern) != -1;
      })
      .each(function() {
        var span = document.createElement("span");
        span.innerHTML = "&nbsp;" + $.trim(this.nodeValue.replace(pattern, replacement));
        this.parentNode.insertBefore(span, this);
        this.parentNode.removeChild(this);
      });
  });
</script>

The code uses jQuery to find all the text nodes within the document's <body> that are not in <a> or <script> blocks and that contain the search pattern. Once those are found, a span containing the target node's modified content is inserted, and the old text node is removed.

The only issue I saw was that IE 8 handles text nodes containing only whitespace differently than Chrome, so sometimes a replacement would lose a leading space, hence the insertion of the non-breaking space before the text containing the regex replacements.
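For comparison, the same idea can be sketched in plain JavaScript without jQuery. This is only a sketch under assumptions: splitByPattern and linkifyTextNodes are hypothetical helper names, not part of any library, and the DOM part relies on document.createTreeWalker, which old IE versions such as IE 8 do not support. It builds the links with createElement rather than innerHTML, and it checks every ancestor of a text node, so text nested deeper inside an <a> (for example within a <b>) is skipped as well.

```javascript
// Hypothetical helper: split a string into plain-text and matching segments.
function splitByPattern(text, pattern) {
  // Copy the regex with the global flag so exec() advances through the text.
  var flags = pattern.flags.indexOf('g') === -1 ? pattern.flags + 'g' : pattern.flags;
  var re = new RegExp(pattern.source, flags);
  var segments = [];
  var last = 0;
  var m;
  while ((m = re.exec(text)) !== null) {
    if (m[0] === '') { re.lastIndex++; continue; } // guard against empty matches
    if (m.index > last) {
      segments.push({ match: false, value: text.slice(last, m.index) });
    }
    segments.push({ match: true, value: m[0] });
    last = m.index + m[0].length;
  }
  if (last < text.length) {
    segments.push({ match: false, value: text.slice(last) });
  }
  return segments;
}

// Hypothetical helper (browser only): wrap matches in links, skipping any
// text node that has an <a> or <script> ancestor.
function linkifyTextNodes(root, pattern) {
  var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT, {
    acceptNode: function (node) {
      for (var el = node.parentNode; el && el.nodeType === 1; el = el.parentNode) {
        var name = el.nodeName.toLowerCase();
        if (name === 'a' || name === 'script') return NodeFilter.FILTER_REJECT;
      }
      return NodeFilter.FILTER_ACCEPT;
    }
  });
  // Collect first, then modify, so the walk is not disturbed by the edits.
  var targets = [];
  while (walker.nextNode()) targets.push(walker.currentNode);
  targets.forEach(function (textNode) {
    var segments = splitByPattern(textNode.nodeValue, pattern);
    if (!segments.some(function (s) { return s.match; })) return;
    var frag = document.createDocumentFragment();
    segments.forEach(function (s) {
      if (s.match) {
        var a = document.createElement('a');
        a.href = 'http://domain.com/' + s.value;
        a.textContent = s.value;
        frag.appendChild(a);
      } else {
        frag.appendChild(document.createTextNode(s.value));
      }
    });
    textNode.parentNode.replaceChild(frag, textNode);
  });
}

// Only runs where a DOM is available.
if (typeof document !== 'undefined') {
  linkifyTextNodes(document.body, /(someString)/gi);
}
```

Because the text is never re-parsed as HTML, characters such as < or & inside a text node stay as text, and no extra span or non-breaking space is needed.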

You can use XPath with Mozilla (assuming you're writing the plugin for Firefox). The call is document.evaluate. Alternatively, you can use an XPath library (there are a few out there).

var matches = document.evaluate(
    // contains() is case-sensitive, so search for the exact string
    '//*[not(name() = "a") and not(name() = "script") and contains(., "someString")]',
    document,
    null,
    XPathResult.UNORDERED_NODE_ITERATOR_TYPE,
    null
);

Then replace using a callback function:

var callback = function(node) {
    var text = node.nodeValue;
    text = text.replace(/(someString)/gi, '<a href="http://domain.com/$1">$1</a>');
    var div = document.createElement('div');
    div.innerHTML = text;
    // childNodes is live: each insertBefore moves the node out of div,
    // so keep taking the first child until none are left
    while (div.firstChild) {
        node.parentNode.insertBefore(div.firstChild, node);
    }
    node.parentNode.removeChild(node);
};
var nodes = [];
//cache the tree since we want to modify it as we iterate
var node = matches.iterateNext();
while (node) {
    nodes.push(node);
    node = matches.iterateNext();
}
for (var key = 0, length = nodes.length; key < length; key++) {
    node = nodes[key];
    // Check for a Text node
    if (node.nodeType == Node.TEXT_NODE) {
        callback(node);
    } else {
        // Copy the live childNodes list first, since callback() changes it
        var children = Array.prototype.slice.call(node.childNodes);
        for (var i = 0, l = children.length; i < l; i++) {
            if (children[i].nodeType == Node.TEXT_NODE) {
                callback(children[i]);
            }
        }
    }
}
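
An alternative is to let the XPath expression select the text nodes themselves, so no second pass over child nodes is needed. A minimal sketch, assuming the search string contains no double quotes; buildTextXPath and findTextNodes are hypothetical names introduced here. Note that contains() is case-sensitive, unlike the /gi regex used for the replacement.

```javascript
// Hypothetical helper: build an XPath that selects text nodes containing
// `needle` that have none of the `excludedTags` as an ancestor.
function buildTextXPath(needle, excludedTags) {
  if (needle.indexOf('"') !== -1) {
    throw new Error('needle must not contain double quotes');
  }
  var conditions = excludedTags.map(function (tag) {
    return 'not(ancestor::' + tag + ')';
  });
  conditions.push('contains(., "' + needle + '")');
  return '//text()[' + conditions.join(' and ') + ']';
}

// Browser-only part: a snapshot result is not invalidated when the
// document is modified afterwards, so no manual caching loop is needed.
function findTextNodes(needle, excludedTags) {
  var result = document.evaluate(
    buildTextXPath(needle, excludedTags),
    document,
    null,
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}
```

Each node returned by findTextNodes('someString', ['a', 'script']) could then be passed to the callback function shown earlier.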

I know you don't want to hear this, but this doesn't sound like a job for a regex. Regular expressions don't do negative matches very well before becoming complicated and unreadable.

Perhaps this regex might be close enough though:

/>[^<]*(someString)[^<]*</

It captures any instance of someString that falls between a > and a <.
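
To see how close this gets, the pattern can be run against the sample HTML from the question. In this sketch, sampleHtml is just the question's markup pasted into a string, and the global flag is added so that every match is collected. The attribute value in href is skipped because it sits between a < and a >, but the text inside the <script> and <a> elements still matches, since that text also sits between a > and a <.

```javascript
// The question's sample markup, as a string.
var sampleHtml = [
  '<body>',
  '  someString',
  '  <script type="text/javascript">',
  "  var someString = 'blah';",
  '  console.log(someString);',
  '  </script>',
  '  <a href="someString.html">someString</a>',
  '</body>'
].join('\n');

var betweenTags = />[^<]*(someString)[^<]*</g;
var found = [];
var m;
while ((m = betweenTags.exec(sampleHtml)) !== null) {
  // m[1] is the captured someString; m.index is where the leading ">" sits
  found.push({ index: m.index, text: m[1] });
}
console.log(found.length); // 3: body text, script body, anchor text
```

So on its own the pattern rules out attribute values, but a further check is still needed to exclude script and anchor content.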

Another idea is if you do use jQuery, you can use the :contains pseudo-selector.

$('*:contains(someString)').each(function(i)
{
    var markup = $(this).html();
    // modify markup to insert anchor tag
    $(this).html(markup)
});

This will grab any DOM element that contains 'someString' in its text. I don't think it will traverse <script> tags, so you should be good.

You could try the following:

/(someString)(?![^<]*?(<\/a>|<\/script>))/

I didn't test every scenario, but it basically uses a negative lookahead to find the next opening bracket following someString; if that bracket is part of an anchor or script closing tag, it does not match.

Your example seems to work in this fiddle, although it certainly doesn't cover all possibilities. In cases where the innerHTML in your <a></a> contains tags (like <b> or <span>), or the code in your script tags generates HTML (contains strings with tags in them), you would need something more complex.
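
Both the working case and the limitation can be checked quickly on plain strings. A small sketch: sampleHtml is the question's markup as a string, and nestedHtml is a made-up example of an anchor whose content contains another tag. The g and i flags are added to match the question's replace call.

```javascript
var notInAnchorOrScript = /(someString)(?![^<]*?(<\/a>|<\/script>))/gi;
var link = '<a href="http://domain.com/$1">$1</a>';

// The question's sample markup, as a string.
var sampleHtml = [
  '<body>',
  '  someString',
  '  <script type="text/javascript">',
  "  var someString = 'blah';",
  '  console.log(someString);',
  '  </script>',
  '  <a href="someString.html">someString</a>',
  '</body>'
].join('\n');

var replaced = sampleHtml.replace(notInAnchorOrScript, link);
// Only the first someString (the plain text under <body>) gets wrapped.
var linkCount = replaced.split('http://domain.com/').length - 1;
console.log(linkCount); // 1

// Made-up counterexample: the next "<" after someString belongs to </b>,
// not </a>, so the lookahead does not protect it and it is replaced.
var nestedHtml = '<a href="page.html"><b>someString</b></a>';
var nestedChanged = nestedHtml.replace(notInAnchorOrScript, link) !== nestedHtml;
console.log(nestedChanged); // true
```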