
Regex to replace a string in HTML but not within a link or heading

Developer https://www.devze.com 2023-01-02 19:17 Source: web
I am looking for a regex to replace a given string in an HTML page, but only if the string is not part of a tag itself and does not appear as text inside a link or a heading.


Examples:

Looking for 'replace_me'

<p>You can replace_me just fine</p> OK

<a href='replace_me'>replace_me</a> no match

<h3>replace_me</h3> no match

<a href='/test/'><span>replace_me</span></a> no match

<p style="background:url('replace_me')">replace_me<h1>replace_me</h1></p> first no match, second OK, third no match

Thanks in advance!

UPDATE:

I have found a working regex

\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)


Parsing HTML with regex is a Bad Idea that will drive you insane. Using regex on this is probably not quite as bad, but a few things to think about in whatever approach you take:

  1. How many of these are there in a page?
  2. How many pages will you be doing this to?
  3. Will you be hand-checking the output, or is it automated?
  4. Which programming language(s) are you using for this?

I think the best way is not with a "simple" (read: horrendously complicated) regex, but a proper program that has some logic behind it - unless regular expressions are Turing Complete and someone else can provide a regex to do what you want, of course :)
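For what it's worth, here is a minimal sketch of what such a "proper program" might look like in PHP, using the bundled DOM extension instead of a regex. The function name `replaceOutsideLinksAndHeadings` is made up for this example, and it assumes ASCII input, a plain-text replacement (no markup), and simple substring matching rather than whole-word matching.

```php
<?php
// Hypothetical helper (not from the original thread): replace $needle only in
// text nodes that are not inside <a> or <h1>..<h6>.
function replaceOutsideLinksAndHeadings(string $html, string $needle, string $replacement): string
{
    // Collect parser warnings quietly; real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrap the fragment so there is one known container to serialize later.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $skip = 'ancestor::a or ancestor::h1 or ancestor::h2 or ancestor::h3 '
          . 'or ancestor::h4 or ancestor::h5 or ancestor::h6';

    // Attribute values are not text nodes, so href, style etc. are never touched.
    foreach ($xpath->query("//text()[not($skip)]") as $text) {
        $text->data = str_replace($needle, $replacement, $text->data);
    }

    // Serialize only the children of the wrapper, not the implied <html>/<body>.
    $out = '';
    foreach ($xpath->query('//div[@id="wrap"]')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = replaceOutsideLinksAndHeadings(
    '<p title="replace_me">You can replace_me just fine</p><h3>replace_me</h3>',
    'replace_me',
    'REPLACED'
);
echo $result, "\n";
```

The parser may normalise the markup on the way out (quoting, auto-closed tags), so treat the output as equivalent HTML rather than a byte-for-byte copy of the input.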


\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)
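To see how this behaves, here is a small PHP sketch that runs the pattern over the five examples from the question. The `REPLACED` replacement text and the variable names are placeholders chosen for the demo; the pattern itself is unchanged.

```php
<?php
// The pattern from the thread, unchanged. '/' is the delimiter, so the
// existing '\/' escapes are exactly what PCRE needs.
$pattern = '/\b(replace_me)\b(?!(?:(?!<\/?[ha].*?>).)*<\/[ha].*?>)(?![^<>]*>)/';

// The five examples from the question.
$samples = [
    "<p>You can replace_me just fine</p>",
    "<a href='replace_me'>replace_me</a>",
    "<h3>replace_me</h3>",
    "<a href='/test/'><span>replace_me</span></a>",
    "<p style=\"background:url('replace_me')\">replace_me<h1>replace_me</h1></p>",
];

$results = [];
foreach ($samples as $html) {
    $results[] = preg_replace($pattern, 'REPLACED', $html);
}

print_r($results);
// [0] <p>You can REPLACED just fine</p>
// [1] to [3] come back unchanged
// [4] <p style="background:url('replace_me')">REPLACED<h1>replace_me</h1></p>
```

Two caveats follow from how the pattern is written. `.` does not cross newlines, so a heading whose closing tag sits on the next line is not protected. And `[ha]` matches any tag name starting with h or a, so text followed on the same line by `</header>` or `</html>` is skipped as well.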


I had a similar issue - given a string of HTML I wanted to replace all instances of the string tio2 with TiO<sub>2</sub>, and ticl4 with TiCl<sub>4</sub>.

This was easy to accomplish with simple string replacement, but there were some instances where the 'needle' strings occurred in domain names, e.g. www.ilovetio2.com, www.tastytastyticl4.info. In these cases the href attributes would be broken by the string replacement.

Rather than mess around trying to find a single, complex regex I opted to make two passes over the HTML string:

  • Replace ALL instances with str_ireplace
  • Find any href attributes containing <sub>...</sub> and fix them with preg_replace_callback

    public static function subscriptStrings($str)
    {
        // $str is an arbitrary string which may be HTML or plain text

        // Define search / replacements
        $map = [
            'tio2' => 'TiO<sub>2</sub>',
            'ticl4' => 'TiCl<sub>4</sub>'
        ];

        // Replace ALL instances, paying no heed to their context
        $str = str_ireplace(array_keys($map), array_values($map), $str);

        // Make a second pass, specifically looking for double-quoted href values
        $str = preg_replace_callback('/href="[^"]+"/', function ($matches) {
            // Return the whole href attribute stripped of <sub> tags
            return str_replace(['<sub>', '</sub>'], '', $matches[0]);
        }, $str);

        return $str;
    }
    

This is not bulletproof: it will fail if, for some reason, the links in question are supposed to contain <sub> tags.
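The second pass also leaves the href with changed capitalisation (tio2 becomes TiO2), which is harmless for host names but could matter in a path. One possible way around both issues, sketched here as a variant rather than as the method above, is to split the string on tags and only replace in the text between them. The name `subscriptOutsideTags` is made up for this example, and the simple `<[^>]*>` tag pattern assumes no attribute value contains a literal `>`.

```php
<?php
// Hypothetical variant of the two-pass idea: never touch anything inside <...>.
function subscriptOutsideTags(string $html): string
{
    $map = [
        'tio2'  => 'TiO<sub>2</sub>',
        'ticl4' => 'TiCl<sub>4</sub>',
    ];

    // With PREG_SPLIT_DELIM_CAPTURE the pieces alternate text, tag, text, ...
    // so even indexes hold text and odd indexes hold the captured tags.
    $parts = preg_split('/(<[^>]*>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Text between tags: safe to replace, case-insensitively as before.
            $parts[$i] = str_ireplace(array_keys($map), array_values($map), $part);
        }
    }

    return implode('', $parts);
}

$out = subscriptOutsideTags('<a href="http://www.ilovetio2.com">tio2 and TiCl4</a>');
echo $out, "\n";
// <a href="http://www.ilovetio2.com">TiO<sub>2</sub> and TiCl<sub>4</sub></a>
```

Plain text without any tags comes back as a single piece, so it is handled the same way as before.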
