Removing all tags except a few whitelisted ones with a regular expression

Developer https://www.devze.com 2023-02-07 06:58 Source: Internet

I have a text with some HTML-like tags, which I would like to remove. I only want to allow about a dozen whitelisted tags, like <b> or <i>. I can't use PHP's strip_tags(), as I need a more general solution using regular expressions (some of my other tags use different conventions, for example [tag] instead of <tag>). How do I achieve this effect?

The regular expression I use right now is:

return preg_replace('/ \<[^\>]+\>/', '', $text);

How should I change it to exclude the tags I mentioned? I looked through similar questions but they don't provide a solution to the specific problem I mentioned here.

If you can't use PHP's strip_tags(), use HTMLPurifier, which will allow you to implement all sorts of rules, safely.

To answer your question anyway, you could use a negative lookahead assertion (?!...) to exclude things from matching:

preg_replace('#<(?!/?(a|b|i|div)\b)[^>]+>#', '', $text);

But keep in mind that this is not a very reliable approach. Filtering tag names is the easy part. For complete sanitization you'd also have to clean up attributes, which is where it becomes complicated. Try HTMLPurifier, which already contains heaps of regular expressions to do so.
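The lookahead idea above can be sketched as a small helper that builds the pattern from a list of allowed names. This is only an illustration: the function name strip_unlisted_tags is hypothetical, and the second pattern for the [tag] convention is an assumption that simply mirrors the <tag> pattern from the answer. Attributes are not sanitized here.

```php
<?php
// Hypothetical helper (not from the original answer): remove every <tag>
// or [tag] whose name is not in $allowed. Attributes are left untouched.
function strip_unlisted_tags(string $text, array $allowed): string
{
    // Escape each name for use inside a pattern delimited by "#".
    $quoted = array_map(function ($name) {
        return preg_quote($name, '#');
    }, $allowed);
    $names = implode('|', $quoted);

    // (?!/?(?:names)\b) refuses to match when an allowed name, optionally
    // preceded by "/", directly follows the opening bracket.
    $patterns = [
        '#<(?!/?(?:' . $names . ')\b)[^>]+>#i',     // <tag> convention
        '#\[(?!/?(?:' . $names . ')\b)[^\]]+\]#i',  // [tag] convention (assumed)
    ];

    return preg_replace($patterns, '', $text);
}

$sample = '<b>bold</b> <u>under</u> [i]it[/i] [quote]q[/quote]';
echo strip_unlisted_tags($sample, ['b', 'i']), "\n";
// prints: <b>bold</b> under [i]it[/i] q
```

The \b after the name list matters: without it, allowing b would also let <br> or <body> through.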

$wl = '(?!(?:b|tr|td)\b)';   // whitelist in group

$rxtags = '
<
(?:
    (?:
       (?:
           (?:' ."$wl". 'script|' ."$wl". 'style) \s*
         | (?:' ."$wl". 'script|' ."$wl". 'style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:' ."$wl". 'script|' ."$wl". 'style)\s*
    )
 |
    (?:
        /?' ."$wl". '\w+\s*/?
      | '   ."$wl". '\w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>';

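In Perl the substitution is:

s/$rxtags//xsg

In PHP, pass the pattern to preg_replace with the x (expanded) and s (dot also matches newlines) modifiers; preg_replace already replaces every match, so no g modifier is needed. Because $rxtags contains unescaped / characters, use a delimiter other than /, for example "~$rxtags~xs".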

The concatenation needs no change for PHP: the . operator joins strings there as well, so the ' . "$wl" . ' pieces work as written.
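A minimal sketch of running the pattern above from PHP. The assumptions are the ~ delimiter (chosen because the pattern itself contains /), the sample string, and the $clean variable name; none of these come from the original answer. PHP joins strings with the . operator.

```php
<?php
// Whitelist lookahead from the answer: tags named here are never matched.
$wl = '(?!(?:b|tr|td)\b)';

// Same pattern as in the answer; it relies on the x modifier, so the
// whitespace and line breaks inside it are ignored by the regex engine.
$rxtags = '
<
(?:
    (?:
       (?:
           (?:' . $wl . 'script|' . $wl . 'style) \s*
         | (?:' . $wl . 'script|' . $wl . 'style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
       )> .*? </(?:' . $wl . 'script|' . $wl . 'style)\s*
    )
 |
    (?:
        /?' . $wl . '\w+\s*/?
      | ' . $wl . '\w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
      | !(?:DOCTYPE.*?|--.*?--)
    )
)
>';

// Made-up sample input for illustration.
$text = '<table><tr><td><b>x</b> <span class="a">y</span></td></tr></table>'
      . '<script>alert(1)</script>';

// "~" as delimiter avoids clashing with the "/" inside the pattern.
$clean = preg_replace("~$rxtags~xs", '', $text);
echo $clean, "\n";
// prints: <tr><td><b>x</b> y</td></tr>
```

Note that script and style elements are removed together with their contents, while other non-whitelisted tags lose only the tag itself.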