Advantages of HTML Purifier over regex filtering

Developer https://www.devze.com 2023-01-10 19:25 Source: the web

We have recently implemented HTML Purifier in our web-based application. Earlier we used regexes to match commonly known XSS injections (script, img, etc.). We realized that this wasn't good enough and hence moved to HTML Purifier. Now, given that HTML Purifier is slow (very slow compared to the regex method we had earlier), is it really worth keeping? Or does it make sense to keep extending the regex filtering until we reach a satisfactory level (though it might be argued that the speed benefit would be nullified by then)? Has anyone else faced similar security issues with their web application, and what did you do in the end?

Please let me know if anything seems vague; I would be happy to provide more details.


Using a regex for HTML/JavaScript? Perhaps you have not seen this epic answer by bobince. In short, if you use a regex then you have two problems. In fact, the reason HTML Purifier is so slow is that it makes hundreds of calls to preg_match() and preg_replace() in order to clean a message. You should never reinvent the wheel; your own version will without a doubt be less secure.

The real question is htmlspecialchars($var, ENT_QUOTES); vs. HTML Purifier. HTML Purifier is not only slow, it has been hacked many times. Don't use HTML Purifier unless there is no other choice; htmlspecialchars solves most problems, and it solves them in a way that cannot be bypassed.
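
As a rough illustration of the escape-everything approach, here is a minimal sketch in Python (used here only for illustration; the answer itself refers to PHP's htmlspecialchars). Python's `html.escape` with `quote=True` plays a similar role to htmlspecialchars with ENT_QUOTES. The `render_comment` helper and the payload are hypothetical.

```python
import html


def render_comment(user_input: str) -> str:
    """Hypothetical helper: escape user text before placing it in HTML."""
    # quote=True also escapes " and ', which matters inside attribute values.
    safe = html.escape(user_input, quote=True)
    return f'<p title="{safe}">{safe}</p>'


# A payload that tries to break out of the attribute and open a script tag.
payload = '"><script>alert(1)</script>'
print(render_comment(payload))
```

This only fits when users are not supposed to submit any markup at all. Once some HTML has to be allowed through, escaping alone no longer does the job.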


The problem with regexes is that filtering HTML is too complex a task to be able to do easily, or elegantly, with regexes without creating a big mess.
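
One small illustration of how quickly this goes wrong, sketched in Python: the pattern below is an assumed stand-in for a "strip script tags" rule, not the filter from the question. Removing the inner match splices the surrounding fragments into a fresh script tag.

```python
import re

# Assumed naive rule: delete anything shaped like <script ...>...</script>.
SCRIPT_BLOCK = re.compile(r"<script.*?>.*?</script>", re.IGNORECASE | re.DOTALL)


def naive_strip_scripts(html_text: str) -> str:
    """Remove script blocks in a single pass over the input."""
    return SCRIPT_BLOCK.sub("", html_text)


# The inner <script></script> is removed, which joins "<scr" and "ipt>".
nested = "<scr<script></script>ipt>alert(1)</script>"
print(naive_strip_scripts(nested))  # -> <script>alert(1)</script>

# An event-handler attribute contains no script tag, so it passes untouched.
handler = "<img src=x onerror=alert(1)>"
print(naive_strip_scripts(handler))  # -> <img src=x onerror=alert(1)>
```

Looping until the output stops changing fixes the first case but not the second, which is how such filters tend to grow one special case at a time.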

You need to build something that actually understands HTML, can operate on it as HTML, and knows how a browser is going to interpret something. Regexes operate on it as if it's just one big long string. They're not good or elegant at parsing HTML in a stateful manner, for example recognising that the current match is within a comment, or within an attribute, or within an element, etc. It's just really complicated to emulate that in regexes.
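
To make the "stateful" point concrete, here is a small Python sketch using the standard library's `html.parser`. The `ContextRecorder` class and the sample string are made up for illustration, and this is not a sanitizer. It only shows that a parser reports where a piece of text sits, while a regex sees one flat string.

```python
import re
from html.parser import HTMLParser


class ContextRecorder(HTMLParser):
    """Hypothetical helper: record the context the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for real elements only; attrs is a list of (name, value).
        self.events.append(("element", tag, dict(attrs)))

    def handle_comment(self, data):
        # Called for the text between <!-- and -->.
        self.events.append(("comment", data.strip()))


sample = '<p title="<script>">hello</p><!-- <script>x</script> -->'

# The regex reports two "script tags", although neither is an element.
print(len(re.findall(r"<script", sample)))

recorder = ContextRecorder()
recorder.feed(sample)
recorder.close()
for event in recorder.events:
    print(event)
```

The parser sees a single `p` element whose `title` attribute happens to contain the text `<script>`, plus one comment. No script element exists in the input at all.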

The other issue is that 'matching commonly known XSS injections' is way more complex than it sounds. If it isn't, you're not doing it right. Your filter needs to know HTML, it needs to know what a valid URL scheme is and how null bytes work in different parts of HTML etc. Basically, most of the injections on the XSS cheat sheet, for example, are based on getting around filtering done by regex-based filters.
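
As one example of the URL-scheme part, here is a hedged Python sketch. The allowed schemes, function names and payload are assumptions chosen for illustration, and the check is far from complete (it does not deal with null bytes, for instance). It decodes character references and drops tabs and newlines, as browsers do when parsing a URL, before comparing the scheme against an allowlist.

```python
import html
import re

ALLOWED_SCHEMES = {"http", "https", "mailto"}  # assumption: adjust per application
SCHEME = re.compile(r"([a-zA-Z][a-zA-Z0-9+.-]*):")
LEADING_JUNK = "".join(chr(c) for c in range(0x21))  # C0 controls and space


def naive_is_safe(value: str) -> bool:
    """Blocklist check of the kind a regex filter might use."""
    return re.search(r"javascript:", value, re.IGNORECASE) is None


def allowlist_is_safe(value: str) -> bool:
    """Normalise the attribute value roughly as a browser would, then allowlist."""
    decoded = html.unescape(value)  # &#x09; becomes a real tab character
    cleaned = re.sub(r"[\t\n\r]", "", decoded).lstrip(LEADING_JUNK)
    match = SCHEME.match(cleaned)
    if match is None:
        return True  # no scheme: treated as a relative URL
    return match.group(1).lower() in ALLOWED_SCHEMES


payload = "jav&#x09;ascript:alert(1)"
print(naive_is_safe(payload))      # True: the blocklist is fooled
print(allowlist_is_safe(payload))  # False: the scheme is javascript
```

The allowlist rejects anything it does not recognise, so an unfamiliar encoding trick fails closed instead of slipping through.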

One more thing: HTML Purifier is maintained by someone who knows what they're doing. You can trust it, and you can trust that if a new flaw is found in it, it'll be patched. That can save you a lot of work trying to do the same thing on your own and keeping up to date with all of the different patches out there.


It's better to be safe than sorry. There's a whole slew of attacks your regular expressions might not find. For example, here's just a few. If HTML Purifier is too slow, see if caching the purified HTML helps.
