Regex (or better suggestion) on html with correct nesting

Developer https://www.devze.com 2023-02-14 10:00 Source: Internet
I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.

I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
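To make the failure concrete, here is a minimal PHP sketch that runs the pattern above against the nested example. The `~` delimiters and the `s` modifier are choices made for this illustration; they are not part of the original pattern.

```php
<?php
$html = '<tag>Some text <tag>that is nested</tag> in tags</tag>';

// The lazy (.*?) stops at the first closing tag it meets. That closing tag
// belongs to the inner element, so the outer element is cut short.
preg_match('~<tag\b[^>]*>(.*?)</tag>~s', $html, $m);

echo $m[0], "\n"; // <tag>Some text <tag>that is nested</tag>
echo $m[1], "\n"; // Some text <tag>that is nested
```

Making the quantifier greedy does not help either: with several sibling tags in the content, it would run from the first opening tag to the last closing tag.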

I'm looking for a solution to this, ideally an efficient one. I've seen solutions that involve matching start and end tags separately and keeping track of their indexes in the content to work out which tags go together, but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).

The solution must be PHP only, as this is the language I have to work with. I'm parsing HTML snippets (think body sections from a WordPress blog and you're not too far off). If there is a better-than-regex solution, I'm all ears!

UPDATE:

Just to make it clear, I'm aware regexes are a poor solution, but I have to do it somehow, which is why the title specifically mentions better solutions.

FURTHER UPDATE:

I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document, or is going to add <head> etc. when I get the HTML back out, it's not an acceptable solution.


As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.

Relevant questions

  • Robust and Mature HTML Parser for PHP
  • How do you parse and process HTML/XML in PHP?


Why not just use DOMDocument::loadHTML? It uses libxml under the hood which is fast and robust.
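One way this could look for snippets is sketched below. The function name `outermostTagContents` and the sample markup are hypothetical, made up for this illustration. The sketch assumes PHP 7 or later with the DOM extension enabled. It lets libxml wrap the fragment in its own `<html>`/`<body>` and then serialises only the children of the matched elements, so the wrapper never appears in the result.

```php
<?php
// Hypothetical helper (not from the answer): returns the inner HTML of each
// outermost <$tagName> element found in an HTML fragment.
function outermostTagContents(string $fragment, string $tagName): array
{
    if (trim($fragment) === '') {
        return []; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Snippets are rarely valid documents, so buffer libxml's warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tagName = strtolower($tagName); // the HTML parser lowercases names
    $results = [];
    foreach ($doc->getElementsByTagName($tagName) as $element) {
        // Skip elements nested inside another element of the same name;
        // their markup is already part of the ancestor's result.
        for ($p = $element->parentNode; $p !== null; $p = $p->parentNode) {
            if ($p->nodeName === $tagName) {
                continue 2;
            }
        }
        // Serialise only the children, so the <html>/<body> wrapper that
        // libxml adds around the fragment never reaches the output.
        $inner = '';
        foreach ($element->childNodes as $child) {
            $inner .= $doc->saveHTML($child);
        }
        $results[] = $inner;
    }
    return $results;
}

$snippet = '<p>Intro</p><div>Some text <div>that is nested</div> in tags</div>';
print_r(outermostTagContents($snippet, 'div'));
```

For the sample snippet this should return a single entry holding the outer div's contents, with the nested div intact inside it. Non-ASCII input may need an explicit encoding hint before loading, which the sketch leaves out.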
