Regex (or better suggestion) on html with correct nesting

开发者 https://www.devze.com 2023-02-14 10:00 Source: Internet

I've had a look and there don't seem to be any old questions that directly address this. I also haven't found a clear solution anywhere else.

I need a way to match a tag, open to close, and return everything enclosed by the tag. The regexes I've tried have problems when tags are nested. For example, the regex <tag\b[^>]*>(.*?)</tag> will cause trouble with <tag>Some text <tag>that is nested</tag> in tags</tag>. It will match <tag>Some text <tag>that is nested</tag>.
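To make the failure concrete, here is a minimal reproduction. The variable names, the `#` delimiters and the `s` modifier are only illustrative choices, not part of any required setup.

```php
<?php
// Illustration only: the lazy (.*?) stops at the FIRST closing tag it meets,
// so the outer <tag> is paired with the inner </tag>.
$html = '<tag>Some text <tag>that is nested</tag> in tags</tag>';

preg_match('#<tag\b[^>]*>(.*?)</tag>#s', $html, $m);

echo $m[0], "\n"; // <tag>Some text <tag>that is nested</tag>
echo $m[1], "\n"; // Some text <tag>that is nested
```

Switching to a greedy `(.*)` only moves the problem: it pairs the first opening tag with the last closing tag, which breaks as soon as the snippet contains two sibling tags.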

I'm looking for a solution to this, ideally an efficient one. I've seen solutions that match start and end tags separately and keep track of their indexes in the content to work out which tags go together, but that seems wildly inefficient to me (if it's the only possible way then c'est la vie).

The solution must be PHP only, as this is the language I have to work with. I'm parsing HTML snippets (think body sections from a WordPress blog and you're not too far off). If there is a better solution than regex, I'm all ears!

UPDATE:

Just to make it clear, I'm aware regexes are a poor solution but I have to do it somehow, which is why the title specifically mentions better solutions.

FURTHER UPDATE:

I'm parsing snippets. Solutions should take this into account. If the parser only works on a full document or is going to add <head> etc... when I get the html back out, it's not an acceptable solution.

As always, you simply cannot parse HTML with regex because it is not a regular language. You either need to write a real HTML parser, or use a real HTML parser (that someone's already written). For reasons that should be obvious, I recommend the latter option.
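As a rough sketch of the "use an existing parser" route, the following assumes PHP's bundled DOM extension (`DOMDocument`, `DOMXPath`) is enabled. The helper name `innerHtmlOfOutermost` is made up for illustration, and the sample uses `<div>` in place of the question's placeholder `<tag>`. It is one possible approach under those assumptions, not the only one. The parser does wrap the snippet in `<html>`/`<body>` internally, but only the children of the matched elements are serialized, so none of that wrapping appears in what you get back.

```php
<?php
// Hypothetical helper (not a library function): returns the inner HTML of every
// outermost <$tagName> element found in an HTML snippet.
// Assumptions: $tagName is a plain lowercase element name, and the input is
// ASCII or already carries an encoding hint that libxml understands.
function innerHtmlOfOutermost(string $snippet, string $tagName): array
{
    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $doc = new DOMDocument();
    $doc->loadHTML($snippet);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select elements of that name that are not nested inside another one.
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//{$tagName}[not(ancestor::{$tagName})]");

    $results = [];
    foreach ($nodes as $node) {
        $inner = '';
        foreach ($node->childNodes as $child) {
            $inner .= $doc->saveHTML($child); // serialize each child node on its own
        }
        $results[] = $inner;
    }
    return $results;
}

// Usage: the nested <div> stays inside the single result instead of cutting it short.
$parts = innerHtmlOfOutermost('<div>Some text <div>that is nested</div> in tags</div>', 'div');
print_r($parts);
```

Dropping the `[not(ancestor::...)]` predicate would return every matching element, nested ones included, each with its own correctly paired contents.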