Keeping file offsets while parsing HTML with the DOM?

Developer https://www.devze.com 2023-01-24 08:13 Source: Internet
I want to modify <img src=""> attributes in not-too-malformed HTML (WordPress posts). I know I can take the simple way and use regexes, but I'm afraid people in blue furry suits will come haunt me in my sleep.

If I use the DOM parser to read the HTML and modify the <img> tags, I'm afraid I can't reconstruct the post exactly as it was (with only my modification), because the DOM parser will probably do too much cleanup and may remove essential data. A SAX parser probably cannot handle invalid XML, so that won't work either.

So, is there a middle way, where I can use a DOM parser, but one that knows where each element started, so I can do string replacements or something similar from there? I know some nodes in the DOM tree will not exist in the source document (<b>Some <i>bizarre</b> formatting</i> will probably trigger this), but does this mean it is always impossible? I see there is a DOMNode::getLineNo() function added in PHP 5.3, but I'm using 5.2.x.


If PHP's DOM writes "too clean" results, you could try the string-based SimpleHTMLDOM to see whether it's more lenient.

However, with formatting as bizarre as you show, I would never entirely trust the parser to do it "right". But try it out, maybe it just skips such stuff.

The DOM library's DOMNode class has a getLineNo() method. I don't entirely see how this works though, seeing as it doesn't provide an offset to go with it. Not sure whether that'll help your use case.
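
To make the "parser that knows where each element started" idea concrete, here is a rough sketch. It is in Python rather than PHP, purely because Python's standard-library HTMLParser is lenient with broken markup and exposes positions through getpos() and get_starttag_text(). The names ImgSrcLocator, rewrite_img_src and SRC_RE are made up for this illustration, and it assumes the src value is quoted. A small regex is still used, but only inside a single <img> tag that the parser has already delimited; everything outside those tags is copied through byte for byte.

```python
import re
from html.parser import HTMLParser

# Matches a quoted src attribute inside ONE already-isolated tag.
# The lookbehind keeps attributes such as data-src from matching.
SRC_RE = re.compile(r'(?<![\w-])(src\s*=\s*)(["\'])(.*?)\2',
                    re.IGNORECASE | re.DOTALL)


class ImgSrcLocator(HTMLParser):
    """Records the absolute (start, end) offsets of every <img> start tag."""

    def __init__(self, source):
        super().__init__()
        # Offset of the first character of each line, so that the
        # (line, column) pair from getpos() can be turned into one index.
        self.line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self.line_starts.append(i + 1)
        self.spans = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing <img ... /> via handle_startendtag.
        if tag != "img":
            return
        line, col = self.getpos()            # line is 1-based, col is 0-based
        start = self.line_starts[line - 1] + col
        raw = self.get_starttag_text()       # tag exactly as written in the source
        self.spans.append((start, start + len(raw)))


def rewrite_img_src(source, rewrite):
    """Return source with rewrite(url) applied to each <img> src, nothing else changed."""
    locator = ImgSrcLocator(source)
    locator.feed(source)
    locator.close()

    out = source
    # Work backwards so earlier offsets stay valid after each splice.
    for start, end in reversed(locator.spans):
        raw = source[start:end]
        new = SRC_RE.sub(
            lambda m: m.group(1) + m.group(2) + rewrite(m.group(3)) + m.group(2),
            raw,
            count=1,
        )
        out = out[:start] + new + out[end:]
    return out
```

Calling rewrite_img_src(post_html, some_function) returns the post with only the src values changed, so mis-nested markup like the <b>/<i> example is never re-serialized. Whether a comparable position-aware tokenizer is available for PHP 5.2 is a separate question; this only shows the shape of the approach.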
