
XML parser vs regex

开发者 (Developer) https://www.devze.com 2023-03-28 08:42 Source: web

What should I use?

I am going to fetch links, images, text, etc. and use them to build SEO statistics and an analysis of the page.

Which do you recommend: an XML parser or regex?

I have been using regex and have never had any problems with it. However, I keep hearing from people that there are things it cannot do. To be honest, I don't know why, but I am wary of using an XML parser and prefer regex (it works and serves the purpose pretty well).

So, if everything is working well with regex, why am I here asking what to use? Even though everything has been fine so far, that doesn't mean it will stay that way, so I wanted to know: what are the benefits of using an XML parser over regex? Are there improvements in performance, is it less error-prone, is there better support, are there other nice features, etc.?

If you do suggest using an XML parser, which one is recommended for use with PHP?

I would most definitely like to know why you would pick one over the other.


What should I use?

You should use an XML Parser.

If you do suggest using an XML parser, which one is recommended for use with PHP?

See: Robust and Mature HTML Parser for PHP.


If you're processing real-world (X)HTML, then you'll need an HTML parser, not an XML parser. XML parsers are required to stop parsing as soon as they hit a well-formedness error, and with most HTML that will happen almost immediately.
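
To make the difference concrete, here is a minimal sketch. It is in Python rather than PHP, purely for illustration, and it uses only the standard library; the sample markup and the helper names (xml_parse_ok, html_tags) are made up for this example. Note that Python's html.parser is a tolerant tokenizer rather than a full browser-grade tree builder, but it shows the contrast: the strict XML parser rejects the snippet, the HTML parser reads through it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world HTML: <br> and <img> are never closed,
# so this snippet is not well-formed XML.
PAGE = '<p>Hello<br>world <img src="a.png"></p>'


def xml_parse_ok(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the name of every start tag the HTML parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    """Return the start tags found by the tolerant HTML parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print("XML parser accepts it:", xml_parse_ok(PAGE))
    print("HTML parser saw tags:", html_tags(PAGE))
```

In PHP, I believe the closest built-in equivalent is DOMDocument::loadHTML() rather than the XML loading functions, but check the manual for how it reports malformed markup.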

The point against regex for processing HTML is that it isn't reliable. For any regex, there will be HTML pages that it will fail on. HTML parsers are just as easy to use as regex and process HTML much as a browser does, so they are far more reliable, and there's rarely any reason not to use one.
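
As a rough illustration of both points (again a Python standard-library sketch, not PHP; the sample page, the NAIVE_HREF pattern and the class name are invented for this example), here is a regex of the "works so far" kind next to a small parser-based extractor for links and images. The regex only catches the first link, because the others use single quotes, no quotes, uppercase, or a different attribute order; the parser normalizes all of that.

```python
import re
from html.parser import HTMLParser

PAGE = """
<a href="/one">one</a>
<a class=nav href='/two'>two</a>
<A HREF=/three>three</A>
<img src="pic.png" alt="logo">
"""

# Naive pattern: assumes lowercase, href first, double quotes.
NAIVE_HREF = re.compile(r'<a href="([^"]*)"')


class LinkImageCollector(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names for us.
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])


def extract(markup):
    """Return (links, images) found in the markup."""
    collector = LinkImageCollector()
    collector.feed(markup)
    collector.close()
    return collector.links, collector.images


if __name__ == "__main__":
    print("regex :", NAIVE_HREF.findall(PAGE))
    print("parser:", extract(PAGE))
```

You can of course keep patching the regex for each new variation, but that is exactly the work the parser has already done.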

One possible exception is sampling for statistical purposes. Suppose you're going to scan 100,000 web pages for a fairly simple pattern, for example the presence of a particular attribute, and return the percentage of matching pages. While even a well-designed regex will likely produce both false positives and false negatives, they are unlikely to affect the overall percentage by very much. You may be able to accept those false matches for the benefit that a regex scan is likely to run more quickly than a full parse of each page. You can then reduce the number of false positives by running a parse only on the pages which return a regex match.
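
The two-stage idea could be sketched like this (Python standard library, for illustration only; the hreflang attribute, the QUICK pattern and the function names are assumptions picked for the example, not anything from the question). The regex is the cheap first pass over every page, and the parser runs only on the pages the regex flagged. Note that the second pass can only remove false positives; pages the regex missed are never parsed.

```python
import re
from html.parser import HTMLParser

# Cheap first pass: does the raw text mention the attribute at all?
QUICK = re.compile(r'\bhreflang\s*=', re.IGNORECASE)


class AttrFinder(HTMLParser):
    """Set .found when any start tag carries the given attribute."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.found = False

    def handle_starttag(self, tag, attrs):
        if any(key == self.name for key, _ in attrs):
            self.found = True


def has_attr(markup, name="hreflang"):
    """Confirm with a real parse that some tag has the attribute."""
    finder = AttrFinder(name)
    finder.feed(markup)
    finder.close()
    return finder.found


def percent_with_attr(pages):
    """Percentage of pages with the attribute, regex first, parse second."""
    if not pages:
        return 0.0
    candidates = [p for p in pages if QUICK.search(p)]  # fast scan of all pages
    confirmed = [p for p in candidates if has_attr(p)]  # parse only the hits
    return 100.0 * len(confirmed) / len(pages)


if __name__ == "__main__":
    pages = [
        '<link rel="alternate" hreflang="de" href="/de">',
        '<p>no attribute here</p>',
        '<p>Set hreflang="en" on your links</p>',  # regex false positive
        '<A HREFLANG=fr HREF="/fr">fr</A>',
    ]
    print(percent_with_attr(pages))
```

Whether this is actually faster than parsing everything depends on how many pages match, so it is worth measuring on a sample before committing to it.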

To see the kind of problems that will cause difficulties for regexes see: Can you provide some examples of why it is hard to parse XML and HTML with a regex?


It sounds to me as if you are doing screen-scraping. This is inevitably a somewhat heuristic process - you're looking for patterns that commonly occur in the web pages of interest, and you're inevitably going to miss a few of them, and you don't really mind. For example, you don't really care that your search for img tags will also find an img tag that happens to be commented out. If that characterizes your application, then the usual strictures against using regular expressions for processing HTML or XML might not apply to your case.
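
For what it's worth, the commented-out img case looks like this in practice (a small Python standard-library sketch for illustration; the sample markup, the IMG_SRC pattern and the helper name are made up). The regex reports both images, the parser only the live one. Whether that difference matters is exactly the judgment call described above.

```python
import re
from html.parser import HTMLParser

PAGE = '<img src="live.png"><!-- <img src="old.png"> -->'

# Heuristic pattern: any <img ...> with a double-quoted src.
IMG_SRC = re.compile(r'<img[^>]*\bsrc="([^"]*)"', re.IGNORECASE)


class ImgCollector(HTMLParser):
    """Collect src values of <img> tags that are real markup, not comments."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def parsed_images(markup):
    """Return image sources as seen by the HTML parser."""
    collector = ImgCollector()
    collector.feed(markup)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    print("regex :", IMG_SRC.findall(PAGE))
    print("parser:", parsed_images(PAGE))
```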
