Extracting info from html using PHP(XPath), PHP/Python(Regexp) or Python(XPath)

Developer https://www.devze.com 2022-12-08 20:34 Source: Internet
I have 40k+ HTML documents that I need to extract information from. I tried PHP + Tidy (because most files are not well-formed) + DOMDocument + XPath, but it is extremely slow. I have been advised to use regexp, but the HTML files are not marked up semantically (table-based layout, with meaningless tags/classes used everywhere) and I don't know where I should start.

Just out of curiosity, is using regexp (PHP/Python) faster than using Python's XPath library? Is the XPath library for Python generally faster than PHP's counterpart?


If speed is a requirement, have a look at lxml. lxml is a Pythonic binding for the libxml2 and libxslt C libraries. Using the C libraries is much faster than any pure PHP or Python version.
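As a rough illustration of the lxml approach, here is a minimal sketch that parses deliberately malformed, table-based markup and pulls out label/value pairs with XPath. The sample markup, the `extract_rows` helper, and the two-cells-per-row layout are assumptions made up for this example, not something taken from the question's actual files.

```python
from lxml import html

# Hypothetical sample: table layout with unclosed <td>, <tr>, <body>, <html>.
BROKEN_HTML = """
<html><body>
<table>
  <tr><td class="x1">Title<td class="x2">Widget A
  <tr><td class="x1">Price<td class="x2">9.99
</table>
"""


def extract_rows(markup):
    """Return (label, value) pairs from table rows that have exactly two cells."""
    # lxml.html uses libxml2's HTML parser, which recovers from bad markup,
    # so a separate Tidy pass is not needed in this sketch.
    tree = html.fromstring(markup)
    rows = []
    for tr in tree.xpath("//tr"):
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        if len(cells) == 2:
            rows.append((cells[0], cells[1]))
    return rows


result = extract_rows(BROKEN_HTML)
print(result)
```

With non-semantic markup the XPath expressions usually have to key on position or on neighbouring label text rather than on class names, so the expression above would need adapting to the real documents.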

There are some impressive benchmarks from Ian Bicking:

In Conclusion

I knew lxml was fast before I started these benchmarks, but I didn’t expect it to be quite this fast.

Parsing results (chart): http://1.2.3.9/bmi/blog.ianbicking.org/wp-content/uploads/images/parsing-results.png


You might give Beautiful Soup in Python a try. It's a pretty great parser for generating a usable DOM out of garbage HTML. That with some regex skills might get you what you need. Happy hunting!
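A minimal sketch of that combination, assuming the `bs4` package (Beautiful Soup 4) with Python's built-in `html.parser` backend: Beautiful Soup flattens the messy markup to text, and regular expressions then pick out the fields. The sample markup, the field names, and the `extract_fields` helper are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample: presentational tags, nothing closed properly.
MESSY_HTML = """
<table><tr><td><font size=2><b>Order no:</b> 12345<br>
<tr><td><font size=2><b>Total:</b> $ 87.50
</table>
"""


def extract_fields(markup):
    """Flatten the markup to text, then pull fields out with regexes."""
    soup = BeautifulSoup(markup, "html.parser")
    # Join all text nodes with single spaces, dropping surrounding whitespace.
    text = soup.get_text(" ", strip=True)
    order = re.search(r"Order no:\s*(\d+)", text)
    total = re.search(r"Total:\s*\$\s*([\d.]+)", text)
    return {
        "order": order.group(1) if order else None,
        "total": total.group(1) if total else None,
    }


fields = extract_fields(MESSY_HTML)
print(fields)
```

Running the regexes over extracted text instead of raw HTML keeps the patterns short and makes them independent of how the tags happen to be nested.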

In my subjective experience, most comparable operations are faster in Python than in PHP. This is partly because Python compiles source to bytecode and caches it instead of reinterpreting the source on every run, and partly because its contributors have optimized it for greater efficiency...

Still, for 40k+ documents, find a nice fast machine ;-)


As the previous post mentions, Python is in general faster than PHP due to byte-code compilation (those .pyc files). Also, a lot of DOM/SAX parsers use a fair bit of regexp internally anyway. Those who told you to use regexp need to be told that it is not a magic bullet. For 40k+ documents I would recommend parallelizing the task using the newer multi-threading support or the classic Parallel Python.
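One way to parallelize the job could look like the sketch below. It uses the standard library's `multiprocessing.Pool` (worker processes, not threads, and not Parallel Python), which may or may not be what this answer had in mind. The directory name `html_docs`, the helper names, and the title-only regex extraction are placeholders; the per-file function is where an lxml or Beautiful Soup parse would go.

```python
import glob
import os
import re
import sys
from multiprocessing import Pool

# Stand-in extraction step: grab the <title> text with a regex.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(markup):
    """Return the whitespace-normalized <title> text, or None if absent."""
    match = TITLE_RE.search(markup)
    return " ".join(match.group(1).split()) if match else None


def process_file(path):
    """Worker: read one file and return (path, extracted title)."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return path, extract_title(handle.read())


def main(directory):
    paths = sorted(glob.glob(os.path.join(directory, "*.html")))
    if not paths:
        print("no .html files found in", directory)
        return
    # One worker process per CPU core by default; chunksize cuts the
    # per-task overhead when there are tens of thousands of small files.
    with Pool() as pool:
        for path, title in pool.imap_unordered(process_file, paths, chunksize=100):
            print(path, title)


if __name__ == "__main__":
    # "html_docs" is a hypothetical default directory.
    main(sys.argv[1] if len(sys.argv) > 1 else "html_docs")
```

Because each document is parsed independently, the work splits cleanly across processes, and the speedup is bounded mainly by the number of cores and by disk throughput.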
