I'm using the Perl command-line utility xpath to extract data from some HTML as follows:
#!/bin/bash
echo "$HTML" | xpath -q -e "//h2[1]"
The HTML is malformed, which causes xpath to throw the following error:
not well-formed (invalid token) at line X, column Y, byte Z:
I can't really fix the HTML since it's provided by an external source, which means that every time the HTML changes I would have to fix it manually again.
I looked at the xpath man page, which is pretty empty: http://www.linuxcertif.com/man/1/xpath.1p/
I was wondering whether there is a way to tell xpath to ignore the malformed HTML. To give you an idea of how malformed it is, here are a few lines from the source code:
<div id="header-background" style="top: 42px; > </div> <---- missing closing "
<div id-"page-inner"> <---- - instead of =
Thanks
Try HTML::TreeBuilder::XPath, which uses an HTML parser to build a document tree that can then be queried with XPath expressions. Unlike a strict XML parser, an HTML parser should cope with malformed markup.
Also see this article on HTML Scraping with XPath.
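To illustrate why a tolerant HTML parser helps here, below is a minimal sketch. It is written in Python with the standard library's html.parser, not the Perl module suggested above, so treat it as an illustration of the idea only. The names FirstH2Parser, first_h2_text and MALFORMED are made up for this example, the sample markup reuses the two broken lines from the question plus some invented headings, and it simply takes the first h2 in document order, which is a simplification of what //h2[1] means in XPath.

```python
from html.parser import HTMLParser

# Sample input (hypothetical): the two broken lines quoted in the question,
# followed by made-up headings and an unclosed <p>.
MALFORMED = """<div id="header-background" style="top: 42px; > </div>
<div id-"page-inner">
<h2>First heading</h2>
<p>Unclosed paragraph
<h2>Second heading</h2>
</div>
"""


class FirstH2Parser(HTMLParser):
    """Collect the text of the first <h2> seen, tolerating broken markup."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False   # currently inside the first <h2>
        self.done = False    # the first <h2> has already been closed
        self.parts = []      # text fragments found inside it

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and not self.done:
            self.in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_h2:
            self.in_h2 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h2:
            self.parts.append(data)


def first_h2_text(html):
    """Return the stripped text of the first <h2> in document order."""
    parser = FirstH2Parser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(first_h2_text(MALFORMED))
```

The parser does not raise on the missing quote or on id-"page-inner"; it just recovers as best it can and carries on, which is the behaviour you want when you cannot fix the source HTML yourself.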
xml_grep, a command-line tool that comes with XML::Twig, can be used to extract data from HTML using XPath. Normally it works on XML, but you can use the -html option to process HTML (under the hood it uses HTML::TreeBuilder to convert the HTML to XML).
For example:
> xml_grep -html -t 'a[@class="genu"]' http://stackoverflow.com
> Stack Exchange