I am trying to parse the keywords from Google Suggest. This is the URL:
http://google.com/complete/search?output=toolbar&q=test
I've done it with PHP using:
'|<CompleteSuggestion><suggestion data="(.*?)"/><num_queries int="(.*?)"/></CompleteSuggestion>|is'
But that won't work with Python's re.match(pattern, string). I tried a few variations, but some raise an error and some return None.
How can I parse that info? I don't want to use minidom because I think a regex will be less code.
You could use etree:
>>> from xml.etree.ElementTree import XMLParser
>>> x = XMLParser()
>>> x.feed('<toplevel><CompleteSuggestion><suggestion data=...')
>>> tree = x.close()
>>> [(e.find('suggestion').get('data'), int(e.find('num_queries').get('int')))
...  for e in tree.findall('CompleteSuggestion')]
[('test internet speed', 31800000), ('test', 686000000), ...]
It is more code than a regex, but it also does more. Specifically, it fetches the entire list of matches in one go and unescapes entities such as &quot; (an escaped double quote) in the data attribute. It also won't get confused if additional elements start appearing in the XML.
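For anyone who wants to try this outside the interactive prompt, here is a minimal self-contained sketch of the same idea. The SAMPLE string is a made-up stand-in shaped like the toolbar output (the real response may differ), and parse_suggestions is a hypothetical helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the toolbar output; the real response may differ.
SAMPLE = (
    '<toplevel>'
    '<CompleteSuggestion>'
    '<suggestion data="test internet speed"/>'
    '<num_queries int="31800000"/>'
    '</CompleteSuggestion>'
    '<CompleteSuggestion>'
    '<suggestion data="test &quot;quoted&quot;"/>'
    '<num_queries int="686000000"/>'
    '</CompleteSuggestion>'
    '</toplevel>'
)


def parse_suggestions(xml_text):
    """Return a list of (suggestion, query_count) tuples from the XML text."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.findall('CompleteSuggestion'):
        suggestion = item.find('suggestion')
        num_queries = item.find('num_queries')
        if suggestion is None:
            # Skip entries that lack a <suggestion> element.
            continue
        # Treat a missing <num_queries> element as an unknown count.
        count = int(num_queries.get('int')) if num_queries is not None else None
        results.append((suggestion.get('data'), count))
    return results


print(parse_suggestions(SAMPLE))
# [('test internet speed', 31800000), ('test "quoted"', 686000000)]
```

Note that the second suggestion comes back with real double quotes: the parser has already decoded the &quot; entities.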
See also: RegEx match open tags except XHTML self-contained tags
This is an XML document. Please reconsider and use an XML parser. It will be more robust and will probably take you less time in the end, even if it is more code.
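To illustrate the robustness point, here is a rough comparison sketch. It assumes the PHP pattern from the question is ported to Python as-is, and DOC is an invented one-entry document, not real Google output. It also shows the likely reason for the None results: re.match only matches at the start of the string, where the text is <toplevel>, while re.findall scans the whole string.

```python
import re
import xml.etree.ElementTree as ET

# The PHP pattern from the question, ported to Python (the i and s flags become re.I | re.S).
PATTERN = re.compile(
    r'<CompleteSuggestion><suggestion data="(.*?)"/>'
    r'<num_queries int="(.*?)"/></CompleteSuggestion>',
    re.I | re.S,
)

# Invented one-entry document whose data attribute contains escaped quotes.
DOC = (
    '<toplevel><CompleteSuggestion>'
    '<suggestion data="say &quot;test&quot;"/>'
    '<num_queries int="42"/>'
    '</CompleteSuggestion></toplevel>'
)

# re.match is anchored at position 0, where the text is '<toplevel>', so it finds nothing.
anchored = PATTERN.match(DOC)
print(anchored)  # None

# re.findall scans the whole string, but the entities stay escaped and the count stays a string.
regex_rows = PATTERN.findall(DOC)
print(regex_rows)  # [('say &quot;test&quot;', '42')]

# The XML parser decodes the entities; the count is converted explicitly with int().
root = ET.fromstring(DOC)
parsed_rows = [
    (e.find('suggestion').get('data'), int(e.find('num_queries').get('int')))
    for e in root.findall('CompleteSuggestion')
]
print(parsed_rows)  # [('say "test"', 42)]
```

The regex would also stop matching if whitespace or an extra element appeared between the tags, whereas the parser-based version would keep working.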