Regex matching items following a header in HTML

What should be a fairly simple regex extraction is confounding me. Couldn't find a similar question on SO, so happy to be pointed to one if it exists. Given the following HTML:

<h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>

<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>

(amongst a larger document - the extracts will most probably run across multiple lines)

How can I construct a regular expression that finds the text within the A tags, within the first P following an H1? The regex will go in a loop, such that I can pass in the header, in order to retrieve the items that follow.

<a[^>]*>([0-9.]+?)</a> obviously matches every item inside an a tag (and should be fine, as a tags cannot be nested), but I can't tie them to an H1.

.+Title One.+<a[^>]*>([0-9.]+?)</a></p> fails.

I also tried a lookbehind, like so:

(?<=Title One.+)<a[^>]*>([0-9.]+?)</a></p> and some variations, but a lookbehind is only allowed for fixed-width patterns (which won't be the case here).
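To illustrate the fixed-width restriction, here is a minimal sketch that only checks whether a pattern compiles with Python's standard re module. The helper name compiles and the second, fixed-width pattern are made up for this illustration and are not part of the original attempt.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# The lookbehind from the question: ".+" makes its width variable.
variable_width = r'(?<=Title One.+)<a[^>]*>([0-9.]+?)</a></p>'

# An illustrative lookbehind whose width is fixed (a literal string).
fixed_width = r'(?<=</h1><p>)<a[^>]*>([0-9.]+?)</a>'

print(compiles(variable_width))
print(compiles(fixed_width))
```

The first pattern is rejected at compile time, before any matching happens, so no amount of tweaking the input will make it work with re.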

For context, this will be using Python's regex engine. I know regex isn't necessarily the best solution for this, so alternative suggestions using DOM or something else are also gratefully received :)

Update

To clarify from the above, I'd like to get back the following:

{"Title One": ["40.5", "31.3"], "Title Two": ["12.1", "82.0"]}

(not that I need help composing the dictionary, but it does demonstrate how I need the values to be related to the title).

So far BeautifulSoup looks like the best shot. LXML will also probably work as the source HTML isn't really tag-soup - it's pretty well-structured, at least in the places I'm interested in.

Is this the kind of thing you're after?

>>> from lxml import etree
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> d.xpath('//h1/following-sibling::p[1]/a/text()')
['40.5', '31.3', '12.1', '82.0']

This solution uses lxml.etree and an xpath expression.

Update

>>> from lxml import etree
>>> from pprint import pprint
>>>
>>> data = """
... <h1 class="title">Title One</h1><p><a href="#">40.5</a><a href="#">31.3</a></p>
... <h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
... """
>>>
>>> d = etree.HTML(data)
>>> #d.xpath('//h1[following-sibling::*[1][local-name()="p"]]') 
...
>>> results = {}
>>> for h in d.xpath('//h1[following-sibling::*[1][local-name()="p"]]'):
...   r = results.setdefault(str(h.text),[])
...   r += [ str(x) for x in h.xpath('./following-sibling::*[1][local-name()="p"]/a/text()') ]
...
>>> pprint(results)
{'Title One': ['40.5', '31.3'], 'Title Two': ['12.1', '82.0']}

This version uses predicates to look ahead, so it should iterate only over <h1> tags that are immediately followed by a <p> tag. (I cast the text values to str explicitly because, as I recall, they aren't normal strings and you'd have trouble pickling them, etc.)

You're right, regex is absolutely the wrong tool for HTML matching.

Your question, however, sounds exactly like a job for Beautiful Soup, an HTML parser that can deal with less-than-perfect HTML.

The other obvious answer to solve this problem is BeautifulSoup -- I like that it handles the kind of crappy html that you often run into out in the wild as sensibly and gracefully as you can hope.
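For what it's worth, a rough sketch of how that might look with Beautiful Soup. It assumes the third-party beautifulsoup4 package (imported as bs4) and the standard library's html.parser backend. The function name values_by_title is made up for the example.

```python
from bs4 import BeautifulSoup  # assumes the third-party beautifulsoup4 package

html = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

def values_by_title(markup):
    """Map each <h1> text to the <a> texts in the next sibling <p> (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    results = {}
    for h1 in soup.find_all("h1"):
        # Next <p> at the same level; it need not be immediately adjacent.
        p = h1.find_next_sibling("p")
        if p is None:
            continue
        results[h1.get_text(strip=True)] = [
            a.get_text(strip=True) for a in p.find_all("a")
        ]
    return results

print(values_by_title(html))
```

Note that find_next_sibling("p") takes the next <p> sibling even if other tags sit in between, so a header with no paragraph of its own would pick up the paragraph of a later header. Whether that matters depends on the real document.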

Don't use regex to parse HTML. HTML is not a regular language, so in general that can't be done. Use an HTML parser instead. I suggest lxml.html.

lxml.html deals with badly formed HTML better than BeautifulSoup, is actively maintained (which, when this was written, BeautifulSoup wasn't) and is a lot faster since it uses libxml2 internally.
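Since the question mentions passing the header in from a loop, here is a small sketch along those lines using lxml.html. The helper name values_after is made up for the example, and it assumes the header text matches exactly once surrounding whitespace is ignored. The title is passed as an XPath variable rather than pasted into the expression, which avoids quoting problems.

```python
import lxml.html  # third-party lxml package

html = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

def values_after(doc, title):
    """Return the <a> texts in the first <p> sibling after the matching <h1> (hypothetical helper)."""
    found = doc.xpath(
        '//h1[normalize-space(.)=$title]/following-sibling::p[1]/a/text()',
        title=title,  # bound to $title by lxml
    )
    # Cast lxml's string results to plain str.
    return [str(text) for text in found]

doc = lxml.html.document_fromstring(html)
for title in ("Title One", "Title Two"):
    print(title, values_after(doc, title))
```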

Here's a way using just normal string manipulation

html='''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

# Split on the closing tag; each chunk that contains a link ends with its text.
for i in html.split("</a>"):
    if "<a href" in i:
        print(i.split("<a href")[-1].split(">")[-1])

output

$ python test.py
40.5
31.3
12.1
82.0

I don't actually understand what you want to get, but if your requirement is SIMPLE, then yes, a regex or a bit of string manipulation can do it. You don't necessarily need a parser for that.
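If the markup really is as regular as the sample, a two-step regex can tie the values to a header. This is only a sketch under that assumption: it expects the <p> to follow the </h1> with nothing but whitespace in between, and the function name numbers_for_title is made up for the example.

```python
import re

html = '''
<h1 class="title">Title One</h1><p><a href="#">40.5</a>
<a href="#">31.3</a></p>
<h1 class="title alternate">Title Two</h1><p><a href="#">12.1</a><a href="#">82.0</a></p>
'''

def numbers_for_title(markup, title):
    """Return the numbers in the <p> right after the <h1> with this title (hypothetical helper)."""
    # Step 1: capture the body of the <p> that directly follows the header.
    block = re.search(
        r'<h1[^>]*>\s*' + re.escape(title) + r'\s*</h1>\s*<p\b[^>]*>(.*?)</p>',
        markup,
        re.DOTALL,  # let "." cross line breaks inside the paragraph
    )
    if block is None:
        return []
    # Step 2: extract the link texts from that paragraph only.
    return re.findall(r'<a[^>]*>([0-9.]+)</a>', block.group(1))

results = {t: numbers_for_title(html, t) for t in ("Title One", "Title Two")}
print(results)
```

Splitting the work into two searches sidesteps the lookbehind restriction: the first match pins down the paragraph, and the second only ever sees that paragraph's contents.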