
python regex retrieve only one group

https://www.devze.com 2023-01-16 18:49 Source: the web

I have just a little experience with regex, and now I have a small problem.

I need to retrieve the strings between the <a> tags.

So here is a sample :

Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>

And this is my little regex:

re.findall("""<a href=".*">.*</a>""",string)

Well, it works, but I just want the strings between the <a> tags, not the href. How could I do this?

Thanks.


Use parentheses to form a capturing group:

'<a href=".*">(.*)</a>'

You probably also want to use non-greedy quantifiers, because a greedy .* will match far more than you intended.

'<a href=".*?">(.*?)</a>'

Result:

['2', 'nissan', 'all']
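
To see the difference between the two patterns, here is a small sketch run against the sample line from the question. The variable names (html, greedy, lazy) are only for illustration.

```python
import re

# Sample input from the question, split across lines for readability.
html = ('Categories: <a href="/car/2/page1.html">2</a>, '
        '<a href="/car/nissan/">nissan</a>,'
        '<a href="/car/all/page1.html">all</a>')

# Greedy quantifiers: the first .* runs as far as it can, so the whole
# line becomes a single match and only the last link text is captured.
greedy = re.findall(r'<a href=".*">(.*)</a>', html)

# Non-greedy quantifiers: each .*? stops at the first possible point,
# so every link is matched separately.
lazy = re.findall(r'<a href=".*?">(.*?)</a>', html)

print(greedy)  # ['all']
print(lazy)    # ['2', 'nissan', 'all']
```

With a single capturing group, re.findall returns just the group contents instead of the whole match, which is why the href part disappears from the result.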

Or even better, consider using an HTML parser, such as BeautifulSoup.
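
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser module instead of BeautifulSoup. This is only a minimal illustration, and the class name LinkTextParser is made up for this example.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collects the text inside every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True while we are inside an <a href=...>
        self.texts = []       # one entry per link found

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.in_link = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current entry.
        if self.in_link:
            self.texts[-1] += data


html = ('Categories: <a href="/car/2/page1.html">2</a>, '
        '<a href="/car/nissan/">nissan</a>,'
        '<a href="/car/all/page1.html">all</a>')

parser = LinkTextParser()
parser.feed(html)
parser.close()
print(parser.texts)  # ['2', 'nissan', 'all']
```

It is more code than the regex, but it does not care about attribute order or whitespace inside the tag.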


Regex is never a good idea for parsing HTML. There are too many edge cases that make crafting a robust regular expression difficult. Consider the following perfectly browser-viewable links:

<a href='/car/all/page1.html'>all</a>
<a  href="/car/all/page1.html">all</a>
<a href= "/car/all/page1.html">all</a>
<a id="foo" href="/car/all/page1.html">all</a>
<a
 href="/car/all/page1.html">all</a>

None of these is matched by the given regular expression. I highly recommend an HTML parser, such as Beautiful Soup or lxml. Here's an lxml example:

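from lxml import etree

html = """
Categories: <a href="/car/2/page1.html">2</a>, <a href="/car/nissan/">nissan</a>,<a href="/car/all/page1.html">all</a>
"""
# etree.HTML uses a lenient HTML parser that tolerates malformed markup
doc = etree.HTML(html)
# Select the text nodes of every <a> element that has an href attribute
result = doc.xpath('//a[@href]/text()')
print(result)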

Result:

['2', 'nissan', 'all']

This holds even if the HTML is formatted differently or is somewhat malformed.
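
The claim that small variations slip past the regex can be checked directly. The sketch below runs the non-greedy pattern from the first answer over four of the variants listed earlier; the names pattern, variants and results are only for illustration.

```python
import re

# The non-greedy pattern suggested in the first answer.
pattern = re.compile(r'<a href=".*?">(.*?)</a>')

variants = [
    '<a  href="/car/all/page1.html">all</a>',          # two spaces after <a
    '<a href= "/car/all/page1.html">all</a>',          # space after the =
    '<a id="foo" href="/car/all/page1.html">all</a>',  # another attribute first
    '<a\n href="/car/all/page1.html">all</a>',         # newline inside the tag
]

# findall returns an empty list whenever the pattern fails to match.
results = [pattern.findall(v) for v in variants]

for variant, found in zip(variants, results):
    print(repr(variant), '->', found)
```

Every variant prints an empty list, because the pattern insists on the literal text <a href=" with exactly one space and nothing in between.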
