For the sake of this question, I'll include a basic example of what I'm trying to do. I have been looking for a method using regex which would allow me to have an input such as this:
<a>$4<br>.00</a>
To match this in one sub-group 4.00
I have tried numerous methods, all being around the lines of:
<a>\$([0-9]+<br>\.[0-9]+)</a>
or
<a>\$([0-9]+(?:<br>)\.[0-9]+)</a>
^-- Excludes <br> from being 开发者_如何学Cplaced in a match group, but it does not
exclude <br> from its parent match group, so we still get 4<br>.00
Both of the methods above match 4<br>.00
My question is: Are there any other Regex operators that allow me to exclude certain sub-expressions from their parent sub-expressions? (Match 4<br>.00
but exclude <br>
giving 4.00
in 1 sub-group)
Is there a replace function in whatever language that is? Something along the lines of:
s.replaceAll( "<.+>", "" )
So that would replace all the tags in your string with the empty string and leave you with what you want.
If you want to use regex, you don't have to really do it in one step. you can break it up into steps. Eg: Get the text from to and save to variable using /<a>(.*?)<\/a>/
. then replace the tags
>>> import re
>>> s="<a>$4<br>.00</a>"
>>> re.sub("<a>(.*?)<\/a>","\\1",s)
'$4<br>.00'
>>> var=re.sub("<a>(.*?)<\/a>","\\1",s)
>>> re.sub("<.*?>","",var)
'$4.00'
I decided to switch to using lxml. Even for minimal HTML parsing needs, lxml did the trick.
精彩评论