I have a large, malformed test HTML document that I need to extract some numbers from.
I'd like to get the primary ratio out. I'm using this regular expression:
(?<=Primary ratio</TD><TD>--</TD><TD>).*(?=</TD>)
On this string:
Primary ratio</TD><TD>--</TD><TD>10.52</TD><TD>14.97</TD><TD></TD></TR><TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD><TD>1.13</TD><TD>1.50</TD><TD></TD></TR><TR align='right'><TD align='left'>Secondary Ratio</TD><TD>--</TD><TD>0.70</TD><TD>0.70</TD><TD></TD></TR><TR align='right'><TD align='left'>RM Ratio</TD><TD>--</TD><TD>2.02</TD>
But I get this as a result:
10.52</TD><TD>14.97</TD><TD></TD></TR><TR align='right'><TD align='left'>Flip Ra
tio</TD><TD>-122.81</TD><TD>1.13</TD><TD>1.50</TD><TD></TD></TR><TR align='right
'><TD align='left'>Secondary Ratio</TD><TD>--</TD><TD>0.70</TD><TD>0.70</TD><TD>
</TD></TR><TR align='right'><TD align='left'>RM Ratio</TD><TD>--</TD><TD>2.02
I don't want that; I just want the 10.52 in the first cell after the "--".
It found the start of the string perfectly, but it didn't stop at the first </TD>. What am I doing wrong?
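Replace .* with .*? near the end of your regex; that should stop it from matching too much. Normally .* is greedy: it matches as much as possible while still letting the rest of the pattern succeed, so it runs on to the last </TD> on the line. Adding the ? makes it lazy, so it matches as little as possible and stops at the first </TD>.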
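To see the difference side by side, here is a minimal sketch in Python. The question does not say which regex engine is in use (the HTML Agility Pack suggestion hints at .NET), so Python's `re` module is an assumption made only for illustration; greedy and lazy quantifiers behave the same way in both. The input is a shortened copy of the string from the question.

```python
import re

# Shortened copy of the sample string from the question.
html = (
    "Primary ratio</TD><TD>--</TD><TD>10.52</TD><TD>14.97</TD><TD></TD></TR>"
    "<TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD>"
)

# Fixed-width lookbehind, identical to the one in the question.
prefix = r"(?<=Primary ratio</TD><TD>--</TD><TD>)"

# Greedy: .* grabs everything, then backs off only as far as the LAST </TD>.
greedy = re.search(prefix + r".*(?=</TD>)", html).group(0)

# Lazy: .*? grows one character at a time and stops at the FIRST </TD>.
lazy = re.search(prefix + r".*?(?=</TD>)", html).group(0)

print("greedy:", greedy)
print("lazy:  ", lazy)
```

The greedy match runs through every cell up to the final closing tag, while the lazy match contains only the first value.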
Use an HTML parser instead of a RegEx - the HTML Agility Pack is a good one.
In general, regular expressions are not suitable for usage with HTML, as HTML is not a regular language. This is particularly true if you are working with HTML from different sources. See here for a compelling demonstration.
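As a rough illustration of the parser approach, here is a sketch using Python's standard-library `html.parser` as a stand-in; it is not the HTML Agility Pack API, which is a .NET library. The `RowCollector` class and the sample fragment (with the opening row tag restored) are hypothetical and assume the label sits in the first cell of its row.

```python
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Hypothetical helper: gathers the text of every <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell strings
        self._row = None    # row currently being filled
        self._cell = None   # text pieces of the cell currently open

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case, so <TD> arrives as "td".
        if tag == "tr":
            self._row = []
            self.rows.append(self._row)
        elif tag == "td":
            if self._row is None:   # tolerate a cell with no enclosing row
                self._row = []
                self.rows.append(self._row)
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr":
            self._row = None


# Assumed fragment: the question's rows with the opening tags restored.
html = (
    "<TR align='right'><TD align='left'>Primary ratio</TD><TD>--</TD>"
    "<TD>10.52</TD><TD>14.97</TD><TD></TD></TR>"
    "<TR align='right'><TD align='left'>Flip Ratio</TD><TD>-122.81</TD>"
    "<TD>1.13</TD><TD>1.50</TD><TD></TD></TR>"
)

collector = RowCollector()
collector.feed(html)
collector.close()

# Find the row by its label, then take the cell after the "--" placeholder.
primary = next(r for r in collector.rows if r and r[0] == "Primary ratio")
value = primary[2]
print(value)
```

Looking the row up by its label and indexing into its cells avoids depending on exactly how the surrounding markup is written, which is the main advantage over a pattern anchored to literal tag text.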