I am a beginning programmer trying to parse an HTML file in a Processing sketch. (Incidentally, if you don't know Processing, it compiles to Java and uses the same regex functions.) I have correctly captured the HTML file as a single String using SimpleML. The data I'm trying to capture comes from a table, like so:
<th>Name</th>
<th>John F. Kennedy</th>
<th>Lyndon Johnson</th>
<th>Richard Nixon</th>
etc.
I want to parse out the names of candidates into an array (dropping the "Name").
So I first tried
candidates = match(rawString,"<th>.*</th>");
which returned the whole list.
Then I tried
candidates = match(rawString,"<th>.{1,50}</th>");
which returns only
<th>Name</th>
The Processing documentation says:
If there are groups (specified by sets of parentheses) in the regexp, then the contents of each will be returned in the array. Element [0] of a regexp match returns the entire matching string, and the match groups start at element [1] (the first group is [1], the second [2], and so on).
So now I've been trying various combinations of groups and quantifiers, like:
candidates = match(rawString,"(<th>.{1,50}</th>)*");
But there must be some conceptual piece I'm not getting, because nothing is working. Seems like this should be easy, right?
Parsing HTML with regular expressions is usually not a good idea, but you might get by with it here.
Your problem appears to have been that .* matches greedily, i.e. as many characters as possible, thereby matching everything from the very first <th> to the very last </th> in your string.
Making it lazy, i.e. telling the quantifier to match as little as possible, is one solution:
<th>.*?</th>
would probably work.
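To see the difference between the greedy and the lazy quantifier, here is a minimal sketch in plain Java, which Processing's regex functions are built on. The class name, the helper firstMatch and the sample string are made up for illustration, and the sample is kept on one line so that the dot can span all of it:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsLazy {
    // Hypothetical helper: returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Assumed sample input, shortened from the table in the question.
        String raw = "<th>Name</th><th>John F. Kennedy</th><th>Lyndon Johnson</th>";

        // Greedy: .* runs to the end and backtracks to the LAST </th>.
        System.out.println(firstMatch("<th>.*</th>", raw));

        // Lazy: .*? stops at the FIRST </th> it can reach.
        System.out.println(firstMatch("<th>.*?</th>", raw));
    }
}
```

The first call prints the whole sample string, the second prints only <th>Name</th>.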
A bit more stable and marginally faster: tell the engine exactly what it's allowed to match, for example:
<th>[^<>]*</th>
[^<>] means "any character except angle brackets".
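As a rough illustration of the negated character class, the sketch below adds a pair of parentheses around it so that group 1 holds just the cell text, as described in the Processing documentation quoted in the question. It is plain Java; the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClass {
    // Hypothetical helper: returns the text of the first <th> cell, or null if none matches.
    static String firstCell(String input) {
        // [^<>]* cannot cross a tag boundary, so the match can never spill into the next cell.
        Matcher m = Pattern.compile("<th>([^<>]*)</th>").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstCell("<th>Name</th><th>John F. Kennedy</th>"));
        // A cell containing nested markup is not matched at all.
        System.out.println(firstCell("<th><b>Name</b></th>"));
    }
}
```

Note the trade-off: a cell that contains other tags is skipped entirely, which is usually safer than matching too much.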
You will run into problems if you ever try to match nested structures with regular expressions. It can be done in modern regex flavors, but it's very hard to do right. Add HTML comments and strings to the mix (that might contain the very delimiters you're matching against) and you're in for a world of hurt.
You probably want the matchAll method if you expect your expression to match multiple times. match only looks for the first occurrence of your pattern, so it returns only the first result found.
http://www.processing.org/reference/matchAll_.html
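For comparison, here is a minimal sketch of the same "find every match" idea in plain Java, looping over Matcher.find() instead of calling Processing's matchAll. The class and method names are hypothetical, and the pattern assumes the cells contain no nested tags:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CandidateNames {
    // Hypothetical helper: collects the text of every <th> cell except the "Name" header.
    static List<String> candidates(String html) {
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("<th>([^<>]*)</th>").matcher(html);
        // Each call to find() resumes where the previous match ended.
        while (m.find()) {
            String cell = m.group(1).trim();
            if (!cell.equals("Name")) {
                names.add(cell);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Assumed sample input, copied from the table in the question.
        String raw = "<th>Name</th>\n<th>John F. Kennedy</th>\n"
                   + "<th>Lyndon Johnson</th>\n<th>Richard Nixon</th>";
        System.out.println(candidates(raw));
    }
}
```

In a Processing sketch, matchAll with the same pattern should give the equivalent result as a two-dimensional array, with the captured name in element [1] of each row.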