I have a web bot which extracts some data from a website. The problem is that the HTML content is sent without line breaks, which makes it harder to match certain things, so I need to extract everything that is between td tags. Here's an example string:
<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> (<b><font color="#3300cc">Used</font></b>)</td><td><a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> (<b><font color="#3300cc">Used</font></b>)</td>
And my regex so far:
<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>
But my regex matches the whole line as a single match instead of matching each cell's contents separately. Any ideas?
Don't waste your time on regexes. Use DOM and XPath.
$dom = new DOMDocument(); $dom->loadHTML($html); // loadHTML() returns a bool, so it cannot be chained
$links = $dom->getElementsByTagName('a');
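The question does not say which language the bot is written in, so here is a minimal sketch of the same "parse instead of pattern-match" idea in Python. It is an assumption-laden illustration, not the answer's method: it swaps DOM and XPath for the standard library's event-based html.parser, and the class name LinkCollector is made up for this example.

```python
from html.parser import HTMLParser

# Sample input adapted from the question: two cells on a single line.
html = (
    '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> '
    '(<b><font color="#3300cc">Used</font></b>)</td><td>'
    '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> '
    '(<b><font color="#3300cc">Used</font></b>)</td>'
)


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects (href, text) for every <a href=...>."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href leave _href as None and are skipped.
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = LinkCollector()
collector.feed(html)
collector.close()
links = collector.links
```

Because the parser works on tags rather than lines, the missing line breaks stop mattering; the stray closing td at the start of the sample is simply ignored.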
Have you tried changing each .+ to the non-greedy .+? instead?
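To see what the greedy/non-greedy difference does on the sample string, here is a small sketch in Python (the thread does not name the asker's language, so Python is an assumption). The non-greedy variant also drops the (?!<td>) lookahead and adds two capture groups; those changes are mine, for illustration only.

```python
import re

# Sample input adapted from the question: two cells on a single line.
html = (
    '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> '
    '(<b><font color="#3300cc">Used</font></b>)</td><td>'
    '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> '
    '(<b><font color="#3300cc">Used</font></b>)</td>'
)

# The pattern from the question: each greedy .+ runs as far right as it can.
greedy = re.compile(
    r'<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>'
)

# Non-greedy variant: each .+? stops at the first place the rest can match.
lazy = re.compile(
    r'<a\s+class="a"\s+href="javascript:ow\((.*?)\)">(.+?)</a>(.+?)</td>'
)

greedy_matches = [m.group(0) for m in greedy.finditer(html)]
lazy_matches = [m.groups() for m in lazy.finditer(html)]

print(len(greedy_matches), len(lazy_matches))
```

With the greedy pattern the first .+ stretches to the last closing a tag and the second to the last closing td tag, so the whole line comes back as one match; the non-greedy version yields one match per cell.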
Can you determine where the proper line breaks SHOULD be? If so, it might be easier to first replace those tokens with a proper line break and then use the pattern you have (assuming that pattern works - I haven't tried it).
Your pattern looks VERY specific, but perhaps it works fine for what you are doing.
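A rough sketch of that idea in Python (an assumed language, since the question names none), treating each closing td tag as the place where a line break should be. Once the breaks are in, the pattern from the question can be used unchanged, because . does not match a newline by default.

```python
import re

# Sample input adapted from the question: two cells on a single line.
html = (
    '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> '
    '(<b><font color="#3300cc">Used</font></b>)</td><td>'
    '<a class="a" href="javascript:ow(19623507)">**-**-**-***.cstel.net</a> '
    '(<b><font color="#3300cc">Used</font></b>)</td>'
)

# Assumption: every cell ends with </td>, so break the line right after it.
broken = html.replace("</td>", "</td>\n")

# The original pattern, unchanged. Each .+ is still greedy, but it can no
# longer run past the end of its own line.
pattern = re.compile(
    r'<a\s+class="a"\s+href="javascript:ow\((.*?)\)">.+</a>(?!<td>).+</td>'
)

rows = [m.group(0) for m in pattern.finditer(broken)]
ids = [m.group(1) for m in pattern.finditer(broken)]

print(len(rows), ids)
```

This only works if the chosen token really does mark every cell boundary; nested tables or td tags with different casing would need a more careful replacement.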