I'm trying to parse search results for WorldCat.org in order to fetch basic information about books and articles.
A typical search result (and the one I'm using for testing) can be found here: http://www.worldcat.org/search?q=ti%3Aorganizations&fq=dt%3Abks&qt=advanced&dblist=638
The html for that page is here: http://pastebin.com/w2U91F1i
Here is the regular expression I'm using with PHP preg_match_all to capture basic details about each entry:
$data = file_get_contents($url);
preg_match_all('/<div class="oclc_number">(.*?)<\/div>\n.*?<div class="name">\n.*?<a href="(.*?)"><strong>(.*?)<\/strong><\/a>\n.*?\n\n<div class="author">by\s(.*?)<\/div><div class="type">.*?<span class=\'itemType\'>(.*?)<\/span>.*?\n.*?<span class="itemLanguage">(.*?)<\/span>.*?<div class="type">Publication:\s*?(.*?)<\/div>/', $data, $topics, PREG_SET_ORDER);
When I use this expression with the regexr tool (http://gskinner.com/RegExr/) it works just fine (except I use \r instead of \n -- usually \r doesn't work for me). But preg_match_all gives me an empty array each time.
Any clues as to what I'm doing wrong?
Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library, which parses the HTML into a traversable tree of PHP objects that you can query much like you would with jQuery.
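To show the idea of querying a parsed tree instead of matching raw text, here is a minimal sketch. It uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM Parser, so that it runs without a third-party include. The sample markup is hypothetical: it is reconstructed from the class names in the question's regex, and the wrapping <td class="result"> element and the parseResults() helper are my own assumptions, not the real WorldCat structure.

```php
<?php
// Hypothetical sample markup based on the class names in the question's regex.
// The <td class="result"> wrapper is an assumption; adjust it to the real page.
$html = <<<'HTML'
<table>
<tr><td class="result">
<div class="oclc_number">12345</div>
<div class="name">
<a href="/title/example/oclc/12345"><strong>Example title</strong></a>
</div>
<div class="author">by Jane Doe</div>
<div class="type"><span class='itemType'>Book</span>
<span class="itemLanguage">English</span></div>
</td></tr>
</table>
HTML;

// Hypothetical helper: returns one associative array per search result.
function parseResults(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; keep libxml warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $results = [];
    // Match elements whose class list contains the token "result".
    $rows = $xpath->query('//td[contains(concat(" ", normalize-space(@class), " "), " result ")]');
    foreach ($rows as $row) {
        // Evaluate an XPath expression relative to the current result and trim it.
        $text = function (string $expr) use ($xpath, $row): string {
            return trim($xpath->evaluate("string($expr)", $row));
        };
        $results[] = [
            'oclc'     => $text('.//div[@class="oclc_number"]'),
            'url'      => $text('.//div[@class="name"]/a/@href'),
            'title'    => $text('.//div[@class="name"]/a/strong'),
            // Drop the leading "by " that precedes the author name.
            'author'   => preg_replace('/^by\s+/', '', $text('.//div[@class="author"]')),
            'type'     => $text('.//span[@class="itemType"]'),
            'language' => $text('.//span[@class="itemLanguage"]'),
        ];
    }
    return $results;
}

print_r(parseResults($html));
```

Because the queries run against the parsed tree, differences in whitespace and line endings between the live page and a copy pasted into a regex tester no longer matter. For the real page you would pass the string returned by file_get_contents($url) and adapt the XPath expressions to the actual markup.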
HTML is not a regular language; don't try to parse it with regular expressions!
Read the first answer here:
RegEx match open tags except XHTML self-contained tags