Having trouble parsing website with regular expressions

Developer https://www.devze.com 2023-01-26 16:16 Source: web
I'm trying to parse search results for WorldCat.org in order to fetch basic information about books and articles.

A typical search result (and the one I'm using for testing) can be found here: http://www.worldcat.org/search?q=ti%3Aorganizations&fq=dt%3Abks&qt=advanced&dblist=638

The html for that page is here: http://pastebin.com/w2U91F1i

Here is the regular expression I'm using with PHP preg_match_all to capture basic details about each entry:

$data = file_get_contents($url);
preg_match_all('/<div class="oclc_number">(.*?)<\/div>\n.*?<div class="name">\n.*?<a href="(.*?)"><strong>(.*?)<\/strong><\/a>\n.*?\n\n<div class="author">by\s(.*?)<\/div><div class="type">.*?<span class=\'itemType\'>(.*?)<\/span>.*?\n.*?<span class="itemLanguage">(.*?)<\/span>.*?<div class="type">Publication:\s*?(.*?)<\/div>/', $data, $topics, PREG_SET_ORDER);

When I use this expression with the regexr tool (http://gskinner.com/RegExr/) it works just fine (except I use \r instead of \n -- usually \r doesn't work for me). But preg_match_all gives me an empty array each time.

Any clues as to what I'm doing wrong?


Whenever I need to scrape HTML, I tend to use the Simple HTML DOM Parser library, which parses an HTML document into a traversable PHP object that you can query much like jQuery.
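
For comparison, here is a minimal sketch of the same DOM-based idea using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM Parser, so it runs without installing anything. The sample markup, the wrapping "result" div, and the firstText() helper are assumptions made for illustration; the class names are taken from the regex in the question, and the real WorldCat page may well be laid out differently.

```php
<?php
// Hypothetical sample markup, reconstructed from the class names in the
// question's regex; the real WorldCat page may be structured differently.
$data = <<<'HTML'
<html><body>
<div class="result">
  <div class="oclc_number">12345</div>
  <div class="name">
    <a href="/title/example/oclc/12345"><strong>Example title</strong></a>
  </div>
  <div class="author">by Jane Doe</div>
  <div class="type"><span class='itemType'>Book</span>
    <span class="itemLanguage">English</span></div>
</div>
</body></html>
HTML;

// Hypothetical helper: trimmed text of the first node matching $query, or null.
function firstText(DOMXPath $xpath, string $query, DOMNode $context): ?string
{
    $nodes = $xpath->query($query, $context);
    return ($nodes !== false && $nodes->length > 0)
        ? trim($nodes->item(0)->textContent)
        : null;
}

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings quietly
$doc->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$topics = [];

// Assumption: each oclc_number div sits inside one wrapper element per result.
foreach ($xpath->query('//div[@class="oclc_number"]') as $oclc) {
    $entry  = $oclc->parentNode;
    $link   = $xpath->query('.//div[@class="name"]//a', $entry)->item(0);
    $author = firstText($xpath, './/div[@class="author"]', $entry);

    $topics[] = [
        'oclc'     => trim($oclc->textContent),
        'url'      => $link ? $link->getAttribute('href') : null,
        'title'    => $link ? trim($link->textContent) : null,
        // Drop the leading "by " that the page puts in front of the author.
        'author'   => $author === null ? null : preg_replace('/^by\s+/', '', $author),
        'type'     => firstText($xpath, './/span[@class="itemType"]', $entry),
        'language' => firstText($xpath, './/span[@class="itemLanguage"]', $entry),
    ];
}

print_r($topics);
```

Because the queries address elements by class rather than by the exact whitespace between tags, differences such as \n versus \r\n line endings in the fetched page should not matter here.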


HTML is not a regular language; don't try to parse it with regular expressions!

Read the first answer here:

RegEx match open tags except XHTML self-contained tags
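
To make the point concrete, here is a small sketch of how a pattern in the style of the question's regex can go wrong as soon as tags nest. The markup is a made-up example, not taken from WorldCat, and DOMDocument is used only as one readily available parser.

```php
<?php
// Hypothetical snippet: a "name" div that itself contains another div.
$html = '<div class="name"><div class="badge">New</div>Example title</div>';

// The lazy group stops at the first closing tag, which belongs to the inner div.
preg_match('/<div class="name">(.*?)<\/div>/s', $html, $m);
$regexCapture = $m[1];
// → '<div class="badge">New'

// A DOM parser tracks nesting, so the outer element comes back whole.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings quietly
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$domText = $xpath->query('//div[@class="name"]')->item(0)->textContent;
// → 'NewExample title'
```

The regex has no notion of which closing tag pairs with which opening tag, so it silently returns a truncated capture instead of failing loudly.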
