I need some guidelines on how to detect the headline and content of crawled pages. I've been seeing some very weird fro开发者_如何学JAVAnt-end codework since i started working on this crawler.
You could try the Simple HTML DOM Parser. It sports a syntax to find specific elements similar to jQuery.
They have an example on how to scrape Slashdot:
// Create DOM from URL
$html = file_get_html('http://slashdot.org/');
// Find all article blocks
foreach($html->find('div.article') as $article) {
$item['title'] = $article->find('div.title', 0)->plaintext;
$item['intro'] = $article->find('div.intro', 0)->plaintext;
$item['details'] = $article->find('div.details', 0)->plaintext;
$articles[] = $item;
}
print_r($articles);
精彩评论