I want to scrape a website, let's say CNN, every hour and add any titles in an H1 as a new row in my MySQL table. How do I do that?
I don't expect anyone to do the whole work for you, but here's something to get you started.
First of all, you need to get the actual page source. You can use file_get_contents or curl for this; there's plenty of information on how to do that around here.
Then you need to scrape CNN for all H1 tags. A simple way to do this is to use DOMDocument. Here is a simple function to get all H1 headings from an HTML source:
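For the fetching step, here's a minimal sketch using file_get_contents. The helper name fetch_html, the timeout value and the User-Agent string are all placeholders I made up, so adjust them to your setup. It returns null on failure so the caller can skip that run.

```php
<?php
// Sketch: fetch a page's HTML. fetch_html() is a hypothetical helper name.
function fetch_html(string $url, int $timeout = 10): ?string
{
    // These options only apply to http(s) URLs.
    // The User-Agent value is a placeholder.
    $context = stream_context_create([
        'http' => [
            'timeout'    => $timeout,
            'user_agent' => 'my-headline-scraper/0.1',
        ],
    ]);

    // @ hides the warning on failure; the return value is checked instead.
    $html = @file_get_contents($url, false, $context);

    return $html === false ? null : $html;
}
```

Fetching http(s) URLs this way requires allow_url_fopen to be enabled; if it isn't on your host, curl is the usual fallback.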
function get_h1($html) {
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // Suppress warnings if our HTML is not well formed
    $headings = $dom->getElementsByTagName("h1");
    $retval = array();
    foreach ($headings as $header) {
        // nodeValue is the text content of the <h1> element
        $retval[] = $header->nodeValue;
    }
    return $retval;
}
Note that this does not account for different encodings etc.
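On the encoding point, one common workaround is to prefix the markup with an XML encoding declaration so the parser reads it as UTF-8. The sketch below assumes the page really is UTF-8 (check the response headers or the meta charset if unsure), and get_h1_utf8 is just a name I picked for the variant.

```php
<?php
// Sketch: like get_h1(), but hints to libxml that the input is UTF-8.
// get_h1_utf8() is a hypothetical name; UTF-8 input is an assumption.
function get_h1_utf8(string $html): array
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Without a charset hint the parser may fall back to ISO-8859-1;
    // the declaration prefix asks it to read the bytes as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headings = [];
    foreach ($dom->getElementsByTagName('h1') as $h1) {
        // Trim the whitespace that often surrounds heading text.
        $headings[] = trim($h1->nodeValue);
    }
    return $headings;
}
```

Using libxml_use_internal_errors instead of @ keeps the warnings out of the output while still letting you inspect them with libxml_get_errors if you need to debug.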
Another option for parsing is to use the excellent PHP Simple HTML DOM Parser.
You then need to save the headings into your database. You can use the mysqli or PDO libraries for this.
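As a rough sketch of the saving step with PDO: the table name headlines and the column title are assumptions, so swap in your own schema. A prepared statement keeps quotes in headlines from breaking the query.

```php
<?php
// Sketch: insert each heading as a new row using a prepared statement.
// The table `headlines` and column `title` are assumed names.
function save_headings(PDO $pdo, array $headings): int
{
    $stmt = $pdo->prepare('INSERT INTO headlines (title) VALUES (:title)');

    $inserted = 0;
    foreach ($headings as $heading) {
        $stmt->execute([':title' => $heading]);
        // rowCount() is 1 for each successful single-row insert.
        $inserted += $stmt->rowCount();
    }
    return $inserted;
}
```

You would pass in a connection created with something like new PDO('mysql:host=localhost;dbname=news;charset=utf8mb4', $user, $pass), where the host, database name and credentials are placeholders.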
Lastly, you need to run this hourly. Do this using a cron job. You can find information about how to set up your cron jobs here.
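For the hourly schedule, a crontab entry along these lines should do it. Both paths are hypothetical, and the location of the php binary varies between systems (which php will tell you).

```
# minute hour day-of-month month day-of-week  command
0 * * * * /usr/bin/php /path/to/scrape.php >> /path/to/scrape.log 2>&1
```

This runs the script at minute 0 of every hour and appends its output and errors to the log file. Add the line with crontab -e.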
This should help to get you started. You probably want to add some more features to this, like ensuring you're not adding duplicate headings.
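For the duplicate check, one simple approach is to look each heading up before inserting it. This is only a sketch: filter_new_headings is a made-up name, and the table headlines with a title column is again an assumption.

```php
<?php
// Sketch: keep only headings that are not already stored.
// Table `headlines` and column `title` are assumed names.
function filter_new_headings(PDO $pdo, array $headings): array
{
    $stmt = $pdo->prepare('SELECT COUNT(*) FROM headlines WHERE title = :title');

    $new = [];
    // array_unique() also drops repeats within the same scrape.
    foreach (array_unique($headings) as $heading) {
        $stmt->execute([':title' => $heading]);
        if ((int) $stmt->fetchColumn() === 0) {
            $new[] = $heading;
        }
        $stmt->closeCursor();
    }
    return $new;
}
```

An alternative is to put a UNIQUE index on the title column and use MySQL's INSERT IGNORE, which lets the database reject duplicates and avoids the gap between the check and the insert.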
You know, this has me curious. I was just playing around with Node.js. I bet server-side jQuery and AJAX could really knock something like this out in a flash. Not sure about connecting to the database though, but the parsing would be a cakewalk.