With PHP Curl, I want to scrape H1s into a database [closed]

Developer (https://www.devze.com), 2023-02-03 01:32, source: web

Related topics: curl, php

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 12 years ago.

I want to scrape a website, let's say CNN, every hour and add each title found in an H1 as a new row in my MySQL table. How do I do that?

Don't expect anyone to do the whole work for you, but here's something to get you started.

First of all, you need to get the actual source; you can use file_get_contents or curl for this. There's plenty of information about how to do that around here.
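As a rough illustration of the curl route, here is a minimal sketch. The function name fetch_html, the timeout default and the user-agent string are made up for this example; only the curl_* calls come from PHP's cURL extension, which has to be enabled.

```php
<?php
// Hypothetical helper (not from the original answer): download a page's
// HTML with the cURL extension. Returns null on a transport error or an
// HTTP status of 400 and above.
function fetch_html(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,      // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,      // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeout,  // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout,  // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'h1-scraper-example/0.1', // placeholder value
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

Calling fetch_html('https://www.cnn.com/') would then give you the string to pass on to the parsing step, or null if the request failed.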

Then you need to scrape the CNN source for all H1 tags. A simple way to do this is to use DOMDocument. Here is a simple function to get all headings from an HTML source:

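function get_h1($html) {
    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world HTML is rarely well formed
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Gather the text content of every <h1> element
    $retval = array();
    foreach ($dom->getElementsByTagName("h1") as $header) {
        $retval[] = trim($header->nodeValue);
    }

    return $retval;
}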

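Note that this does not account for different character encodings: unless the page declares its charset in a way the parser picks up, non-ASCII characters in the headings may come out garbled.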
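One possible way around the encoding issue, assuming the page is known to be UTF-8 and the mbstring extension is available, is to turn every non-ASCII character into a numeric entity before parsing. The function name get_h1_utf8 is made up for this sketch, and it is only one of several workarounds.

```php
<?php
// Hypothetical variant of get_h1() that ASSUMES the input is UTF-8.
function get_h1_utf8(string $html): array
{
    // Replace every character above ASCII with a numeric entity
    // (e.g. "é" becomes "&#233;") so the parser's encoding guess
    // no longer matters.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parser decodes the entities again, so nodeValue is proper UTF-8.
    $headings = [];
    foreach ($dom->getElementsByTagName('h1') as $h1) {
        $headings[] = trim($h1->nodeValue);
    }
    return $headings;
}
```

If the page uses another encoding, you would have to convert it to UTF-8 first, for example based on the charset in the Content-Type response header.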

Another option for parsing is to use the excellent PHP Simple HTML DOM Parser.

You then need to save the headings into your database; you can use the mysqli or PDO libraries for this.
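A minimal PDO sketch of the saving step could look like the following. The table name headings, the column name title, the function name save_headings and the connection details in the comment are all placeholders for this example, so adjust them to your own schema.

```php
<?php
// Hypothetical helper: insert each heading as a new row.
// ASSUMES a table `headings` with a text column `title` (illustrative names).
function save_headings(PDO $pdo, array $headings): int
{
    // A prepared statement keeps the scraped text from being read as SQL.
    $stmt = $pdo->prepare('INSERT INTO headings (title) VALUES (:title)');

    $inserted = 0;
    foreach ($headings as $title) {
        $stmt->execute([':title' => $title]);
        $inserted += $stmt->rowCount();
    }
    return $inserted; // number of rows written
}

// Example connection with placeholder credentials:
// $pdo = new PDO(
//     'mysql:host=localhost;dbname=scraper;charset=utf8mb4',
//     'user',
//     'secret',
//     [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]
// );
// save_headings($pdo, ['Example heading']);
```

The function takes the PDO handle as an argument, so the connection setup stays in one place.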

Lastly, you need to run this hourly. Do this using a cron job. You can find information about how to set up your cron jobs here.
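For illustration, an hourly crontab entry might look like this. Both paths are placeholders: the location of the php binary differs between systems, and scrape.php stands for whatever script you end up writing.

```
# minute hour day-of-month month day-of-week  command
0 * * * * /usr/bin/php /path/to/scrape.php
```

The leading 0 means "at minute zero", so the script runs once at the top of every hour.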

This should help to get you started. You probably want to add some more features to this, like ensuring you're not adding duplicate headings etc.
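One simple way to skip duplicates, sketched here with a made-up function name, is to compare the freshly scraped headings against the ones you have already stored before inserting anything:

```php
<?php
// Hypothetical helper: keep only headings that are not stored yet.
// $alreadyStored would typically come from a SELECT on your table.
function filter_new_headings(array $scraped, array $alreadyStored): array
{
    // Drop repeats within this scrape, then drop anything already known.
    $unique = array_unique($scraped);
    return array_values(array_diff($unique, $alreadyStored));
}
```

Alternatively, a UNIQUE index on the heading column lets the database itself reject repeated rows.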

You know, this has me curious. I was just playing around with Node.js. I bet server-side jQuery and AJAX could really knock something like this out in a flash. Not sure about connecting to the database though, but the parsing would be a cakewalk.