Crawling an HTML page using PHP?

Developer https://www.devze.com 2023-01-20 00:40 Source: Internet
This website lists over 250 courses in a single list. I want to get the name of each course and insert it into my MySQL database using PHP. The courses are listed like this:

<td> computer science</td>
<td> media studies</td>
…

Is there a way to do that in PHP, instead of me having a mad data entry nightmare?


Regular expressions work well.

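// get the page
$page = file_get_contents('http://courses.westminster.ac.uk/CourseList.aspx');
// split it into lines (handles both \n and \r\n line endings)
$lines = preg_split("/\r?\n/", $page);
foreach ($lines as $text) {
    $matches = array();
    // match a line that consists of a single <td>...</td> cell
    if (preg_match("/^\s*<td>(.*)<\/td>\s*$/", $text, $matches)) {
        $name = trim($matches[1]);
        // insert $name into the database
    }
}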

See the documentation for preg_match.
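
As a minimal sketch of the same idea, the whole page can also be scanned in one call with preg_match_all instead of going line by line. The function name extractCourseNames and the sample markup are made up for illustration, and the pattern assumes every cell is written exactly as <td>...</td> with no attributes or nested tags, which may not hold for the real page.

```php
<?php
// Hypothetical helper: pull the text of every plain <td>...</td> cell out of an HTML string.
function extractCourseNames(string $html): array
{
    // Non-greedy capture so that two cells on the same line are matched separately.
    preg_match_all('/<td>(.*?)<\/td>/s', $html, $matches);

    $names = [];
    foreach ($matches[1] as $cell) {
        $name = trim($cell);
        if ($name !== '') {
            $names[] = $name; // keep only non-empty cells
        }
    }
    return $names;
}

// Sample markup in the shape shown in the question (an assumption about the real page).
$sample = "<tr><td> computer science</td></tr>\n<tr><td> media studies</td></tr>";
print_r(extractCourseNames($sample));
```

If the cells turn out to carry attributes or links, the pattern would need to be loosened, which is where a real parser becomes the easier option.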


How to parse HTML has been asked and answered countless times before. While regular expressions will work for your specific use case, it is generally better and more reliable to use a proper parser for this task. Below is how to do it with DOM:

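$dom = new DOMDocument;
// fetch and parse the page (malformed markup may trigger libxml warnings)
$dom->loadHTMLFile('http://courses.westminster.ac.uk/CourseList.aspx');
// walk over every <td> element and print its text, one per line
foreach ($dom->getElementsByTagName('td') as $title) {
    echo trim($title->nodeValue), PHP_EOL;
}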
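
To try the DOM approach without hitting the network, the same loop can be run against an HTML string. The following is only a sketch: the function name courseNamesFromHtml and the sample table are hypothetical, and it assumes every <td> on the page holds a course name.

```php
<?php
// Hypothetical helper: return the trimmed text of every <td> in an HTML string.
function courseNamesFromHtml(string $html): array
{
    // Real-world pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $names = [];
    foreach ($dom->getElementsByTagName('td') as $cell) {
        // textContent also includes text from nested tags such as <a>
        $name = trim($cell->textContent);
        if ($name !== '') {
            $names[] = $name;
        }
    }
    return $names;
}

// Made-up sample; the second cell wraps the name in a link.
$sample = '<table><tr><td> computer science</td></tr>'
        . '<tr><td><a href="#">media studies</a></td></tr></table>';
foreach (courseNamesFromHtml($sample) as $name) {
    echo $name, PHP_EOL;
}
```

Unlike a regular expression anchored on <td>...</td>, this keeps working when cells carry attributes or nested markup.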

For inserting the data into MySQL, you should use the mysqli extension. Examples are plentiful on Stack Overflow, so please use the search function.
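
For orientation, here is a rough sketch of what the insert step could look like with mysqli prepared statements. The table name courses, the column name, the connection details and both function names are assumptions made for illustration; adapt them to your own schema.

```php
<?php
// Hypothetical helper: trim names, drop empty strings and duplicates, keep first-seen order.
function normalizeCourseNames(array $raw): array
{
    $trimmed = array_map('trim', $raw);
    $nonEmpty = array_filter($trimmed, static function (string $n): bool {
        return $n !== '';
    });
    return array_values(array_unique($nonEmpty));
}

// Insert each name with a prepared statement and return the number of rows inserted.
// Assumes a table `courses` with a text column `name` (both names are made up here).
function insertCourses(mysqli $db, array $names): int
{
    $stmt = $db->prepare('INSERT INTO courses (name) VALUES (?)');
    if ($stmt === false) {
        throw new RuntimeException('prepare failed: ' . $db->error);
    }

    $name = '';
    $stmt->bind_param('s', $name); // bound by reference; updating $name changes the bound value

    $inserted = 0;
    foreach (normalizeCourseNames($names) as $value) {
        $name = $value;
        if ($stmt->execute()) {
            $inserted++;
        }
    }
    $stmt->close();
    return $inserted;
}

// Usage (connection details are placeholders):
// $db = new mysqli('localhost', 'user', 'password', 'mydb');
// echo insertCourses($db, ['computer science', 'media studies']);
```

Preparing the statement once and re-executing it avoids building SQL strings by hand, so course names containing quotes do not break the query.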


You can use this HTML parsing library for PHP to achieve this: http://simplehtmldom.sourceforge.net/


I encountered the same problem. A good class library for this is Simple HTML DOM: http://simplehtmldom.sourceforge.net/. It works much like jQuery.


Just for fun, here's a quick shell script to do the same thing.

curl http://courses.westminster.ac.uk/CourseList.aspx \
| sed '/<td>\(.*\)<\/td>/ { s/.*">\(.*\)<\/a>.*/\1/; b }; d;' \
| uniq > courses.txt