I'm having a hard time conceiving of a way to scrape this page: http://www.morewords.com/ends-with/aw for the words themselves. Given a URL, I'd like to fetch the contents and then generate a PHP array of all the words listed, which in the source look like:
<a href="/word/word1/">word1</a><br />
<a href="/word/word2/">word2</a><br />
<a href="/word/word3/">word3</a><br />
<a href="/word/word4/">word4</a><br />
There are a few ways I have been thinking about doing this; I'd appreciate it if you could help me decide on the most efficient one, and I'd welcome any advice or examples on how to achieve it. I understand it's not incredibly complicated, but I could use the help of you advanced hackers.
- Use some sort of jQuery $.each() to loop through and somehow cast the words into a JS array, then transcribe it (probably heavily taxing)
- Use some sort of cURL (I don't really have much experience with cURL)
- Use some sophisticated find-and-replace with regex
You tagged it as PHP, so here is a PHP solution :)
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.morewords.com/ends-with/aw');

$anchors = $dom->getElementsByTagName('a');
$words = array();

foreach ($anchors as $anchor) {
    // Only keep anchors whose href matches the /word/.../ pattern
    if ($anchor->hasAttribute('href')
            && preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
        $words[] = $anchor->nodeValue;
    }
}
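The regex route mentioned in the question would also work for markup this regular, though the DOM approach above is more robust. A minimal sketch, using the sample lines from the question as stand-in input (in practice `$html` would come from `file_get_contents()` or cURL):

```php
<?php
// Sample markup standing in for the fetched page source.
$html = '<a href="/word/word1/">word1</a><br />'
      . '<a href="/word/word2/">word2</a><br />'
      . '<a href="/word/word3/">word3</a><br />'
      . '<a href="/word/word4/">word4</a><br />';

// Capture the link text of every anchor whose href matches /word/.../
preg_match_all('~<a href="/word/\w+/">(\w+)</a>~', $html, $matches);

$words = $matches[1]; // array('word1', 'word2', 'word3', 'word4')
```

This breaks as soon as the markup varies (extra attributes, whitespace), which is why parsing with DOMDocument is generally preferred.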
If allow_url_fopen is disabled in php.ini, you could use cURL to get the HTML.
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://www.morewords.com/ends-with/aw');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); // return the response instead of echoing it
$html = curl_exec($curl);
curl_close($curl);
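Once cURL has returned the markup in `$html`, you can feed it to `DOMDocument::loadHTML()` instead of `loadHTMLFile()` and run the same anchor-filtering loop. A sketch, with a sample string standing in for the cURL result so it runs offline:

```php
<?php
// Sample string standing in for the $html returned by curl_exec().
$html = '<a href="/word/word1/">word1</a><br />'
      . '<a href="/word/word2/">word2</a><br />';

$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings about imperfect markup

$words = array();
foreach ($dom->getElementsByTagName('a') as $anchor) {
    // Same filter as before: only hrefs of the form /word/.../
    if ($anchor->hasAttribute('href')
            && preg_match('~/word/\w+/~', $anchor->getAttribute('href'))) {
        $words[] = $anchor->nodeValue;
    }
}
// $words now holds array('word1', 'word2')
```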