I have a problem: a parser that does not parse. It does not work and returns nothing. I want to get the links back and store the results in a MySQL database.
<?PHP
// Original PHP code by Chirp Internet: http://www.chirp.com.au
// Please acknowledge use of this code by including this header.
$url = "http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de";
//$input = @file_get_contents($url) or die("Could not access file: $url");
$input = file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
{
    foreach($matches as $match)
    {
        // $match[2] = link address: all the data I want to collect...
        // $match[3] = link text that I need to collect - see a detail page
    }
}
?>
This goes a bit over my head: the script does not return any results. Do I have to use file_get_contents() differently when the URL contains a query string?
Works fine here:
$url = "http://www.edi.admin.ch/esv/00475/00698/index.html?lang=de";
$doc = new DOMDocument();
// Suppress warnings for screwy HTML
@$doc->loadHTMLFile($url);
// Use DOM functionality to get all links
$link_list = $doc->getElementsByTagName('a');
$links = array();
foreach($link_list as $link) {
    if($link->getAttribute('href')) {
        // and put their href attributes and
        // text content in an array
        $link_info['href'] = $link->getAttribute('href');
        $link_info['text'] = $link->nodeValue;
        $links[] = $link_info;
    }
}
print_r($links);
Output:
Array
(
[0] => Array
(
[href] => #webNavigationDiv
[text] => Direkt zur Navigation [Alt + 1]
)
[1] => Array
(
[href] => #contentStart
[text] => Direkt zum Inhalt [Alt + 2]
)
[2] => Array
(
[href] => #keywords_fast
[text] => Direkt zur Suche [Alt + 5]
)
You're doing something that you shouldn't – parsing HTML with regex. Don't do it!
Use DOM parsing functions instead. PHP's DOMDocument class is quite easy to use, and much more legible (and stable) than regex:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$links = $dom->getElementsByTagName('a');
$hrefs = array();
foreach ($links as $link) {
    $hrefs[] = $link->getAttribute('href');
}
Getting other data, such as the text content or other attribute names, is trivially easy if you want to do so.
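To make the "other data" point concrete, here is a minimal sketch that collects the href, an optional title attribute, and the link text in one pass. The helper name extract_links() and the sample markup are made up for illustration and are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): collects href, title
// and link text for every <a> element in a non-empty HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    // Collect libxml warnings about malformed markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        if (!$link->hasAttribute('href')) {
            continue; // skip named anchors that have no target
        }
        $links[] = [
            'href'  => $link->getAttribute('href'),
            'title' => $link->getAttribute('title'), // empty string when absent
            'text'  => trim($link->textContent),     // text of nested tags included
        ];
    }
    return $links;
}

// Made-up sample markup, for illustration only.
$sample = '<p><a href="/a.html" title="First">One</a> '
        . '<a name="x">skip</a> <a href="#b">Two <b>bold</b></a></p>';
print_r(extract_links($sample));
```

Each entry is a plain array, so the rows can be handed to whatever storage layer you use afterwards.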
You can only use fopen-like functions such as file_get_contents() with a URL if the appropriate fopen wrapper is enabled (the allow_url_fopen setting in php.ini).
See: http://www.php.net/manual/en/filesystem.configuration.php#ini.allow-url-fopen
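A quick way to rule this out is to check the setting before fetching. The following is only a sketch: fetch_url() is a hypothetical helper name, and it throws exceptions where the question's code uses die().

```php
<?php
// Hypothetical helper (not from the original answer): fetches a URL with
// file_get_contents() only when the allow_url_fopen setting permits it.
function fetch_url(string $url): string
{
    // filter_var() copes with the "1"/"On"/"0"/"" spellings ini_get() may return.
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        throw new RuntimeException('allow_url_fopen is disabled in php.ini');
    }
    $input = @file_get_contents($url);
    // Compare against false explicitly: an empty page is not a failure.
    if ($input === false) {
        throw new RuntimeException("Could not access file: $url");
    }
    return $input;
}
```

Calling fetch_url($url) in place of the file_get_contents() line tells you whether the setting or the request itself is the problem.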
While I would second the "regex isn't good for HTML" advice, if this is just for a little script, who cares? That being said, DOMDocument and friends are easy enough to use.
Josh
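If you keep the regex for a quick script, note that the loop body in the question contains only comments, so nothing is printed even when the pattern matches. A minimal sketch that does something with the matches; the pattern is the one from the question, while regex_links() and the sample string are hypothetical additions:

```php
<?php
// Same pattern as in the question; only the handling of the matches is new.
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

// Hypothetical helper: returns href/text pairs instead of discarding them.
function regex_links(string $input, string $regexp): array
{
    $found = [];
    if (preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            // $match[2] is the link address, $match[3] the link text.
            $found[] = ['href' => $match[2], 'text' => $match[3]];
        }
    }
    return $found;
}

// Sample input modelled on one link from the output shown earlier.
print_r(regex_links('<a href="#contentStart">Direkt zum Inhalt</a>', $regexp));
```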