I'm trying to obtain the keywords from an HTML page that I'm scraping with PHP.
So, if the keywords tag looks like this:
<meta name="Keywords" content="MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary">
I want to get this back:
MacUpdate, Mac Software, Macintosh Software, Mac Games, Macintosh Games, Apple, Macintosh, Software, iphone, ipod, Games, Demos, Shareware, Freeware, MP3, audio, sound, macster, napster, macintel, universal binary
I've constructed a regex, but it's not doing the trick.
(?i)^(<meta name=\"keywords\" content=\"(.*)\">)
Any ideas?
I would use an HTML/XML parser like DOMDocument and XPath to retrieve the nodes from the DOM:
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$keywords = $xpath->query('//meta[translate(normalize-space(@name), "KEYWORDS", "keywords")="keywords"]/@content');
foreach ($keywords as $keyword) {
echo $keyword->value;
}
The translate function is necessary because PHP's XPath implementation only supports XPath 1.0, which has no lower-case function.
Or you do the filtering with PHP:
$metas = $xpath->query('//meta');
foreach ($metas as $meta) {
if ($meta->hasAttribute("name") && trim(strtolower($meta->getAttribute("name")))=='keywords' && $meta->hasAttribute("content")) {
echo $meta->getAttribute("content"); // getAttribute() already returns a string
}
}
Use the function get_meta_tags();
Tutorial
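A minimal sketch of how get_meta_tags() could be used here. It assumes the page is available as a file or URL (the function reads from a path, not from a string), so this example writes a shortened sample page to a temporary file first; the sample HTML and variable names are only for illustration. The function lowercases the name attribute, so "Keywords" ends up under the key "keywords".

```php
<?php
// Shortened sample page; in practice you would pass the real file path or URL.
$html = '<html><head><meta name="Keywords" content="MacUpdate, Mac Software, Apple"></head><body></body></html>';

// get_meta_tags() expects a filename or URL, so store the sample in a temp file.
$file = tempnam(sys_get_temp_dir(), 'meta');
file_put_contents($file, $html);

// Returns an array keyed by the lowercased name attribute.
$tags = get_meta_tags($file);
unlink($file);

// Fall back to an empty string if the page has no keywords tag.
$keywords = isset($tags['keywords']) ? $tags['keywords'] : '';
echo $keywords;
```

Note that get_meta_tags() stops parsing at the closing head tag, so meta tags placed in the body are not returned.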
Stop using regex. It's slow, resource intensive, and not very nimble.
If you're programming in PHP, check out http://simplehtmldom.sourceforge.net/ - Simple HTML DOM is powerful enough to get you everything you need in a very simple object-oriented way.
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
Another example -
// Example
$html = str_get_html("<div>foo <b>bar</b></div>");
$e = $html->find("div", 0);
echo $e->tag; // Returns: " div"
echo $e->outertext; // Returns: " <div>foo <b>bar</b></div>"
echo $e->innertext; // Returns: " foo <b>bar</b>"
echo $e->plaintext; // Returns: " foo bar"
(.*) matches everything up to the LAST double quote it can reach (the last one on the line, since . does not match newlines by default), which is not what you want. Regex quantifiers are greedy by default. You need to use
content=\"(.*?)\"
or
content=\"([^\"]*)\"
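To see the difference, here is a small sketch comparing the three patterns on a made-up line that contains two meta tags. The sample string and variable names are assumptions for illustration only.

```php
<?php
// Two tags on one line, so the greedy pattern has a later quote to run to.
$html = '<meta name="keywords" content="a, b"><meta name="author" content="x">';

// Greedy: .* runs to the last double quote on the line.
preg_match('/content="(.*)"/', $html, $greedy);

// Lazy: .*? stops at the first double quote that lets the match succeed.
preg_match('/content="(.*?)"/', $html, $lazy);

// Negated class: [^"]* cannot cross a double quote at all.
preg_match('/content="([^"]*)"/', $html, $negated);

echo $greedy[1], "\n";  // a, b"><meta name="author" content="x
echo $lazy[1], "\n";    // a, b
echo $negated[1], "\n"; // a, b
```

The negated character class is usually the safer of the two fixes, because it can never run past the closing quote regardless of backtracking.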
Stop trying to parse HTML with regular expressions.
RegEx match open tags except XHTML self-contained tags
(?i)<meta\\s+name=\"keywords\"\\s+content=\"(.*?)\">
Would produce something like:
preg_match('~<meta\s+name="keywords"\s+content="(.*?)">~i', $html, $matches);
This is a simple regex that matches the first meta keywords tag. It only allows letters, digits, legal URL characters, HTML entities and spaces to appear inside the content attribute.
$matches = array();
preg_match("/<meta name=\"Keywords\" content=\"([\w\d;,\.: %&#\/\\\\]*)\"/", $html, $matches);
echo $matches[1];