I built a site a long time ago, and it has since grown to 400+ pages. I now want to move the data into a database, without copying and pasting every page, so that I can make the site database-driven.
My site has meta tags like this (each page different):
<meta name="clan_name" content="Dark Mage" />
So what I'm doing is using cURL to place the entire HTML page in a variable as a string. I could also do it with fopen, etc., but I don't think it matters.
I need to sift through the string to find 'Dark Mage' and store it in a variable (so I can put it into SQL).
Any ideas on the best way to find Dark Mage to store in a variable? I was trying to use substr and then just subtracting the number of characters from the e in clan_name, but that was a bust.
Just parse the page using the PHP DOM functions, specifically loadHTML(). You can then walk the tree or use xpath to find the nodes you are looking for.
<?php
$doc = new DOMDocument;
$doc->loadHTML($html);
// Collect every <meta> element in the document
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
    // Read attributes from the current element, not from the node list
    $name = $data->getAttribute('name');
    if ($name == 'clan_name') {
        $content = $data->getAttribute('content');
        // TODO handle content for clan_name
    }
}
?>
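The XPath route mentioned in this answer can be sketched as follows. This is a minimal example, not the only way to do it: the sample `$html` string and the `$clanName` variable are assumptions made for illustration, and `libxml_use_internal_errors()` is added only to keep warnings from imperfect real-world markup quiet.

```php
<?php
// Sample markup standing in for one fetched page (hypothetical).
$html = '<html><head><meta name="clan_name" content="Dark Mage" /></head><body></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select the content attribute of the meta tag whose name is clan_name.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//meta[@name="clan_name"]/@content');

// Fall back to null when the page has no such meta tag.
$clanName = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
```

With XPath the name filter lives in the query itself, so no loop over all the meta tags is needed.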
EDIT: If you want to remove certain tags (such as <script>) before you load your HTML string into memory, try using the strip_tags() function. Something like this strips every tag except the meta tags (the text between the removed tags is left in place):
<?php
$html = strip_tags($html, '<meta>');
?>
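To see what that call leaves behind, here is a small sketch on a made-up page. The sample markup and the `$stripped`, `$hasMeta`, and `$hasParagraphTag` names are assumptions for illustration only.

```php
<?php
// Hypothetical sample page; real input would be the string fetched with cURL.
$html = '<html><head><title>Clan page</title>'
      . '<meta name="clan_name" content="Dark Mage" /></head>'
      . '<body><p>Welcome</p></body></html>';

// Allow only <meta>; every other tag is stripped, but its inner text stays.
$stripped = strip_tags($html, '<meta>');

// The meta tag survives, while the <p> markup is gone.
$hasMeta = strpos($stripped, 'content="Dark Mage"') !== false;
$hasParagraphTag = strpos($stripped, '<p>') !== false;
```

Because the text inside the removed tags remains in the string, this step reduces the markup but does not isolate the meta tags completely.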
Use a regular expression like the following, with PHP's preg_match():
/<meta name="clan_name" content="([^"]+)"/
If you're not familiar with regular expressions, read on.
The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.
The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:
[^"]
means "match any character that is not a double-quote".
The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:
[^"]+
means "match one or more characters that are not double-quotes".
Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:
([^"]+)
means "match one or more characters that are not double-quotes and store them as a matched subpattern".
In PHP, preg_match() stores matches in an array that you pass by reference. The full pattern is stored in the first element of the array, the first sub-pattern in the second element, and so forth if there are additional sub-patterns.
So, assuming your HTML page is in the variable "$page", the following code:
$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);
if ($found) {
$clan_name = $matches[1];
}
Should get you what you want.
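Since the site has 400+ pages, the same match could be wrapped in a small function and run over each page in turn. The sketch below is only one way to arrange it: `extractClanName()`, the `$pages` array, and the file names are hypothetical, the nullable return type assumes PHP 7.1 or later, and the `html_entity_decode()` call is an extra step added here so that entities such as `&amp;` are stored as plain text.

```php
<?php
// Hypothetical helper: returns the clan_name meta content, or null when absent.
function extractClanName(string $page): ?string
{
    $matches = array();
    if (preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches)) {
        // $matches[1] holds the first subpattern, i.e. the content value.
        return html_entity_decode($matches[1], ENT_QUOTES, 'UTF-8');
    }
    return null;
}

// Stand-in pages; in practice these would be the strings fetched with cURL.
$pages = array(
    'dark-mage.html' => '<head><meta name="clan_name" content="Dark Mage" /></head>',
    'no-meta.html'   => '<head><title>Untitled</title></head>',
);

// Map each page to its clan name, ready to be written to the database.
$clanNames = array();
foreach ($pages as $file => $page) {
    $clanNames[$file] = extractClanName($page);
}
```

Returning null for pages without the tag makes it easy to list the pages that still need manual attention.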
Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/
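One caveat with that pattern: `.+` is greedy, so if several meta tags share a line it can capture the content of the last one instead of the clan name. The sketch below compares it with a lazy `.+?` variant. The sample `$page` string (including the `level` meta tag) and the variable names are made up for illustration.

```php
<?php
// Hypothetical one-line page with two meta tags next to each other.
$page = '<meta name="clan_name" content="Dark Mage" /><meta name="level" content="5" />';

// Greedy .+ runs to the end of the line, then backtracks to the LAST content="...".
preg_match('/clan_name.+content="([^"]+)"/', $page, $greedy);

// Lazy .+? stops at the FIRST content="..." after clan_name.
preg_match('/clan_name.+?content="([^"]+)"/', $page, $lazy);

$greedyValue = $greedy[1]; // "5"
$lazyValue = $lazy[1];     // "Dark Mage"
```

When each meta tag sits on its own line, both patterns give the same result, because `.` does not match newlines by default.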