Locating specific string and capturing data following it

开发者 https://www.devze.com 2022-12-11 03:22 Source: Internet
I built a site a long time ago, and it has since grown to 400+ pages. I now want to make the site database driven, so I need to get the data into a database without copying and pasting every page.

My site has meta tags like this (each page different):

<meta name="clan_name" content="Dark Mage" />

So what I'm doing is using cURL to read the entire HTML page into a variable as a string. I could also do it with fopen() and so on, but I don't think it matters.

I need to sift through the string to find 'Dark Mage' and store it in a variable (so I can put it into SQL).

Any ideas on the best way to find Dark Mage and store it in a variable? I was trying to use substr() and count characters forward from the e in clan_name, but that was a bust.


Just parse the page using the PHP DOM functions, specifically DOMDocument::loadHTML(). You can then walk the tree or use XPath to find the nodes you are looking for.

<?php
$doc = new DOMDocument();
$doc->loadHTML($html);
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
  // Read attributes from the current element ($data), not from the node list ($meta).
  $name = $data->getAttribute('name');
  if ($name == 'clan_name') {
    $content = $data->getAttribute('content');
    // TODO handle content for clan_name
  }
}
?>
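
The answer above also mentions XPath as an alternative to walking the tree. A minimal sketch of that route, assuming a small hypothetical sample page in $html (the sample markup and the variable name $clanName are illustrative, not from the original post):

```php
<?php
// Hypothetical sample page standing in for the HTML fetched with cURL.
$html = '<html><head><title>Clan page</title>'
      . '<meta name="clan_name" content="Dark Mage" />'
      . '</head><body><p>Welcome</p></body></html>';

// Real pages are often not valid HTML; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the content attribute of the meta element whose name is clan_name.
$nodes = $xpath->query('//meta[@name="clan_name"]/@content');

$clanName = null;
if ($nodes !== false && $nodes->length > 0) {
    $clanName = $nodes->item(0)->nodeValue;
}

echo $clanName, PHP_EOL;
```

With XPath the name test lives in the query itself, so no loop over every meta tag is needed. If the tag is missing, $clanName stays null and can be checked before writing to the database.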

EDIT: If you want to remove certain tags (such as <script>) before you parse your HTML string, try using the strip_tags() function. Something like this will keep only the meta tags:

<?php
  $html = strip_tags($html, '<meta>');
?>
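
One thing worth knowing before relying on this: strip_tags() removes the tags themselves but leaves the text that sat between them. A small sketch on a hypothetical fragment (the sample markup is an assumption for illustration) shows what survives:

```php
<?php
// Hypothetical fragment standing in for a fetched page.
$html = '<title>Clan page</title>'
      . '<meta name="clan_name" content="Dark Mage" />'
      . '<script>var x = 1;</script>';

// Keep <meta> tags and strip every other tag.
$stripped = strip_tags($html, '<meta>');

// The meta tag is kept; the title and script tags are gone,
// but their inner text remains as plain text.
echo $stripped, PHP_EOL;
```

The leftover text does not get in the way of finding the meta tag, but it does mean the result is not a string made up of meta tags alone.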


Use a regular expression like the following, with PHP's preg_match():

/<meta name="clan_name" content="([^"]+)"/

If you're not familiar with regular expressions, read on.

The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.

The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:

[^"]

means "match any character that is not a double-quote".

The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:

[^"]+

means "match one or more characters that are not double-quotes".

Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:

([^"]+)

means "match one or more characters that are not double-quotes and store them as a matched subpattern".

In PHP, preg_match() stores matches in an array that you pass by reference. The text matched by the full pattern is stored in the first element of the array, the text matched by the first subpattern in the second element, and so forth if there are additional subpatterns.

So, assuming your HTML page is in the variable "$page", the following code:

$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);

if ($found) {
    $clan_name = $matches[1];
}

should get you what you want.
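
To make the array layout described above concrete, here is a sketch run against a hypothetical one-line page (the sample markup is an assumption; the pattern is the one from this answer):

```php
<?php
// Hypothetical one-line page used only for illustration.
$page = '<head><meta name="clan_name" content="Dark Mage" /></head>';

$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if ($found === 1) {
    echo $matches[0], PHP_EOL; // text matched by the whole pattern
    echo $matches[1], PHP_EOL; // text captured by the parentheses
}
```

Here $matches[0] holds the matched tag up to the closing quote of the content attribute, and $matches[1] holds just Dark Mage. Note that the pattern expects exactly this attribute order and a single space between the attributes, so pages that format the tag differently would need a looser pattern.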


Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/
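
A caveat with this pattern is that .+ is greedy. A sketch on a hypothetical line holding two meta tags (the second tag, named rank, is invented for illustration) compares it with a lazy .+? variant, which is a suggested adjustment rather than part of the original answer:

```php
<?php
// Hypothetical line with two meta tags, to show how the quantifier matters.
$page = '<meta name="clan_name" content="Dark Mage" /><meta name="rank" content="5" />';

// Greedy .+ runs on to the last content="..." on the line.
preg_match('/clan_name.+content="([^"]+)"/', $page, $greedy);

// Lazy .+? stops at the first content="..." after clan_name.
preg_match('/clan_name.+?content="([^"]+)"/', $page, $lazy);

echo $greedy[1], PHP_EOL; // 5
echo $lazy[1], PHP_EOL;   // Dark Mage
```

When each meta tag sits on its own line the two patterns behave the same, because . does not match a newline by default. The lazy form is the safer choice when the markup may be minified onto one line.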
