How to extract h1 headings from an HTML page using regular expressions?

Developer https://www.devze.com 2023-02-11 05:13 Source: Internet
I'm still trying to get to grips with regexps and I'm considering a simple query. I'm trying to parse the homepage of my website and extract the H1 tags.

  <?php
    $string_get = file_get_contents("http://davidelks.com/");
    
    
    $replace = "$1";
    
    $matches = preg_replace ("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/", $replace, $string_get, 1);
    
    $string_construct = "Mum " . $matches .  " Dad";
    
    echo ($string_construct);
    
    ?>

However, instead of just displaying the first HTML link using the $1 token, it just pulls in the whole page. What can I try next?


This looks like something that could be done easily with a DOM parser:

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://davidelks.com/');
$h1 = $dom->getElementsByTagName('h1')->item(0);
echo $h1->textContent;

You should get:

Let's make things happen in and around Stoke-on-Trent

Note: I'm not sure if this is your site or a site you manage, but there shouldn't be more than a single <h1> tag in an HTML page (there are a couple on the homepage).
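
Since the page has more than one <h1>, it may help to see the same DOM approach collecting every heading together with its link. This is only a sketch: the inline $html string is a made-up stand-in for the fetched page, and the $headings array layout is an assumption for illustration, not something from the question.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = <<<'HTML'
<html><body>
<h1 class="title"><a href="/first">First heading</a></h1>
<p>Some text</p>
<h1 class="title"><a href="/second">Second heading</a></h1>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);              // parses an HTML string; loadHTMLFile() takes a path or URL
libxml_clear_errors();

$headings = [];
foreach ($dom->getElementsByTagName('h1') as $h1) {
    // First <a> inside the heading, if there is one.
    $link = $h1->getElementsByTagName('a')->item(0);
    $headings[] = [
        'text' => trim($h1->textContent),
        'href' => $link !== null ? $link->getAttribute('href') : null,
    ];
}

print_r($headings);
```

With the sample markup, $headings ends up with two entries, one per <h1>, each holding the heading text and the href of its link.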


The mistake is in your usage of preg_replace. It returns the entire subject string with only the matched part replaced, which is why you get the whole page back. To extract something, use preg_match instead:

<?php
 $text = file_get_contents("http://davidelks.com/");

 preg_match('#<h1 class="title"><a href="([\w\s\x21\/\-\.\£\:]*)">([^<>]*)</a></h1>#', $text, $match);

 echo "Mum " . $match[1] .  " Dad";
?>
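
To make the difference between the two functions concrete, here is a small side-by-side sketch. The $page string is a hypothetical stand-in for the downloaded homepage, and the href class is simplified to [^"]* for readability; neither comes from the original code.

```php
<?php
// Hypothetical stand-in for the downloaded page.
$page = '<html><body><h1 class="title"><a href="/news">Latest news</a></h1><p>Other content</p></body></html>';
$pattern = '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#';

// preg_replace() returns the whole subject with only the matched part substituted,
// so everything around the <h1> element is still there.
$replaced = preg_replace($pattern, '$1', $page, 1);

// preg_match() returns 1 on a match and fills $match with the captured groups.
$href = null;
$title = null;
if (preg_match($pattern, $page, $match) === 1) {
    $href = $match[1];   // first capture group: the link target
    $title = $match[2];  // second capture group: the link text
}

echo $replaced, "\n";             // <html><body>/news<p>Other content</p></body></html>
echo "Mum " . $href . " Dad\n";   // Mum /news Dad
```

Checking the return value of preg_match before reading $match avoids undefined-index notices when the pattern does not match.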

Note particularly that you can combine character classes. You don't need [A-Z]|[a-z]|[..] because you can merge them into a single [A-Za-z...] bracket list.
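
Combining the classes also changes what the group captures, which matters if you rely on $1 or $match[1]. The sketch below uses a made-up input string to compare the two styles: a repeated group keeps only its last iteration, whereas a quantifier inside the group captures the whole run.

```php
<?php
$input = 'abc123';

// Alternation of single-character classes inside a repeated group:
// the group is overwritten on every repetition, so only the last character survives.
preg_match('/^([A-Z]|[0-9]|[a-z])*$/', $input, $alternated);

// One combined class with the quantifier inside the group:
// the group captures the entire run of characters.
preg_match('/^([A-Za-z0-9]*)$/', $input, $combined);

echo $alternated[1], "\n";   // 3
echo $combined[1], "\n";     // abc123
```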

Also, use single quotes for the PHP string when the pattern contains double quotes; this saves a lot of extraneous escaping. Using an alternative delimiter such as # instead of / around the regex likewise avoids having to escape the slashes in closing tags.


It would be easier using a DOM parser. But if you want to do it with a regex, you should use the preg_match_all function in PHP:

preg_match_all("/<h1 class=\"title\"><a href=\"([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*\">([A-Z]|[0-9]|[a-z]|[\s]|[\x21]|[\/]|[\-]|[\.]|[\£]|[\:])*<\/a><\/h1>/",$string_get,$matches);
var_dump($matches);
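
The var_dump output can be hard to read, so here is a minimal sketch of iterating over the results instead. It assumes a made-up two-heading sample string and a simplified pattern (single character classes, # delimiters), and uses the PREG_SET_ORDER flag so that each match arrives as its own array.

```php
<?php
// Hypothetical sample markup with two headings.
$page = '<h1 class="title"><a href="/a">Alpha</a></h1>'
      . '<h1 class="title"><a href="/b">Beta</a></h1>';

// preg_match_all() returns the number of full matches found.
$count = preg_match_all(
    '#<h1 class="title"><a href="([^"]*)">([^<>]*)</a></h1>#',
    $page,
    $matches,
    PREG_SET_ORDER   // one array per match: [full match, href, link text]
);

foreach ($matches as $m) {
    echo $m[2], ' -> ', $m[1], "\n";
}
```

Without PREG_SET_ORDER, the default PREG_PATTERN_ORDER groups results by capture group instead, so $matches[1] would hold all hrefs and $matches[2] all link texts.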