Simple problem with regex pattern

Developers https://www.devze.com 2023-02-19 09:10 Source: Internet

Related topics: php regex

Please help me get the link and the text from this tag. The <h3 class="post-title entry-title"> element has to be part of the match, because I only want the links from that specific tag.

<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>

My work so far is:

<?php

$string = file_get_contents('http://www.domain.com');

$regex_pattern = "";

unset($matches);
preg_match_all($regex_pattern, $string, $matches);


foreach ($matches[0] as $paragraph) {
echo $paragraph;
echo "<br>";
}
?>

Thank you in advance

Don't use regex to parse HTML. It's a bad idea. Use an HTML/XML parser. Since you are using PHP, you can try using PHP Tidy or DOMDocument. It will make your life much easier.
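As a rough illustration of the parser approach, here is a minimal DOMDocument sketch. It assumes the markup looks like the snippet in the question; the `$html` and `$links` variables are made up for this example, and a real page would be loaded with file_get_contents() as in the question.

```php
<?php
// Sample markup copied from the question (assumption: the real page has the same shape).
$html = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>';

$doc = new DOMDocument();
// Real-world HTML is rarely valid, so collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($doc->getElementsByTagName('h3') as $h3) {
    // Keep only headings whose class attribute is exactly the one from the question.
    if ($h3->getAttribute('class') !== 'post-title entry-title') {
        continue;
    }
    foreach ($h3->getElementsByTagName('a') as $a) {
        $links[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
}

foreach ($links as $link) {
    echo $link['href'], ' ', $link['text'], "<br>\n";
}
```

The class check here is an exact string comparison, so a heading with extra classes or a different class order would be skipped; loosen it if the real page varies.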

Following your example, this regex will find "http://mymplogk.blogspot.com/2011/03/h_25.html" and "Text":

$regex_pattern = '/<h3[^>]+class\s*=\s*[\'"]post-title entry-title[\'"][^>]*>.*?<a[^>]+href\s*=\s*"([^"]+)"[^>]*>([^<]*)</s';

This accepts single or double quotes around the class value of the h3 tag, allows additional attributes in the h3 tag, and allows optional whitespace around the = sign. Note that the href value itself must be double-quoted. It also matches multiple times in $string, e.g.

$string = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';
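To see what the pattern captures, it can be run against that sample string roughly as follows. This is only a sketch: the PREG_SET_ORDER flag and the `$count` variable are choices made for this example, not part of the original answer.

```php
<?php
$string = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';

$regex_pattern = '/<h3[^>]+class\s*=\s*[\'"]post-title entry-title[\'"][^>]*>.*?<a[^>]+href\s*=\s*"([^"]+)"[^>]*>([^<]*)</s';

// PREG_SET_ORDER groups the captures per match:
// [0] is the full match, [1] the href value, [2] the link text.
$count = preg_match_all($regex_pattern, $string, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[1], ' => ', $match[2], "<br>\n";
}
```

With the default flag (PREG_PATTERN_ORDER) the same data is available as $matches[1] for all links and $matches[2] for all texts, which is why looping over $matches[0] as in the question prints the whole matched fragment rather than just the link.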

I would recommend using DOMDocument and XPath to extract the URL from the page instead of a regular expression.

This tutorial is a good starting point for using XPath and the DOM: http://www.merchantos.com/blog/makebeta/php/scraping-links-with-php#php_dom

Edit: If you use the Firebug add-on in Firefox, you can inspect the element on the page and copy its XPath.
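A minimal sketch of the XPath route, assuming the page contains headings like the one in the question. The sample `$html`, the `$results` array, and the XPath expression are illustrative guesses at the page structure, not something taken from the linked tutorial.

```php
<?php
// Sample markup (assumption: two matching headings and one unrelated paragraph).
$html = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings about imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select every <a> with an href inside an <h3> that has exactly this class value.
$nodes = $xpath->query('//h3[@class="post-title entry-title"]//a[@href]');

$results = [];
foreach ($nodes as $a) {
    $results[] = [$a->getAttribute('href'), trim($a->textContent)];
}

foreach ($results as $pair) {
    echo $pair[0], ' => ', $pair[1], "<br>\n";
}
```

An XPath copied from a browser inspector is usually an absolute path to one element, so it tends to need generalising into an attribute-based expression like the one above before it matches every post title.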

The regex: