Simple problem with regex pattern

Developer https://www.devze.com 2023-02-19 09:10 Source: web
Please help me get the link and text from this tag. The <h3 class="post-title entry-title"> has to be part of the pattern because I only want the links inside that specific tag.

<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>

My work so far is:

<?php

$string = file_get_contents('http://www.domain.com');

$regex_pattern = "";

unset($matches);
preg_match_all($regex_pattern, $string, $matches);


foreach ($matches[0] as $paragraph) {
    echo $paragraph;
    echo "<br>";
}
?> 

Thank you in advance


Don't use regex to parse HTML. It's a bad idea. Use an HTML/XML parser. Since you are using PHP, you can try using PHP Tidy or DOMDocument. It will make your life much easier.
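As a rough illustration of the parser route, here is a minimal sketch using DOMDocument on the markup from the question. The helper name extractPostLinks() and the second, non-matching h3 are assumptions made for the example; in the asker's script $html would come from file_get_contents().

```php
<?php
// Sample markup: the first <h3> is from the question, the second is an
// assumed non-matching heading added to show the filtering.
$html = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<h3 class="other"><a href="http://example.com/skip">Skipped</a></h3>';

// extractPostLinks() is a hypothetical helper, not part of any library.
function extractPostLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real pages are rarely valid HTML; keep parser warnings out of the output.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('h3') as $h3) {
        // Exact comparison against the class string used in the question.
        if ($h3->getAttribute('class') !== 'post-title entry-title') {
            continue;
        }
        foreach ($h3->getElementsByTagName('a') as $a) {
            $links[] = [
                'href' => $a->getAttribute('href'),
                'text' => trim($a->textContent),
            ];
        }
    }
    return $links;
}

$links = extractPostLinks($html);
foreach ($links as $link) {
    echo $link['href'], ' ', $link['text'], "<br>\n";
}
```

The exact class comparison is deliberately strict; a page that writes the classes in another order or adds a third class would need a looser check.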


Following your example, this regex will find "http://mymplogk.blogspot.com/2011/03/h_25.html" and "Text":

$regex_pattern = '/<h3[^>]+class\s*=\s*[\'"]post-title entry-title[\'"][^>]*>.*?<a[^>]+href\s*=\s*"([^"]+)"[^>]*>([^<]*)</s';

This accepts single or double quotes around the class value, allows additional attributes in the h3 tag, and allows optional whitespace around the = sign. The href value itself must be in double quotes. The pattern also matches multiple times in $string, e.g.

$string = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';
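To see what the pattern captures, it can be run against that sample with preg_match_all(). This is only a sketch: the PREG_SET_ORDER flag and the $links array are choices made for the example, not something the answer prescribes.

```php
<?php
$regex_pattern = '/<h3[^>]+class\s*=\s*[\'"]post-title entry-title[\'"][^>]*>.*?<a[^>]+href\s*=\s*"([^"]+)"[^>]*>([^<]*)</s';

$string = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<p>doot</p>
<h3 class=\'post-title entry-title\'>
<a href="http://www.google.com/">More Text</a>
</h3>';

// PREG_SET_ORDER groups the results per match:
// [0] is the full match, [1] the href value, [2] the link text.
preg_match_all($regex_pattern, $string, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $match) {
    $links[] = ['href' => $match[1], 'text' => $match[2]];
    echo $match[1], ' => ', $match[2], "<br>\n";
}
```

With the default flag (PREG_PATTERN_ORDER) the same data is available as $matches[1] for the links and $matches[2] for the texts.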


I would recommend using DOMDocument and XPath to extract the URL from the page instead of a regexp.

This tutorial is a good starting point for XPath and DOM: http://www.merchantos.com/blog/makebeta/php/scraping-links-with-php#php_dom

Edit: If you use the Firebug add-on in Firefox, you can inspect the element on the page and copy its XPath.
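A short sketch of the XPath approach, assuming the markup from the question. The XPath expression and the extra non-matching h3 are written for this example and would need adjusting to the real page.

```php
<?php
// First <h3> is from the question; the second is an assumed distractor.
$html = '<h3 class="post-title entry-title">
<a href="http://mymplogk.blogspot.com/2011/03/h_25.html">Text</a>
</h3>
<h3 class="sidebar"><a href="http://example.com/other">Other</a></h3>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // silence warnings about imperfect HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Every <a> with an href that sits inside an <h3> carrying exactly this class string.
$nodes = $xpath->query('//h3[@class="post-title entry-title"]//a[@href]');

$results = [];
foreach ($nodes as $a) {
    $results[] = ['href' => $a->getAttribute('href'), 'text' => trim($a->textContent)];
}

foreach ($results as $row) {
    echo $row['href'], ' ', $row['text'], "<br>\n";
}
```

An XPath copied from a browser inspector is usually an absolute path to one element, so a class-based expression like the one above tends to survive layout changes better.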


The regex:

(?<=href=").+(?=")

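It should match whatever sits between the double quotes of an href attribute, on any tag and not only inside the h3, provided no other double quote follows on the same line.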

You can test this in RegexStorm
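A quick check of this pattern in PHP, as a sketch. The title attribute on the sample line and the [^"]+ variant are assumptions added to show how the greedy .+ behaves when a second quoted attribute follows.

```php
<?php
// Sample line: the href is from the question, the title attribute is assumed.
$line = '<a href="http://mymplogk.blogspot.com/2011/03/h_25.html" title="x">Text</a>';

// Pattern from the answer: greedy .+ runs up to the last double quote on the line.
preg_match('/(?<=href=").+(?=")/', $line, $greedy);

// Assumed variant: [^"]+ cannot cross a quote, so it stops at the end of the href value.
preg_match('/(?<=href=")[^"]+(?=")/', $line, $strict);

echo $greedy[0], "\n";
echo $strict[0], "\n";
```

The first echo prints the URL plus the start of the title attribute; the second prints only the URL.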
