I have an HTML document with n "a href" tags, each with a different target URL and different content between the tags.
For example:
<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>
As you can see, the target URLs vary between "d?", "d.", "d/d?" and "d/d.", and between the "a" tags there can be any kind of HTML that the W3C allows.
I need a regex that gives me all links that have one of these combinations in the target URL ("d?", "d.", "d/d?", "d/d.") and have "Lorem" or "test" anywhere between the "a" tags, including inside nested HTML tags.
My Regex so far:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>.*?</a>)
I tried to include the lorem / test part as follows:
href=[\"\']([^>]*?/[d]+[.|\?][^"]*?[\"\'][^>]*[/]?>(lorem|test)+</a>)
but this only works if I put a ".*?" before and after the (lorem|test), and that would be too greedy.
If there is an easier way with SimpleXML or any other DOM parser, please let me know. Otherwise I would appreciate any help with the regex.
Thanks!
Here you go:
$html = array
(
    '<a href="http://www.example.com/d?12345abc" name="example"><span ....>lorem ipsum</span></a>',
    '<a href="http://www.example.com/d/d?abc1234" name="example2"><span ....>example</span></a>',
    '<a href="http://www.example.com/d.1234" name="example3">example3</a>',
    '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>',
    '<a href="http://www.example.com/without_d/1234" name="example3">without a d as target url</a>',
);

$html = implode("\n", $html);
$result = array();

// Select only the anchors whose text (including nested tags) contains "lorem" or "test".
$anchors = phXML($html, '//a[contains(., "lorem") or contains(., "test")]');

foreach ($anchors as $anchor)
{
    // Keep the link only if a path segment is "d" followed by "." or "?".
    // The leading slash stops hosts such as "cloud.example.com" from matching.
    if (preg_match('~/d[.?]~', strval($anchor['href'])) > 0)
    {
        $result[] = strval($anchor['href']);
    }
}

echo '<pre>';
print_r($result);
echo '</pre>';
Output:
Array
(
[0] => http://www.example.com/d?12345abc
[1] => http://www.example.com/d/d.1234
)
The phXML() function is based on my DOMDocument / SimpleXML wrapper, and goes as follows:
function phXML($xml, $xpath = null)
{
    if (extension_loaded('libxml') === true)
    {
        libxml_use_internal_errors(true);

        if ((extension_loaded('dom') === true) && (extension_loaded('SimpleXML') === true))
        {
            if (is_string($xml) === true)
            {
                $dom = new DOMDocument();

                if (@$dom->loadHTML($xml) === true)
                {
                    return phXML(@simplexml_import_dom($dom), $xpath);
                }
            }
            else if ((is_object($xml) === true) && (strcmp('SimpleXMLElement', get_class($xml)) === 0))
            {
                if (isset($xpath) === true)
                {
                    $xml = $xml->xpath($xpath);
                }

                return $xml;
            }
        }
    }

    return false;
}
I'm too lazy not to use this function right now, but I'm sure you can get rid of it if you need to.
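For anyone who wants to drop the wrapper, the same filtering can be sketched with DOMDocument and DOMXPath directly. This is only a minimal sketch: the function name filterAnchorsWithXPath is made up for illustration, the elided attributes from the sample markup are left out, the stray </img> is dropped, and the href check uses the slightly stricter pattern /d[.?] (a slash before the d).

```php
<?php
// Sample markup modelled on the question (elided attributes omitted).
$html = <<<HTML
<a href="http://www.example.com/d?12345abc" name="example"><span>lorem ipsum</span></a>
<a href="http://www.example.com/d/d?abc1234" name="example2"><span>example</span></a>
<a href="http://www.example.com/d.1234" name="example3">example3</a>
<a href="http://www.example.com/d/d.1234" name="example4"><img src="x.png" alt="">test</a>
<a href="http://www.example.com/without_d/1234" name="example5">without a d as target url</a>
HTML;

// Hypothetical helper: returns the href of every anchor whose text contains
// "lorem" or "test" and whose URL has a "/d." or "/d?" in it.
function filterAnchorsWithXPath(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = [];

    // contains(., "...") looks at the whole text of the anchor, nested tags included.
    foreach ($xpath->query('//a[contains(., "lorem") or contains(., "test")]') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('~/d[.?]~', $href) === 1) {
            $result[] = $href;
        }
    }

    return $result;
}

print_r(filterAnchorsWithXPath($html));
```

Note that contains() in XPath 1.0 is case-sensitive, so "Lorem" would not be found by this query.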
Here is a Regular Expression which works:
$search = '/<a\s[^>]*href=["\'](?:http:\/\/)?(?:[a-z0-9-]+(?:\.[a-z0-9-]+)*)\/(?:d\/)?d[?.].*?>.*?(?:lorem|test)+.*?<\/a>/i';
$matches = array();
preg_match_all($search, $html, $matches);
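The only caveat is that it relies on there being a newline character between each "a" tag (without the s modifier, "." does not match a newline, so ".*?" cannot run from one anchor into the next). Otherwise it will match something like: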
<a href="http://www.example.com/d.1234" name="example3">example3</a><a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>
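One possible way around that limitation, if a regex has to be used, is to forbid the content part from running past a closing </a>. The sketch below does this with a negative lookahead before every character; the function name matchAnchorsWithRegex is hypothetical, and the pattern is a variation written for this example, not the one above. It can still be fooled, for instance by "test" appearing inside an attribute of a nested tag.

```php
<?php
// Hypothetical helper: returns the URLs of anchors whose href contains
// "/d." or "/d?" and whose content contains "lorem" or "test" (any case).
function matchAnchorsWithRegex(string $html): array
{
    // (?:(?!</a>).)* consumes characters one at a time, but never steps
    // onto a closing </a>, so a match cannot spill into the next anchor.
    $pattern = '~<a\s[^>]*?href=(["\'])([^"\']*/d[?.][^"\']*)\1[^>]*>'
             . '(?:(?!</a>).)*?(?:lorem|test)(?:(?!</a>).)*</a>~is';

    preg_match_all($pattern, $html, $matches);

    // Group 2 holds the URL between the quotes.
    return $matches[2];
}

// The two anchors on one line, as in the problem case described above.
$oneLine = '<a href="http://www.example.com/d.1234" name="example3">example3</a>'
         . '<a href="http://www.example.com/d/d.1234" name="example4"><img ...>test</img></a>';

print_r(matchAnchorsWithRegex($oneLine));
```

With the lookahead in place, the first anchor is rejected on its own and only the second URL is returned, with or without newlines between the tags.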
Use an HTML parser. There are lots of reasons that Regex is absolutely not the solution for parsing HTML.
There's a good list of them here: Robust and Mature HTML Parser for PHP
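As a rough illustration of the parser route, here is a sketch that uses PHP's bundled DOM extension and compares the anchor text case-insensitively, since the question asks for "Lorem" while the sample markup contains "lorem". The function name filterAnchorsIgnoringCase and its parameters are assumptions made for this example.

```php
<?php
// Hypothetical helper: returns the href of every anchor whose URL contains
// "/d." or "/d?" and whose text contains one of $words, ignoring case.
function filterAnchorsIgnoringCase(string $html, array $words): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (preg_match('~/d[.?]~', $href) !== 1) {
            continue; // URL does not have the wanted "d" segment
        }

        // textContent is the anchor's text with all nested tags flattened.
        foreach ($words as $word) {
            if (stripos($anchor->textContent, $word) !== false) {
                $result[] = $href;
                break;
            }
        }
    }

    return $result;
}

$sample = '<a href="http://www.example.com/d?1"><span>Lorem ipsum</span></a>'
        . '<a href="http://www.example.com/x?1">test</a>'
        . '<a href="http://www.example.com/d/d.2"><b>a TEST</b></a>';

print_r(filterAnchorsIgnoringCase($sample, ['lorem', 'test']));
```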
This will print only the first and fourth links, because they are the only ones that meet both conditions.
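// $string holds the HTML to search.
// Group 1 = href value, group 2 = remaining attributes, group 3 = anchor content.
preg_match_all('#href="(.*?)"(.*?)>(.*?)</a>#is', $string, $matches);
$count = count($matches[0]);
unset($matches[0], $matches[2]);

for ($i = 0; $i < $count; $i++) {
    if (
        // The URL must contain "/d." or "/d?" ...
        preg_match('#/d[.?]#', $matches[1][$i]) === 1
        &&
        // ... and the content must contain "lorem" or "test" (case-insensitive).
        preg_match('#(lorem|test)#is', $matches[3][$i]) === 1
    ) {
        echo $matches[1][$i], PHP_EOL;
    }
}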