I want to pick all directory URLs from this site.
I used preg_match, but it retrieves every URL on the page, including unnecessary links.
Here is my code.
How do I get only the submission links from that site?
I tried running this and it seems to work; I only changed the regex:
<?php
// Walk the paginated directory list (pages 0 through 25).
for ($i = 0; $i <= 25; $i++) {
    $site_url = "http://www.directorymaximizer.com/index.php?pageNum_directory_list=$i";
    $html = file_get_contents($site_url);
    if ($html === false) {
        continue; // Skip pages that fail to download.
    }
    // The wanted URLs sit between a closing "-->" and the next "<!--".
    $regex = '@-->(https?://[^<]*)<!--@';
    preg_match_all($regex, $html, $matches, PREG_PATTERN_ORDER);
    // $matches[1] holds the captured URLs; $matches[0] would also include the comment markers.
    foreach (array_unique($matches[1]) as $url) {
        $url = trim($url);
        if ($url !== "") {
            echo $url;
            echo "<br />\n";
        }
    }
}
You'll want an HTML parser for that. HTML is not a regular language, so regular expressions don't handle it well in general.
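As a rough sketch of the parser route, the snippet below uses PHP's bundled DOMDocument and DOMXPath. It assumes the layout another answer in this thread describes, where each wanted URL is a text node sitting right after a comment that ends in target="_blank">. The function name extract_commented_urls and the sample markup are made up for illustration, and none of this has been run against the real site.

```php
<?php
// Hypothetical helper: return the text that directly follows a comment
// ending in target="_blank"> (the layout assumed for this site).
function extract_commented_urls(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $urls = [];
    foreach ($xpath->query('//comment()') as $comment) {
        $next = $comment->nextSibling;
        // Keep only text nodes that follow a commented-out opening anchor.
        if ($next instanceof DOMText
            && preg_match('/target="_blank">\s*$/', $comment->nodeValue)) {
            $url = trim($next->nodeValue);
            if ($url !== '') {
                $urls[] = $url;
            }
        }
    }
    return array_values(array_unique($urls));
}

// Made-up sample standing in for one fetched page.
$html = '<html><body><ul>'
      . '<li><!--<a href="http://example.com/submit.php" target="_blank">-->http://example.com/submit.php<!--</a>--></li>'
      . '<li><a href="http://ads.example.net/">ad</a></li>'
      . '<li><!--<a href="http://example.org/add" target="_blank">-->http://example.org/add<!--</a>--></li>'
      . '</ul></body></html>';

print_r(extract_commented_urls($html));
```

Because the anchors themselves are commented out, getElementsByTagName('a') would not see them; that is why the sketch walks comment nodes instead.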
To use a regular expression for this you need some consistent delimiters. Thankfully, the URLs you want, and only those you want, seem to look like this in the source:
target="_blank">-->the url is here<!--</a>-->
Meaning the regular expression you'd want is:
@target="_blank">-->(?P<url>.+?)<!--</a>-->@
The matches for the first capture group, indexed under "url", will contain the URLs. Why the named capture group? It just makes it easier to see what you were doing when you look back at your code.
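Here is a minimal sketch of that pattern used with preg_match_all. The sample markup is invented to mimic the delimiters shown above; the real page may differ, so treat it as a starting point only.

```php
<?php
// Made-up sample standing in for one fetched page.
$html = '<td><!--<a href="http://example.com/submit.php" target="_blank">-->http://example.com/submit.php<!--</a>--></td>'
      . '<td><a href="http://ads.example.net/">ad</a></td>'
      . '<td><!--<a href="http://example.org/add" target="_blank">-->http://example.org/add<!--</a>--></td>';

// Using @ as the delimiter means the slash in </a> needs no escaping.
$regex = '@target="_blank">-->(?P<url>.+?)<!--</a>-->@';
preg_match_all($regex, $html, $matches);

// $matches['url'] holds only the captured text, without the delimiters.
$urls = array_values(array_unique($matches['url']));
foreach ($urls as $url) {
    echo $url, "\n";
}
```

The ordinary link to ads.example.net is skipped because it is not wrapped in the comment markers.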
I have a nifty little tool for you to build and test regular expressions with: go check out RegExr at gskinner.com.
Additionally, I believe this is the pattern you're looking for. For an anchor to be matched, it must have a full URL including the domain. The match array will contain the URL, domain, and path. See below.
preg_match('@(?P<url>http://(?P<domain>[a-z0-9.-]+\.\w+)(?P<path>[/?\w.=&-]+)?)[\s\w="]+>@i', $site, $anchors);
$url = $anchors['url'];
$domain = $anchors['domain'];
$path = $anchors['path'];
Let me know how it goes. I did not test this, so I apologize if there is an error.