
preg_match: pick URLs from another site

Developer https://www.devze.com 2022-12-18 21:03 Source: Internet

I want to pick all directory URLs from this site.

I used preg_match, but it retrieves every URL on the site, including unnecessary links.

Rendering, here is my code.

How do I get all the submission links from that site?


I tried running this and it seems to work; I only changed the regex:

<?php
// Walk the paginated directory list (pages 0 through 25).
for ($i = 0; $i <= 25; $i++) {
    $site_url = "http://www.directorymaximizer.com/index.php?pageNum_directory_list=$i";
    $html = file_get_contents($site_url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }

    // The wanted URLs sit between "-->" and "<!--" in the page source.
    $regex = '@-->(https?://[^<]*)<!--@';
    preg_match_all($regex, $html, $matches, PREG_PATTERN_ORDER);

    // $matches[1] holds the captured URLs; $matches[0] would also include the delimiters.
    foreach (array_unique($matches[1]) as $url) {
        echo $url;
        echo "<br />\n";
    }
}


You'll want an HTML parser for that. HTML is not a regular language, so regular expressions don't handle it well.
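As a rough illustration of the parser route, here is a minimal sketch using PHP's built-in DOMDocument. The sample markup and the example.com / example.org URLs are made up for the demonstration; the real page may be structured differently, so treat this as a starting point rather than a drop-in solution.

```php
<?php
// Hypothetical sample markup; the real page structure may differ.
$html = '<ul>'
      . '<li><a href="http://example.com/submit">Example</a></li>'
      . '<li><a href="/about">About</a></li>'
      . '<li><a href="https://example.org/add-url.php">Another</a></li>'
      . '</ul>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$urls = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // Keep only absolute http(s) links; relative links are skipped.
    if (preg_match('@^https?://@i', $href)) {
        $urls[] = $href;
    }
}
$urls = array_values(array_unique($urls));

foreach ($urls as $url) {
    echo $url, "\n";
}
```

To run this against a live page, the `$html` string would be replaced by the result of `file_get_contents()` on the page URL.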


To use a regular expression for this you need some consistent delimiters. Thankfully, the URLs you want - and only those you want - seem to look like this in the source:

target="_blank">-->the url is here<!--</a>-->

Meaning the regular expression you'd want is:

@target="_blank">-->(?P<url>.+?)<!--</a>-->@

Where matches from the first capture group, indexed under "url", will contain the - surprise - URLs. Why the named capture group? Just seems easier to figure out what it is you're doing when you look back at your code.
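To show how the named group is read back, here is a small sketch that runs the pattern above against a hand-written fragment. The fragment is an assumption that imitates the delimiters described in this answer (the commented-out opening tag and the example.* URLs are invented), not a copy of the real page.

```php
<?php
// Hypothetical fragment imitating the markup described above.
$html = '<!--<a href="http://example.com/" target="_blank">-->http://example.com/<!--</a>-->' . "\n"
      . '<a href="http://ads.example.net/">ad</a>' . "\n"
      . '<!--<a href="http://example.org/add" target="_blank">-->http://example.org/add<!--</a>-->';

$regex = '@target="_blank">-->(?P<url>.+?)<!--</a>-->@';
preg_match_all($regex, $html, $matches);

// The captures are available under the group name as well as under index 1.
$urls = array_unique($matches['url']);
print_r($urls);
```

The plain anchor on the second line is ignored because it lacks the comment delimiters, which is the point of anchoring the pattern on them.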


I have a nifty little tool for you to build regular expressions with.

Go check out RegExr at gskinner.com.

Additionally, I believe this is the pattern you're looking for. For an anchor to be matched, it must have a full URL including the domain. The match captures the URL, domain, and path into an array. See below.

preg_match('/(?P<url>http:\/\/(?P<domain>[a-z0-9\/]+\.[\w]+)(?P<path>[\/\?\w\.=\&]+)?)[\s\w="]+>/', $site, $anchors);

$url = $anchors['url'];
$domain = $anchors['domain'];
$path = $anchors['path'];

Let me know how it goes. I did not test this, so I apologize if there is an error.
