Is it possible to build this type of program in PHP?

Developer (https://www.devze.com), 2023-02-02 12:31. Source: Internet

I want to build a QA program that crawls all the pages of a site (every file under a specified domain name) and returns all external links on the site that don't open in a new window (that is, <a> tags that lack the target="_blank" attribute).

I can write PHP or JavaScript that opens external links in new windows, or that reports all problem links on a single page (the same page the script is in). What I want is a QA tool that searches every page of a website and reports back what it finds.

This "spidering" is what I have no idea how to do, and I'm not sure it's even possible with a language like PHP. If it is possible, how can I go about it?


Yes, it is. You can use a function like fopen/fread or even file_get_contents to read the HTML of a given URL into a string, then use DOMDocument::loadHTML to parse it and DOMXPath to get a list of all <a> elements and their attributes (target, href).
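
A rough sketch of that parsing step might look like the following. The function name findExternalLinksWithoutBlank and the $siteHost parameter are made up for illustration, and the host comparison is deliberately naive (it treats www.example.com and example.com as different sites), so take it as a starting point rather than a finished check.

```php
<?php
// Hypothetical helper: returns the href of every external link in $html
// that does not have target="_blank". $siteHost is the host being audited.
function findExternalLinksWithoutBlank(string $html, string $siteHost): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $problems = [];
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // Links without a host (relative, mailto:, etc.) or on our own host are skipped.
        if (!is_string($host) || strcasecmp($host, $siteHost) === 0) {
            continue;
        }
        if (strcasecmp(trim($a->getAttribute('target')), '_blank') !== 0) {
            $problems[] = $href;
        }
    }
    return $problems;
}
```

You would feed it the string you got from file_get_contents($url) and the host of the site you are checking.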


Yes, it's very much possible to do this in PHP.

Try using cURL to fetch the page and a regex, more specifically the preg_match_all function, to filter the links.

More on cURL here: PHP: cURL - Manual
More on regex here: PHP: preg_match_all - Manual
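
As a minimal sketch of this approach, assuming the curl extension is installed: fetchPage and extractHrefs are hypothetical names, and the pattern is only one plausible way to pull href values out of <a> tags. It will miss unquoted attributes and can be fooled by things like data-href, so treat the regex part as illustrative only.

```php
<?php
// Hypothetical helper: fetch a page body with cURL, or null on failure.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of echoing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);
    return is_string($body) ? $body : null;
}

// Hypothetical helper: collect quoted href values from <a> tags with a regex.
function extractHrefs(string $html): array
{
    preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return $matches[2]; // second capture group holds the URL text
}
```

Usage would be something like extractHrefs(fetchPage('http://example.com/') ?? ''), where the URL is a placeholder.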


Regexes are likely to fail or turn up false positives. Use PHP's DOMDocument class and/or XPath to find links on a given page.

http://us.php.net/manual/en/book.dom.php
http://php.net/manual/en/class.domxpath.php


http://www.phpclasses.org/package/5439-PHP-Crawl-a-site-and-retrieve-the-the-URL-of-all-links.html provides a class to crawl (spider) a site and retrieve the URLs of all links. You can modify the script to check whether each page is valid using curl or file_get_contents (as mentioned above).
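
If you would rather roll your own, the spidering itself is basically a queue of URLs plus a "seen" set. The sketch below is not the phpclasses package; crawlSite, $fetch and $maxPages are names invented here. It assumes an absolute start URL, only follows absolute and root-relative internal links (other relative paths are skipped), and leaves the actual download to whatever callable you pass in.

```php
<?php
// Hypothetical crawler sketch. $fetch is any callable mapping a URL to its HTML (or null).
// Returns [pageUrl => [external hrefs lacking target="_blank", ...], ...].
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $start = parse_url($startUrl); // assumed to contain 'scheme' and 'host'
    $origin = $start['scheme'] . '://' . $start['host'];
    $queue = [$startUrl];
    $seen = [$startUrl => true];
    $report = [];
    $visited = 0;

    while ($queue !== [] && $visited < $maxPages) {
        $pageUrl = array_shift($queue);
        $visited++;
        $html = $fetch($pageUrl);
        if (!is_string($html) || $html === '') {
            continue;
        }
        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true); // silence malformed-HTML warnings
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = trim($a->getAttribute('href'));
            if ($href === '' || $href[0] === '#') {
                continue;
            }
            $host = parse_url($href, PHP_URL_HOST);
            if (is_string($host) && strcasecmp($host, $start['host']) !== 0) {
                // External link: report it unless it opens in a new window.
                if (strcasecmp(trim($a->getAttribute('target')), '_blank') !== 0) {
                    $report[$pageUrl][] = $href;
                }
                continue;
            }
            if (is_string($host)) {
                // Internal absolute link; add a scheme to protocol-relative URLs.
                $next = parse_url($href, PHP_URL_SCHEME) ? $href : $start['scheme'] . ':' . $href;
            } elseif ($href[0] === '/') {
                $next = $origin . $href; // root-relative link
            } else {
                continue; // mailto:, javascript:, and other relative paths are skipped here
            }
            $next = explode('#', $next, 2)[0]; // ignore fragments
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $queue[] = $next;
            }
        }
    }
    return $report;
}
```

Passing 'file_get_contents' or a small cURL wrapper as $fetch should be enough for a first run; the $maxPages cap is there so a large site does not keep the script running indefinitely.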
