I want to build a QA program that will crawl all the pages of a site (all files under a specified domain name) and return every external link on the site that doesn't open in a new window (i.e. anchors that don't have the target="_blank" attribute).
I can write a PHP or JavaScript script to open external links in new windows, or to report all problem links on a single page (the same page the script lives in), but what I want is for the QA tool to go through every page of a website and report back what it finds.
This "spidering" is what I have no idea how to do, and I'm not sure it's even possible with a language like PHP. If it is possible, how can I go about it?
Yes, it is. You can use a function like fopen/fread, or simply file_get_contents, to read the HTML of a given URL into a string, then use DOMDocument::loadHTML to parse it and DOMXPath to get a list of all <a> elements and their attributes (target, href).
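For example, here is a minimal sketch of that approach (the URL is just a placeholder): fetch one page, parse it, and list every anchor's href and target.

    <?php
    // Read a page into a string, parse it with DOMDocument, and list
    // every <a> element's href and target attributes.
    $url  = 'https://example.com/some-page.html';
    $html = file_get_contents($url);

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//a[@href]') as $link) {
        $href   = $link->getAttribute('href');
        $target = $link->getAttribute('target'); // empty string when not set
        echo $href . ' => target="' . $target . '"' . PHP_EOL;
    }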
Yes, it's very much possible to do with PHP. Try using cURL to fetch each page, then a regex (more specifically the preg_match_all function) to filter out the links.
More on cURL here: PHP: cURL - Manual. More on regex here: PHP: preg_match_all - Manual.
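A rough sketch of that cURL + preg_match_all combination (the URL and the pattern are illustrative only; as the next answer points out, regex link extraction is fragile):

    <?php
    // Fetch the page body with cURL.
    $ch = curl_init('https://example.com/');
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    curl_close($ch);

    // Naive pattern: capture the href value of each <a> tag.
    preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\'][^>]*>/i', $html, $matches);

    foreach ($matches[1] as $href) {
        echo $href . PHP_EOL;
    }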
Regexes are likely to fail or turn up false positives. Use PHP's DOMDocument class and/or XPath to find the links on a given page.
http://us.php.net/manual/en/book.dom.php http://php.net/manual/en/class.domxpath.php
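Along those lines, a sketch of the DOM/XPath route applied to the original QA check (the host name and URL below are assumptions): report external links that lack target="_blank".

    <?php
    // Report external links on one page that won't open in a new window.
    $siteHost = 'example.com';
    $html     = file_get_contents('https://example.com/some-page.html');

    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Anchors that have an href but no target="_blank".
    foreach ($xpath->query('//a[@href][not(@target="_blank")]') as $link) {
        $href = $link->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        // A link is external when it points at a different host.
        if (is_string($host) && $host !== $siteHost) {
            echo 'Problem link: ' . $href . PHP_EOL;
        }
    }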
http://www.phpclasses.org/package/5439-PHP-Crawl-a-site-and-retrieve-the-the-URL-of-all-links.html provides a class to crawl/spider a site and retrieve the URLs of all its links. You can modify the script to check whether each page is valid using cURL or file_get_contents (as mentioned above).
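If you would rather roll your own than use that class, here is a very small same-domain crawler sketch along the same lines (the start URL and the naive relative-link handling are assumptions, and this is not the linked package):

    <?php
    // Breadth-first crawl of one domain, reporting external links that
    // do not have target="_blank" on each page visited.
    $startUrl = 'https://example.com/';
    $siteHost = parse_url($startUrl, PHP_URL_HOST);

    $queue   = [$startUrl];
    $visited = [];

    while ($queue) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = @file_get_contents($url);
        if ($html === false) {
            continue;
        }

        $doc = new DOMDocument();
        libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();

        $xpath = new DOMXPath($doc);
        foreach ($xpath->query('//a[@href]') as $link) {
            $href   = $link->getAttribute('href');
            $host   = parse_url($href, PHP_URL_HOST);
            $scheme = parse_url($href, PHP_URL_SCHEME);

            // Skip mailto:, javascript:, and other non-HTTP schemes.
            if ($scheme !== null && !in_array($scheme, ['http', 'https'], true)) {
                continue;
            }

            if ($host === null || $host === $siteHost) {
                // Internal link: very naive resolution of relative paths.
                $next = ($host === null)
                    ? rtrim($startUrl, '/') . '/' . ltrim($href, '/')
                    : $href;
                if (!isset($visited[$next])) {
                    $queue[] = $next;
                }
            } elseif ($link->getAttribute('target') !== '_blank') {
                // External link that will not open in a new window.
                echo "$url : $href\n";
            }
        }
    }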