I am using simple html dom to find links on a certain page using:
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
This finds all the links on the page. However, I also want to follow each found link and find the links on those pages too, recursively, for example down to level 5.
Any idea how to go about this?
Use a recursive function and keep track of the depth:
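function findLinks($url, $depth, $maxDepth) {
    // Fetch $url and parse it (file_get_html() is Simple HTML DOM's loader)
    $html = file_get_html($url);
    // ...
    // Only follow links if the page loaded and the depth limit allows it
    if ($html && $depth <= $maxDepth) {
        foreach ($html->find('a') as $element) {
            findLinks($element->href, $depth + 1, $maxDepth);
        }
    }
}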
And you would start by calling something like findLinks($rootUrl, 1, 5).
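One thing the recursive approach tends to need in practice is a record of pages already visited, since pages often link back to each other. The following is only a sketch under some assumptions: it uses PHP's built-in DOMDocument instead of Simple HTML DOM, and the names extractLinks, crawl, $fetch and $visited are hypothetical. The $fetch callable is assumed to return the page's HTML as a string, or null on failure, and relative links are not resolved here.

```php
<?php
// Hypothetical helper: collect the href values of all <a> tags in an HTML string.
function extractLinks(string $html): array {
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Suppress warnings triggered by imperfect real-world markup.
    @$doc->loadHTML($html);
    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '') {
            $links[] = $href;
        }
    }
    return $links;
}

// Depth-limited recursive crawl. $visited maps each URL to the depth at
// which it was first reached, so no page is fetched twice.
function crawl(string $url, int $depth, int $maxDepth, callable $fetch, array &$visited): void {
    if ($depth > $maxDepth || isset($visited[$url])) {
        return;
    }
    $visited[$url] = $depth;
    $html = $fetch($url); // assumed to return a string, or null on failure
    if ($html === null) {
        return;
    }
    foreach (extractLinks($html) as $href) {
        crawl($href, $depth + 1, $maxDepth, $fetch, $visited);
    }
}
```

With a $fetch closure that wraps something like file_get_contents, a call such as crawl($rootUrl, 1, 5, $fetch, $visited) would correspond to the call above. Because the traversal is depth-first, a page first reached through a long path is recorded at that deeper level.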
I needed a similar feature in the past. One option is to use MySQL to store your links.
In my case I had a todo table and a pages table. Seed the todo table with the URLs you want to spider.
For each page, I would extract the information I needed (plaintext and title) and store it in the pages table. Then I would loop through the page's links and add them to the todo table. The last step was to remove the current page from the todo table, then repeat with the next URL.
loop: grab a url from todo
{
    get current page title and plaintext, store them in pages table
    loop through links, add found links to todo table
    remove current page from todo
}
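The loop above can be sketched in PHP roughly as follows. This is an illustration under stated assumptions, not the original implementation: plain arrays stand in for the MySQL todo and pages tables, parsing uses PHP's built-in DOMDocument, and the names spider, $fetch, $seen and $maxPages are hypothetical. The $fetch callable is assumed to return the HTML as a string, or null on failure.

```php
<?php
// Queue-based spider. $todo and $pages are in-memory stand-ins for the
// todo and pages tables described above.
function spider(array $seedUrls, callable $fetch, int $maxPages = 100): array {
    $todo  = array_values(array_unique($seedUrls)); // seed the todo list
    $seen  = array_fill_keys($todo, true);          // URLs ever queued
    $pages = [];
    while ($todo !== [] && count($pages) < $maxPages) {
        // Grab a URL from todo; array_shift also removes it from the list.
        $url  = array_shift($todo);
        $html = $fetch($url);
        if ($html === null || $html === '') {
            continue;
        }
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings from imperfect markup
        // Store the current page's title and plaintext.
        $titleNode = $doc->getElementsByTagName('title')->item(0);
        $root      = $doc->documentElement;
        $pages[$url] = [
            'title'     => $titleNode ? trim($titleNode->textContent) : '',
            'plaintext' => $root ? trim($root->textContent) : '',
        ];
        // Add newly found links to todo, skipping ones already queued.
        foreach ($doc->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            if ($href !== '' && !isset($seen[$href])) {
                $seen[$href] = true;
                $todo[] = $href;
            }
        }
    }
    return $pages;
}
```

Since the list is processed first in, first out, pages are visited level by level. With real tables, the $seen check would typically become a uniqueness constraint or a lookup before inserting into todo.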