Crawl website using wget and limit total number of crawled links

I want to learn more about crawlers by playing around with the wget tool. I'm interested in crawling my department's website and finding the first 100 links on that site. So far, the command below is what I have. How do I limit the crawler to stop after 100 links?

wget -r -o output.txt -l 0 -t 1 --spider -w 5 -A html -e robots=on "http://www.example.com"


You can't. wget has no option to stop after a fixed number of links, so if you want something like this, you would have to write a tool yourself.

You could fetch the main page, parse the links out of it yourself, and then fetch them one by one with a limit of 100 items. But that is not something wget supports on its own.
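A minimal sketch of that manual approach in Python, assuming the start page is plain HTML served over HTTP; the URL http://www.example.com and the limit of 100 are placeholders to replace with your own values:

from html.parser import HTMLParser
from urllib.request import urlopen

START_URL = "http://www.example.com"   # assumed starting page
LIMIT = 100                            # keep only the first 100 links

class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, up to a fixed limit."""
    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "a" and len(self.links) < self.limit:
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = urlopen(START_URL).read().decode("utf-8", errors="replace")
collector = LinkCollector(LIMIT)
collector.feed(page)

for link in collector.links:
    print(link)

The collected hrefs may be relative; urllib.parse.urljoin(START_URL, href) turns them into absolute URLs before you fetch each one in turn.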

You could also take a look at HTTrack for website crawling; it has quite a few extra options for this: http://www.httrack.com/


  1. Create a named FIFO (mknod /tmp/httppipe p)
  2. Fork:
    • in the child, run wget --spider -r -l 1 http://myurl --output-file /tmp/httppipe
    • in the parent, read /tmp/httppipe line by line
    • parse each line with =~ m{^\-\-\d\d:\d\d:\d\d\-\- http://$self->{http_server}:$self->{tcport}/(.*)$} and print $1
    • count the lines; after 100 lines simply close the file, which breaks the pipe and stops wget (a rough Python sketch of the same idea follows this list)
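A rough Python sketch of that pipe-reading idea, using a subprocess pipe instead of a named FIFO plus fork; the target URL, the recursion depth, and the exact wget log-line format are assumptions to adjust for your setup:

import re
import subprocess

URL = "http://www.example.com"   # assumed target site, replace with yours
LIMIT = 100                      # stop after this many links

# wget prints its crawl log to stderr; --spider means nothing is saved to disk.
# -l inf allows unlimited recursion depth (adjust to taste).
proc = subprocess.Popen(
    ["wget", "--spider", "-r", "-l", "inf", URL],
    stderr=subprocess.PIPE,
    text=True,
)

# Log lines announcing a fetch look roughly like:
#   --2023-02-10 04:41:00--  http://www.example.com/page.html
# (older wget versions print only a time, as in the regex in the answer above)
url_line = re.compile(r"^--.*?--\s+(\S+)\s*$")

count = 0
for line in proc.stderr:
    match = url_line.match(line)
    if match:
        print(match.group(1))
        count += 1
        if count >= LIMIT:
            # Killing the child stops the crawl, much like breaking the pipe.
            proc.terminate()
            break

proc.wait()

Reading the child's stderr through a pipe gives the same line-by-line behaviour as the FIFO without having to create /tmp/httppipe by hand.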