I need to get a long list of valid URLs for testing my DNS server. I found a web page that has a ton of links in it that would probably yield quite a lot of good links (http://www.cse.psu.edu/~groenvel/urls.html), and I figured that the easiest way to do this would be to download the HTML file and simply grep for the URLs. However, I can't get it to list out my results with only the link.
I know there are lots of ways to do this. I'm not picky how it's done.
Given the URL above, I want a list of all of the URLs (one per line) like this:
http://www.cse.psu.edu/~groenvel/
http://www.acard.com/
http://www.acer.com/
...
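For reference, the kind of one-liner I am after would look something like this (just a rough sketch, assuming a grep that supports -E and -o, such as GNU or BSD grep):

wget -qO- "http://www.cse.psu.edu/~groenvel/urls.html" | grep -Eo 'https?://[^" ]+' > urls.txt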
Method 1
Step1:
wget "http://www.cse.psu.edu/~groenvel/urls.html"
Step2:
perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' /PATH_TO_YOUR/urls.html | grep 'http://' > /PATH_TO_YOUR/urls.txt
Just replace "/PATH_TO_YOUR/" with your file path. This yields a text file containing only the URLs.
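If you prefer a single pipeline, the two steps can be combined, and sort -u will drop duplicate links; this is just a sketch that feeds the same perl expression from wget's standard output:

wget -qO- "http://www.cse.psu.edu/~groenvel/urls.html" | perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' | grep 'http://' | sort -u > /PATH_TO_YOUR/urls.txt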
Method 2
If you have lynx installed, you can do this in a single step:
Step1:
lynx --dump http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt
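If your lynx build supports the -listonly and -nonumbers options (newer releases do; older ones may not), you can let lynx print nothing but the links and skip the awk step, for example:

lynx -listonly -nonumbers -dump http://www.cse.psu.edu/~groenvel/urls.html | grep '^http' > /PATH_TO_YOUR/urls.txt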
Method 3
Using curl:
Step1:
curl -s http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
Method 4
Using wget:
wget -qO- http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
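Note that if the page ever puts more than one link on a single line, both of these pipelines keep only the first one: the greedy .* runs out to the last "> on the line, and awk then prints everything before the first quote. A variant that captures each href separately (again only a sketch, assuming your grep supports -o):

curl -s http://www.cse.psu.edu/~groenvel/urls.html | grep -Eo 'href="(http|https)://[^"]*"' | awk 'BEGIN {FS="\""}; {print $2}' > /PATH_TO_YOUR/urls.txt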
You need wget, grep, and sed. I will try a solution and update my post later.
Update:
wget [the_url];
cat urls.html | egrep -i '<a href=".*">' | sed -e 's/.*<A HREF="\(.*\)">.*/\1/i'
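One caveat: the greedy .* in the sed expression can swallow too much if a line holds more than one link, and the i flag on s/// is a GNU sed extension. A tighter variant along the same lines, assuming your egrep supports -o (GNU and BSD grep both do):

egrep -io '<a href="[^"]*">' urls.html | sed -e 's/.*"\(.*\)".*/\1/' > urls.txt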