How do I get the URLs out of an HTML file?

I need to get a long list of valid URLs for testing my DNS server. I found a web page that has a ton of links in it and would probably yield quite a lot of good ones (http://www.cse.psu.edu/~groenvel/urls.html), and I figured that the easiest way to do this would be to download the HTML file and simply grep for the URLs. However, I can't get the output to list only the links themselves.

I know there are lots of ways to do this. I'm not picky about how it's done.

Given the URL above, I want a list of all of the URLs (one per line) like this:

http://www.cse.psu.edu/~groenvel/
http://www.acard.com/
http://www.acer.com/
...


Method 1

Step 1:

wget "http://www.cse.psu.edu/~groenvel/urls.html"

Step 2:

perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' /PATH_TO_YOUR/urls.html | grep 'http://' > /PATH_TO_YOUR/urls.txt

Just replace "/PATH_TO_YOUR/" with your file path. This yields a text file containing only the URLs.
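
If you would rather not save the page first, the same Perl filter also works on a pipe. A minimal one-step sketch, assuming GNU wget's -qO- is available and that ~/urls.txt (a hypothetical output path) is where you want the list:

wget -qO- "http://www.cse.psu.edu/~groenvel/urls.html" | perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' | grep 'http://' > ~/urls.txt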

Method 2

If you have lynx installed, you can do this in a single step:

Step 1:

lynx --dump http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt
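
lynx --dump renders the page and appends a numbered "References" list, and the awk above keeps the second field (the URL itself) of every line that mentions http:// or https://. If your lynx build supports -listonly (an assumption worth checking with lynx -help), you can skip the rendered body text entirely:

lynx --dump -listonly http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt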

Method 3

Using curl:

Step 1:

curl http://www.cse.psu.edu/~groenvel/urls.html 2>&1 | egrep -o  "(http|https):.*\">" | awk  'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
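
The .*\"> in that egrep is greedy, so on lines with several links the match can run past the first URL, and the awk's split on the double quote is what trims it back; at most one URL per input line survives. A tighter pattern (a hedged alternative, assuming GNU grep for -E and -o, with curl -s to keep the progress meter out of the pipe) matches each URL directly and stops at its closing quote:

curl -s http://www.cse.psu.edu/~groenvel/urls.html | grep -Eo '(http|https)://[^"]+' > /PATH_TO_YOUR/urls.txt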

Method 4

Using wget:

wget -qO- http://www.cse.psu.edu/~groenvel/urls.html 2>&1 | egrep -o  "(http|https):.*\">" | awk  'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
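
Whichever method you use, the raw list will usually contain duplicates. A hedged final step (plain sort -u from coreutils, not part of the original answers) de-duplicates the file in place before you point your DNS tests at it:

sort -u /PATH_TO_YOUR/urls.txt -o /PATH_TO_YOUR/urls.txt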


You need wget, grep, and sed. I will try a solution and update my post later.

Update:

wget [the_url];

cat urls.html | egrep -i '<a href=".*">' | sed -e 's/.*<A HREF="\(.*\)">.*/\1/i'
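
The trailing i flag on that sed substitution (case-insensitive matching) is a GNU sed extension, and the greedy .* patterns mean the capture can swallow more than one URL when a line contains several links. A hedged variant that stops each capture at the first closing quote (still GNU sed for the i flag, still at most one URL per input line):

egrep -i '<a href=".*">' urls.html | sed -e 's/.*<a href="\([^"]*\)".*/\1/i'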