What's the best way to crawl a batch of URLs for a specific HTML element and retrieve the image?

Developer https://www.devze.com 2022-12-19 14:38 Source: web
I'm looking to crawl ~100 webpages that are of the same structure, but the image I require is of a different name in each instance.

The image tag is located at:

#content div.artwork img.artwork

and I need the src URL of that element so the image can be downloaded.

Any ideas? I have the URLs in a .txt file, and I'm on a Mac OS X box.


I am not sure how you can run a selector-style query against the raw files, but a Perl regex might do the job just as well:

while read -r url; do wget -q -O- "$url"; done < urls.txt | \
  perl -nle 'print $1 if /<img[^>]*class="artwork"[^>]*src="([^"]+)"/'
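If you would rather match the element by its class the way the CSS selector does, rather than regexing raw HTML, one option is a small parser built on Python's standard-library `html.parser`. This is a sketch under my own naming (`ArtworkImgParser` and `artwork_srcs` are assumptions, not anything from the question); the returned URLs can then be handed to `wget -i` for the actual downloads:

```python
# Sketch: collect the src of every <img class="artwork"> in a page,
# using only the Python standard library. Names here are illustrative.
from html.parser import HTMLParser


class ArtworkImgParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # class="artwork foo" should still match, so split on whitespace
        classes = (attr.get("class") or "").split()
        if "artwork" in classes and attr.get("src"):
            self.srcs.append(attr["src"])


def artwork_srcs(html):
    """Return the src URLs of all img.artwork elements in the HTML."""
    parser = ArtworkImgParser()
    parser.feed(html)
    return parser.srcs
```

Unlike the regex, this keeps working if the attribute order changes or the tag spans multiple lines.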
