开发者

Which web crawler to use to save news articles from a website into .txt files?

开发者 https://www.devze.com 2022-12-20 13:16 Source: Internet


Related topic: web-crawler

I am currently in dire need of news articles to test an LSI implementation (it's in a foreign language, so the usual ready-to-use packs of files aren't available).

So I need a crawler that, given a starting URL, let's say http://news.bbc.co.uk/, follows all the links it contains and saves their content into .txt files. If we could specify the encoding to be UTF-8, I would be in heaven.

I have zero expertise in this area, so I beg you for some suggestions on which crawler to use for this task.

What you are looking for is a "scraper", and you will have to write one. Furthermore, you may be in violation of the BBC's Terms of Use, not that anyone cares.
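For a rough idea of what "writing one" involves, here is a minimal sketch of the core of a scraper using only the Python standard library. It pulls the visible text and the outgoing links out of one page and saves the text as UTF-8. The names `PageExtractor`, `extract`, and `save_text` are made up for illustration, and fetching the pages and deciding which links to follow are left out.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects visible text and absolute link targets from one HTML page."""

    SKIPPED = {"script", "style"}  # tags whose contents are not article text

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.links = []
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract(html, base_url):
    """Return (text, links) for one page given its HTML source."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks), parser.links


def save_text(text, path):
    """Write the extracted text to a .txt file encoded as UTF-8."""
    Path(path).write_text(text, encoding="utf-8")
```

A crawler would wrap this in a loop: download a page (for example with `urllib.request.urlopen`), call `extract`, save the text, and queue the returned links that stay on the same site. This sketch also picks up menus and footers, so a real scraper would need site-specific filtering to isolate the article body.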

You can grab the site with wget, then run it all through some HTML renderer to convert the HTML to TXT (the Lynx text browser does the job adequately with its -dump option). You will need to write the script that calls Lynx on each downloaded file yourself, but that should be easy enough.
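The script that calls Lynx on each file could look something like the sketch below. It assumes the site was already mirrored into a local directory, for instance with something like `wget --recursive --level=2 --no-parent --adjust-extension http://news.bbc.co.uk/`, and that `lynx` is on the PATH. The function names and the `mirror`/`txt` directory names are only illustrative, and asking Lynx for UTF-8 output via `-display_charset` is my assumption about how to meet the encoding requirement.

```python
import subprocess
import sys
from pathlib import Path


def txt_path_for(html_file, src_root, out_root):
    """Mirror the downloaded file's relative path under out_root with a .txt suffix."""
    relative = html_file.relative_to(src_root)
    return (out_root / relative).with_suffix(".txt")


def lynx_command(html_file):
    """Build the Lynx invocation that renders one local HTML file to plain text."""
    # -dump prints the rendered page to stdout, -nolist drops the trailing
    # list of link targets, -display_charset asks for UTF-8 output.
    return ["lynx", "-dump", "-nolist", "-display_charset=UTF-8", str(html_file)]


def render_with_lynx(html_file):
    """Run Lynx on one file and return its output as raw bytes."""
    result = subprocess.run(lynx_command(html_file), capture_output=True, check=True)
    return result.stdout


def convert_tree(src_root, out_root, render=render_with_lynx):
    """Convert every .htm/.html file under src_root; return how many were written."""
    converted = 0
    for html_file in sorted(src_root.rglob("*.htm*")):
        if not html_file.is_file():
            continue
        target = txt_path_for(html_file, src_root, out_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(render(html_file))
        converted += 1
    return converted


if __name__ == "__main__":
    # Usage: python html_to_txt.py mirror_dir output_dir
    source = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("mirror")
    output = Path(sys.argv[2]) if len(sys.argv) > 2 else Path("txt")
    print(convert_tree(source, output), "files converted")
```

The `render` parameter exists only so the Lynx call can be swapped out, for example to try the directory walking without Lynx installed. Pages saved without an .htm or .html extension are skipped by the glob pattern, so it would need adjusting if wget was run without `--adjust-extension`.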