I am currently in dire need of news articles to test an LSI implementation (it's in a foreign language, so the usual ready-to-use packs of files aren't available).
So I need a crawler that, given a starting URL, let's say http://news.bbc.co.uk/, follows all the contained links and saves their content into .txt files. If we could specify the format to be UTF-8, I would be in heaven.
I have zero expertise in this area, so I beg you for some suggestions on which crawler to use for this task.
What you are looking for is a "scraper", and you will have to write one. Furthermore, you may be in violation of the BBC's Terms of Use, not that anyone is likely to care.
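To give an idea of what writing one involves, here is a minimal sketch in Python using only the standard library. The names `PageParser`, `parse_page` and `crawl` are made up for this illustration, the `max_pages` cap and the same-host restriction are assumptions rather than anything the question asked for, and the text extraction is deliberately crude (it keeps menus and footers along with the article body).

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Collects visible text and outgoing links from one HTML page."""

    SKIP = {"script", "style"}  # tags whose content is not readable text

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def parse_page(html, base_url):
    """Return (plain_text, absolute_links) for one HTML document."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks), parser.links


def crawl(start_url, out_dir, max_pages=50):
    """Breadth-first crawl of one host, saving each page as a UTF-8 .txt file."""
    host = urlparse(start_url).netloc
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    queue, seen, saved = deque([start_url]), {start_url}, 0
    while queue and saved < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as resp:
                charset = resp.headers.get_content_charset() or "utf-8"
                html = resp.read().decode(charset, errors="replace")
        except Exception:
            continue  # skip pages that fail to download or decode
        text, links = parse_page(html, url)
        (out / f"page_{saved:04d}.txt").write_text(text, encoding="utf-8")
        saved += 1
        for link in links:
            link = link.split("#")[0]  # ignore in-page anchors
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return saved
```

Something like `crawl("http://news.bbc.co.uk/", "articles")` would then fill an `articles` directory with numbered .txt files. A real crawler should also pause between requests and honour robots.txt, which this sketch leaves out.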
You can grab the site with wget. Then run it all through some HTML renderer (the Lynx text browser does the job adequately with its -dump option) to convert HTML to TXT. You will need to write the script to call Lynx on each downloaded file yourself, but that should be easy enough.
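The script that calls Lynx on each downloaded file could look roughly like the Python sketch below. It assumes `lynx` is on the PATH and that the mirror was fetched with something like `wget --recursive --level=2 --no-parent --adjust-extension http://news.bbc.co.uk/`, so that pages end in `.html`. The function names and the directory layout are invented for this example. Lynx's `-dump` option prints the rendered page as text, `-nolist` drops the list of links at the end, and `-display_charset=UTF-8` asks for UTF-8 output.

```python
import subprocess
from pathlib import Path


def lynx_command(html_path):
    """Build the Lynx command line that renders one HTML file as text."""
    return ["lynx", "-dump", "-nolist", "-display_charset=UTF-8", str(html_path)]


def txt_path_for(html_path, src_root, dst_root):
    """Mirror the source tree under dst_root, swapping the suffix for .txt."""
    relative = Path(html_path).relative_to(src_root)
    return Path(dst_root) / relative.with_suffix(".txt")


def convert_tree(src_root, dst_root, pattern="*.html"):
    """Run Lynx on every matching file under src_root; return how many converted."""
    converted = 0
    for html_path in sorted(Path(src_root).rglob(pattern)):
        if not html_path.is_file():
            continue
        target = txt_path_for(html_path, src_root, dst_root)
        target.parent.mkdir(parents=True, exist_ok=True)
        result = subprocess.run(
            lynx_command(html_path), capture_output=True, check=False
        )
        if result.returncode == 0:
            # Lynx already wrote the text in the requested charset,
            # so the bytes are stored unchanged.
            target.write_bytes(result.stdout)
            converted += 1
    return converted
```

Calling `convert_tree("news.bbc.co.uk", "txt")` after the wget run would then leave one .txt file per downloaded page under `txt/`, assuming wget saved the mirror in a directory named after the host.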