I need to fetch website details the way a search engine does. I need each site's description, link, and some other information about it, which I will store in my DB. Are there any libraries available for doing this? Please note that I can already crawl a whole webpage, but I need only the information in the form that search engines extract.
Thanks,
Karthik

Which language? APIs and bindings exist for reading webpage content. Do you realize the scale of the task if you wish to create a new 'search engine'? Your question is so generic that there's not a lot of advice that can be given, other than:
Respect robots.txt
Don't hammer the server with requests; you'll soon get your IP blocked by sensible sysadmins.
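
Since no language was given, here is a rough Python sketch of the two points above, using only the standard library. It assumes that "the information search engines show" mostly means the page's <title> and its description/keywords meta tags, which is a guess about what you need. The bot name, the one-second delay, and the function names are placeholders I made up, not part of any existing crawler.

```python
import time
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

USER_AGENT = "ExampleSiteInfoBot/0.1"  # hypothetical bot name, pick your own


class SnippetParser(HTMLParser):
    """Collects the <title> text and a few <meta> tags from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Some sites only provide Open Graph tags (property="og:...").
            name = (attrs.get("name") or attrs.get("property") or "").lower()
            content = attrs.get("content")
            if name in ("description", "keywords", "og:description") and content:
                self.meta[name] = content.strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_snippet(html):
    """Return title, description and keywords found in an HTML string."""
    parser = SnippetParser()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description")
        or parser.meta.get("og:description", ""),
        "keywords": parser.meta.get("keywords", ""),
    }


def fetch_snippet(url, delay_seconds=1.0):
    """Fetch one page politely; returns None if robots.txt disallows it."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(root, "/robots.txt"))
    robots.read()
    if not robots.can_fetch(USER_AGENT, url):
        return None
    time.sleep(delay_seconds)  # crude rate limit: wait before every request
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    snippet = extract_snippet(html)
    snippet["url"] = url
    return snippet
```

Calling fetch_snippet("https://example.com/") would give you a small dict you can write straight into your DB. A real crawler would cache robots.txt per host rather than re-reading it on every call, and many pages have no description meta tag at all, so expect empty fields.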