Hi, I have a CSV file which contains a list of company URLs, like this: www.google.com, www.ibm.com, ...
For each URL in the CSV file, I want to get the URL of its contact-us or about-us page (for example, http://www.google.com/contact). One idea I have is to check the links on each site against the following patterns: contact us, about us, about, locations.
If none of those patterns is found, flag the URL and write it to a log file. If a pattern is found, just print the address (it is used by another process).
I'd suggest using Beautiful Soup to parse the page. Another alternative would be to set up a HIT on Mechanical Turk.
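To make the parsing idea concrete, here is a rough sketch of the link check described in the question. It is only one possible approach, not a tested solution. To keep it free of extra packages it uses the standard library's html.parser in place of Beautiful Soup; with Beautiful Soup, the link collection step would be soup.find_all("a", href=True). The names LinkCollector, find_info_pages and process_csv, the log file name, and the exact regular expression are my own assumptions.

```python
import csv
import logging
import re
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

# Patterns from the question, matched against both the href and the link text.
# "contact" and "about" also cover "contact us" and "about us".
PATTERN = re.compile(r"contact|about|locations", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> in a page."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_info_pages(base_url, html):
    """Return absolute URLs of links that look like contact/about pages."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    found = []
    for href, text in parser.links:
        if PATTERN.search(href) or PATTERN.search(text):
            url = urljoin(base_url, href)  # resolve relative links
            if url not in found:
                found.append(url)
    return found


def process_csv(csv_path, log_path="missing_pages.log"):
    """Print matching pages for each site; log the sites with no match."""
    logging.basicConfig(filename=log_path, level=logging.WARNING)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            for site in (cell.strip() for cell in row):
                if not site:
                    continue
                # Assumption: entries such as www.google.com lack a scheme.
                base = site if site.startswith("http") else "http://" + site
                try:
                    with urlopen(base, timeout=10) as resp:
                        html = resp.read().decode("utf-8", errors="replace")
                except OSError as exc:  # URLError and timeouts are OSErrors
                    logging.warning("%s: fetch failed (%s)", base, exc)
                    continue
                pages = find_info_pages(base, html)
                if pages:
                    for page in pages:
                        print(page)
                else:
                    logging.warning("%s: no contact/about link found", base)
```

Calling process_csv("companies.csv") (a hypothetical file name) would print the matching pages and leave the flagged sites in the log file. The substring match is deliberately loose, so a link such as /roundabout would also match; tighten the pattern if that turns out to be a problem.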
Scrapy is the best tool for this. The best thing about Scrapy is that it is open source. See the Scrapy documentation.