I'm new to web crawling. I'm going to build a search engine whose crawler saves Rapidshare links, along with the URL of the page where each Rapidshare link was found.
In other words, I'm going to build a website similar to filestube.com.
After some searching, I've found that Scrapy works with Django. I've tried to find information about Nutch integration with Django, but found nothing.
I hope you can give me suggestions for building this kind of website, especially the crawler.
The best known pluggable app for that is Django-Haystack, which allows you to connect to several search backends:
- Solr / Lucene: the buzzword-compliant Apache foundation project
- Whoosh: a native Python search library
- Xapian: another very good semantic search engine
Haystack gives you an API that looks like Django's own QuerySet syntax for querying these search engines directly (each of which happens to have its own API and dialect).
If you're just after scraping tools, then whichever tool you use (BeautifulSoup or Scrapy), you'll be on your own: you write the Python code that parses what you want to parse and then populates your Django models.
This can even live in separate Python scripts, for example custom management commands in your app's management/commands/ directory.
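As a rough illustration of the "parse, then populate your models" step, here is a minimal sketch that uses only the standard library. The URL pattern and the names `RapidshareLinkParser` and `extract_rapidshare_links` are my own assumptions for the example, not part of Scrapy or BeautifulSoup; each returned dict is the kind of record you would then save to a Django model.

```python
import re
from html.parser import HTMLParser

# Assumed shape of a Rapidshare file link; adjust to the URLs you actually see.
RAPIDSHARE_RE = re.compile(r"https?://(?:www\.)?rapidshare\.com/files/\S+", re.IGNORECASE)


class RapidshareLinkParser(HTMLParser):
    """Collects href values of <a> tags that look like Rapidshare file links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and RAPIDSHARE_RE.match(value):
                self.links.append(value)


def extract_rapidshare_links(html, source_url):
    """Return one record per matching link, remembering the page it was found on."""
    parser = RapidshareLinkParser()
    parser.feed(html)
    return [{"link": link, "found_on": source_url} for link in parser.links]
```

Fetching the pages and following links is left to whatever crawler you pick; this only covers the extraction part.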
If you have a lot of files to search, you will probably need an index that is rebuilt frequently and allows fast searches without hitting the Django ORM.
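To make the idea of an index concrete, here is a toy inverted index in plain Python. It is only a sketch of the principle (map each word to the documents containing it, then intersect), not how Solr, Whoosh, or Xapian are implemented; the function names are made up for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build a token -> set of document ids mapping from {doc_id: text}."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result
```

A lookup then only touches the index, not the database, which is why it stays fast as the number of files grows.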
Using a Solr index (for example) lets you create extra fields on the fly, such as virtual fields derived from your real model's fields (e.g. splitting an author's first name and last name, adding an uppercased file title field, whatever).
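A small sketch of what such derived fields amount to, written as a plain Python function rather than real Solr or Haystack configuration. The field names (`author_firstname`, `author_lastname`, `title_upper`) and the function itself are hypothetical, chosen only to mirror the examples above.

```python
def build_index_document(record):
    """Return a copy of the record with extra, derived fields for the index.

    Assumes record has "author" ("First Last") and "title" keys.
    """
    first, _, last = record["author"].partition(" ")
    return {
        **record,
        "author_firstname": first,
        "author_lastname": last,
        "title_upper": record["title"].upper(),
    }
```

The database row keeps only the real fields; the derived ones exist only in the document sent to the search backend.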
Of course, if you don't need speedy indexing, keyword boosting or semantic analysis, you can still do a classic full-text search over a couple of Django model fields:
- Django's native QuerySet API: see the "__search" field lookup, e.g. title__search='something' (note that in older Django versions this lookup works only with MySQL)
- PostgreSQL-specific full-text search with Django
Have you checked DjangoItem? It's an experimental Scrapy feature, but it's known to work.