I am confused by these terms; they all look the same to me. Can someone please explain the order in which these steps are performed and which libraries can do the work?
I want to know the input and the output of each step, e.g.:
Crawling
Input = URL
Output = ?
Indexing
Input = ?
I'll give you a general, algorithmic description; adapt it to whichever Python libraries you use.
Crawling: starts from a set of URLs and aims to expand that set. It follows outgoing links and tries to expand the graph as far as it can (until it has covered the part of the web graph reachable from the initial set of URLs, or until resources [usually time] run out).
So:
input = Set of URLs
output = bigger set of URLs which are reachable from the input
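To make the crawling step concrete, here is a minimal sketch of a breadth-first crawl. It is only an illustration: the names `crawl`, `get_links` and `TOY_WEB` are made up for this example, and the "web" is an in-memory dict rather than real HTTP fetching, so a real crawler would replace `get_links` with something that downloads a page and extracts its links.

```python
from collections import deque


def crawl(seed_urls, get_links, max_pages=1000):
    """Breadth-first crawl: return the set of URLs reachable from seed_urls.

    get_links is a caller-supplied function (an assumption of this sketch)
    that returns the outgoing links of a URL. max_pages is the resource
    budget: no new URL is added once that many have been collected.
    """
    seen = set(seed_urls)
    queue = deque(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical link graph standing in for the web.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # links to a.example, but nothing links to it
}

reachable = crawl({"a.example"}, lambda url: TOY_WEB.get(url, []))
print(sorted(reachable))  # ['a.example', 'b.example', 'c.example', 'd.example']
```

Note that `e.example` is never found: the crawler only follows outgoing links, so pages that nothing in the reachable set links to stay invisible.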
Indexing: uses the data the crawlers gathered to "index" the documents. An index is essentially a list that maps each term (usually a word) in the collection to the documents in which that term appears.
input: set of URLs (together with the page content the crawler fetched for them)
output: index file/library.
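As a rough sketch of the indexing step, the snippet below builds an inverted index in memory. The names `tokenize`, `build_index` and `DOCS` are invented for this example, the tokenizer is deliberately naive (lowercase alphanumeric runs only), and a real index would normally be written to disk rather than kept in a dict.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms (a very naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of URLs whose text contains that term."""
    index = defaultdict(set)
    for url, text in documents.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


# Hypothetical crawl result: URL -> page text.
DOCS = {
    "a.example": "A web crawler follows links.",
    "b.example": "An index maps terms to documents.",
    "c.example": "Search uses the index; the crawler feeds it.",
}

index = build_index(DOCS)
print(sorted(index["crawler"]))  # ['a.example', 'c.example']
print(sorted(index["index"]))    # ['b.example', 'c.example']
```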
Search: uses the index to find the documents relevant to a given query.
input: a query (string) and the index [usually an implicit argument, since it is part of the state]
output: the documents relevant to the query (a "document" here is a web page that was crawled)
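A minimal sketch of the search step, assuming the index is a plain dict from term to set of URLs. The function `search` and the hand-written `INDEX` are hypothetical, and "relevant" is simplified here to "contains every query term" (boolean AND); real engines also rank the results.

```python
import re


def search(query, index):
    """Return the URLs that contain every term of the query (boolean AND)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical, hand-written index: term -> URLs containing it.
INDEX = {
    "web": {"a.example", "b.example"},
    "crawler": {"a.example"},
    "index": {"b.example", "c.example"},
}

print(sorted(search("web crawler", INDEX)))  # ['a.example']
print(sorted(search("web index", INDEX)))    # ['b.example']
```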
I encourage you to have a look at PyLucene, which handles indexing and search (and more!), though Lucene itself does not include a crawler, and to read some more about Information Retrieval.
You should also check out Scrapy, a Python framework:
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
It crawls the sites and extracts the data of interest, which you specify using XPath. It can be run across the site periodically, saving the extracted data to a database as a new version each time.