creating a distributed crawling python app. it consists of a master server, and associated client apps that will run on client servers. the purpose of the client app is to run across a targeted site, to extract specific data. the clients need to go "deep" within the site, behind multiple levels of forms, so each client is specifically geared towards a given site.
each client app looks something like:

    main:
        parse the initial url
        call level1(data1)

    function level1(data1):
        fetch and parse the url for data1
        use the required xpath to get the dom elements
        call level2(data2) for each extracted result

    function level2(data2):
        fetch and parse the url for data2
        use the required xpath to get the dom elements
        call level3(data3) for each extracted result

    function level3(data3):
        fetch and parse the url for data3
        use the required xpath to get the dom elements
        call level4(data4) for each extracted result

    function level4(data4):
        fetch and parse the url for data4
        use the required xpath to get the dom elements

at the final function:
-- all the data is output, and eventually returned to the server
-- at this point the data has elements from each function...
my question: given that the number of calls made to the child function by the current function varies, i'm trying to figure out the best approach.
each function essentially fetches a page of content, and then parses
the page using a number of different XPath expressions, combined
with different regex expressions depending on the site/page.
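to make that concrete, a single level looks roughly like this (just a sketch -- requests, lxml, the url, the xpath and the regex below are placeholders, not the real site logic):

    # a rough sketch of one "level": fetch a page, pull nodes with an xpath,
    # then refine the text with a regex (url, xpath and regex are placeholders)
    import re
    import requests
    from lxml import html

    def level1(url):
        page = requests.get(url, timeout=30)
        tree = html.fromstring(page.content)
        # site-specific xpath -- placeholder only
        hrefs = tree.xpath('//div[@class="listing"]//a/@href')
        items = []
        for href in hrefs:
            # site-specific regex -- placeholder only
            m = re.search(r'/item/(\d+)', href)
            if m:
                items.append(m.group(1))
        return hrefs, items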
if i run a client on a single box, as a sequential process, it'll
take a while, but the load on the box is rather small. i've thought
of attempting to implement the child functions as threads spawned from
the current function, but that could be a nightmare, as well as quickly
bring the "box" to its knees!
i've thought of breaking the app up in a manner that would allow
the master to essentially pass packets to the client boxes, in a
way that allows each client/function to be run directly from the
master. this approach requires a bit of a rewrite, but it has a number
of advantages: redundancy and speed. it would detect if a section of
the process crashed and restart from that point.
but i'm not sure if it would be any faster...
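as a rough sketch of what i mean (using python's multiprocessing.managers to expose a shared task queue from the master to the client boxes -- hostname, port and authkey below are placeholders):

    # master box: serve a shared task queue and a results queue over the network
    # (a rough sketch only; host/port/authkey are placeholders)
    from multiprocessing.managers import BaseManager
    from queue import Queue

    task_queue = Queue()
    result_queue = Queue()

    class QueueManager(BaseManager):
        pass

    QueueManager.register("get_tasks", callable=lambda: task_queue)
    QueueManager.register("get_results", callable=lambda: result_queue)

    if __name__ == "__main__":
        manager = QueueManager(address=("", 50000), authkey=b"change-me")
        manager.get_server().serve_forever()

    # client box (sketch): connect, pull a url "packet", run the right level
    # function, and push the extracted data back
    # QueueManager.register("get_tasks")
    # QueueManager.register("get_results")
    # m = QueueManager(address=("master-host", 50000), authkey=b"change-me")
    # m.connect()
    # url = m.get_tasks().get()
    # ... run level1/level2/... on url ...
    # m.get_results().put(result)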
i'm writing the parsing scripts in python..
so... any thoughts/comments would be appreciated...
i can get into a great deal more detail, but didn't want to bore anyone!!
thanks!
tom
This sounds like a use case for MapReduce on Hadoop.
Hadoop Map/Reduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner. In your case, this would be a smaller cluster.
A Map/Reduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
You mentioned that,
i've thought of breaking the app up in a manner that would allow the master to essentially pass packets to the client boxes, in a way to allow each client/function to be run directly from the master.
From what I understand, you want a main machine (box) to act as a master, and have client boxes that run other functions. For instance, you could run your main() function and parse the initial URLs on it. The nice thing is that you could parallelize your task for each of these URLs across different machines, since they appear to be independent of each other.
Since level4 depends on level3, which depends on level2, and so on, you can simply pipe the output of each level to the next rather than calling one function from the other.
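As a rough illustration, a streaming mapper for one of your levels could look something like this (a sketch only; requests/lxml and the XPath are placeholder choices, not part of Hadoop itself):

    #!/usr/bin/env python
    # sketch of a Hadoop Streaming mapper: one input url per line on stdin,
    # one tab-separated (url, extracted_value) record per line on stdout
    import sys
    import requests
    from lxml import html

    for line in sys.stdin:
        url = line.strip()
        if not url:
            continue
        tree = html.fromstring(requests.get(url, timeout=30).content)
        # placeholder XPath -- substitute the site-specific expression
        for title in tree.xpath('//h2[@class="title"]/text()'):
            print("%s\t%s" % (url, title.strip()))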
For examples of how to do this, I would recommend checking out the following tutorials, in the given order:
The Hadoop tutorial is a simple introduction and overview to what map-reduce is and how it works.
Michael Noll's tutorial on how to utilize Hadoop on top of Python (the basic concepts of Mapper and Reducer) in a simple way
And finally, a tutorial for a framework called Dumbo, released by the folks at Last.fm, which automates and builds on Michael Noll's basic example for use in a production system.
Hope this helps.
Take a look at the multiprocessing module. It allows you to set up a work queue and a pool of workers -- as you parse the page, you can spawn off tasks to be done by separate processes.
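For example, something along these lines (a minimal sketch; fetch_and_parse stands in for one of your level functions):

    # minimal sketch: a pool of worker processes fanning out page fetches;
    # fetch_and_parse stands in for one of your level functions
    from multiprocessing import Pool

    def fetch_and_parse(url):
        # fetch the page, run the XPath/regex extraction, return the results
        return url, []

    if __name__ == "__main__":
        urls = ["http://example.com/page/%d" % i for i in range(1, 101)]
        with Pool(processes=8) as pool:
            for url, records in pool.imap_unordered(fetch_and_parse, urls):
                print(url, len(records))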
Check out the Scrapy package. It will allow for easy creation of your "client apps" (a.k.a. crawlers, spiders, or scrapers) that go "deep" into a website.
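A rough sketch of what a spider could look like (names, URLs and XPath expressions are placeholders; the API shown is from recent Scrapy releases):

    # rough sketch of a Scrapy spider that follows links "deeper" into a site
    import scrapy

    class SiteSpider(scrapy.Spider):
        name = "site"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # level 1: follow listing links down to the detail pages
            for href in response.xpath('//div[@class="listing"]//a/@href').getall():
                yield response.follow(href, callback=self.parse_detail)

        def parse_detail(self, response):
            # level 2 (and deeper): extract the final data
            yield {
                "url": response.url,
                "title": response.xpath("//h1/text()").get(),
            }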
brool and viksit both have good suggestions for the distributed part of your project.