
Apache Hadoop : Can it do "time-varying" input?

I haven't found an answer to this even after a bit of googling. My input files are generated by a process which chunks them out when, say, a file reaches 1GB. Now, if I were to run a MapReduce job that processes an input directory in the DFS, how do I make sure that this job picks up the files added to that same input directory while the Hadoop job is running?

I have a feeling this is close to impossible, because when a Hadoop job runs, it calculates remaining time and so on, so when my input keeps piling up, or in other words is "variable", Hadoop wouldn't know how to manage it - that is my guess. I would like to know your take on this and also on the best possible alternative approaches! Appreciate your help.


You're describing a use-case that Hadoop was not designed to handle. Hadoop scans the input directory and determines the splits even before the map/reduce functions are run. So if more data has been added after the splits have been determined, you're out of luck.

It seems like you need a more real-time processing system. Hadoop is designed for batch-oriented processing. I'm not sure exactly what your data processing requirements are, so it's hard to recommend a solution. Maybe micro-batching, i.e. running your Hadoop job more often on smaller amounts of data, might help?
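One way to picture the micro-batching idea (this is only an illustration, not something from the question): give each hour its own input directory in HDFS and only ever run the job over the previous, already-closed hour, so files that arrive while a job is running simply fall into the next batch. The paths, jar name, and driver class below are hypothetical.

```python
#!/usr/bin/env python3
# Illustrative micro-batching sketch: run the job over the previous hour's
# directory only, so input that is still arriving lands in the next batch.
import subprocess
from datetime import datetime, timedelta

def previous_hour_dirs(base_in="/data/incoming", base_out="/data/results"):
    # The previous hour is "closed": no new files will be added to it.
    hour = (datetime.now() - timedelta(hours=1)).strftime("%Y-%m-%d-%H")
    return "%s/%s" % (base_in, hour), "%s/%s" % (base_out, hour)

def main():
    input_dir, output_dir = previous_hour_dirs()
    # Hypothetical driver jar/class; the point is only that the input is a
    # closed, hour-sized slice rather than a directory that is still growing.
    subprocess.check_call(["hadoop", "jar", "my-job.jar", "com.example.MyDriver",
                           input_dir, output_dir])

if __name__ == "__main__":
    main()
```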


Architecturally speaking, Hadoop can handle this, but you need to build some front end (or work with one of the open-source options below) and let Hadoop do its job.

Like any good system, Hadoop cannot and should not do everything, but you have some options to explore.

If you spend a little time developing some scripts with a database (or queue) behind them, you can solve this problem yourself fairly quickly. Assuming you can write something in Ruby or Python and occasionally call a bash script, this is very simple; even if you are using Java, the complexity is not much more than wrapping the bash scripts with an outer layer of Ruby or Python.

Step 1: Files roll (based on your parameter, 1GB or whatever) into a directory such as /holding, and a record of each "rolled" file is inserted into a table (or queue). If you cannot insert at roll time, you can scan the directory (via cron), move each file to a new directory, and insert its name and location into the database there.
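A minimal sketch of that cron-driven scanner, assuming a SQLite table and made-up directory and database paths; swap in whatever database or queue you actually use:

```python
#!/usr/bin/env python3
# Hypothetical Step 1 watcher: run from cron, move newly rolled files from
# /holding into a staging directory and record each one in a small SQLite
# table (paths and schema are illustrative assumptions).
import os
import shutil
import sqlite3

HOLDING_DIR = "/holding"          # where the producer rolls 1GB files
STAGING_DIR = "/holding/staged"   # where files wait to be pushed to HDFS
DB_PATH = "/var/lib/ingest/files.db"

def main():
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""CREATE TABLE IF NOT EXISTS rolled_files (
                        name TEXT PRIMARY KEY,
                        location TEXT,
                        status TEXT DEFAULT 'pending')""")
    os.makedirs(STAGING_DIR, exist_ok=True)
    for name in os.listdir(HOLDING_DIR):
        src = os.path.join(HOLDING_DIR, name)
        if not os.path.isfile(src):
            continue
        dst = os.path.join(STAGING_DIR, name)
        shutil.move(src, dst)  # move it so the next cron run does not see it again
        conn.execute("INSERT OR IGNORE INTO rolled_files (name, location) VALUES (?, ?)",
                     (name, dst))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```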

Step 2: cron (on whatever time frame you want, say once per hour) another script that goes to the database (or queue) and gets ALL of the files you want to MapReduce.

Step 3: In the script from Step 2, loop over the files you find and push them into Hadoop across multiple threads (or, if you are using Ruby, it is better to fork). I say push because the method could be a simple "hadoop fs -put" (you can have your Ruby or Python script call a bash script), or some custom jar file loader, depending on what you need... You may want another table relating the files to a particular job, but I leave that to you.
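A rough sketch of Steps 2 and 3 together, reusing the hypothetical rolled_files table from the previous sketch and shelling out to "hadoop fs -put" from a small thread pool; the HDFS target directory is also made up:

```python
#!/usr/bin/env python3
# Hypothetical Steps 2-3: fetch all pending files from the SQLite table and
# push them into HDFS with "hadoop fs -put" from a few worker threads.
import sqlite3
import subprocess
from concurrent.futures import ThreadPoolExecutor

DB_PATH = "/var/lib/ingest/files.db"
HDFS_INPUT_DIR = "/user/etl/input/current"  # illustrative HDFS target

def push(location):
    # Equivalent to a bash one-liner; returns True on success.
    result = subprocess.run(["hadoop", "fs", "-put", location, HDFS_INPUT_DIR])
    return result.returncode == 0

def main():
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "SELECT name, location FROM rolled_files WHERE status = 'pending'").fetchall()
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(push, [loc for _, loc in rows]))
    for (name, _), ok in zip(rows, results):
        if ok:
            conn.execute("UPDATE rolled_files SET status = 'pushed' WHERE name = ?",
                         (name,))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```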

Step 4: run the job (either from a third script, with your tables keeping some notion of job events, or simply as the last step after you have pushed the files to Hadoop) and do whatever you want with the output.
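And a last sketch for Step 4, again with a hypothetical jar, driver class, and paths; the point is just launching the job over the directory the previous script filled and marking those files as processed:

```python
#!/usr/bin/env python3
# Hypothetical Step 4: kick off the MapReduce job over the directory that the
# previous script populated, then mark the pushed files as processed.
# The jar name, driver class, and paths are illustrative assumptions.
import sqlite3
import subprocess

DB_PATH = "/var/lib/ingest/files.db"
HDFS_INPUT_DIR = "/user/etl/input/current"
HDFS_OUTPUT_DIR = "/user/etl/output/run-001"

def main():
    result = subprocess.run(
        ["hadoop", "jar", "my-job.jar", "com.example.MyDriver",
         HDFS_INPUT_DIR, HDFS_OUTPUT_DIR])
    if result.returncode != 0:
        raise SystemExit("MapReduce job failed")
    conn = sqlite3.connect(DB_PATH)
    conn.execute("UPDATE rolled_files SET status = 'processed' WHERE status = 'pushed'")
    conn.commit()
    conn.close()

if __name__ == "__main__":
    main()
```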

Open Source Options

You can use Oozie http://yahoo.github.com/oozie/releases/2.2.1/, a workflow solution for Hadoop open-sourced by Yahoo, which you might find useful too, but it all depends on how much you get out of the effort you put in. For what you are doing, it sounds like some effort on a custom set of scripts is the way to automate your workflow... but take a look at Oozie.

Another workflow tool for Hadoop is Azkaban http://sna-projects.com/azkaban/

Lastly, you can consider using a streaming architecture to move your files to HDFS... Nowadays there are three options (Kafka is new and was released just a few days ago, with more queuing behind its core architecture than the other two):

1) Flume https://github.com/cloudera/flume/wiki

2) Scribe HDFS http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html

3) Kafka http://sna-projects.com/kafka/
