
Hadoop beginner's question

Developer https://www.devze.com 2022-12-23 18:51 Source: Internet

I've read some documentation about Hadoop and seen the impressive results. I get the bigger picture, but I'm finding it hard to tell whether it would fit our setup. The question isn't programming related, but I'm eager to get the opinion of people who currently work with Hadoop on how it would fit our setup:

  • We use Oracle for the backend
  • Java (Struts2/Servlets/iBatis) for the frontend
  • Nightly we get data which needs to be summarized. This runs as a batch process (takes 5 hours)

We are looking for a way to cut those 5 hours to a shorter time.

Where would Hadoop fit into this picture? Can we continue to use Oracle after adopting Hadoop?


The chances are you can dramatically reduce the elapsed time of that batch process with some straightforward tuning. I offer this analysis on the simple basis of past experience. Batch processes tend to be written very poorly, precisely because they are autonomous and so don't have irate users demanding better response times.

Certainly I don't think it makes any sense at all to invest a lot of time and energy re-implementing our application in a new technology - no matter how fresh and cool it may be - until we have exhausted the capabilities of our current architecture.

If you want some specific advice on how to tune your batch query, well that would be a new question.


Hadoop is designed to parallelize a job across multiple machines. To determine whether it will be a good candidate for your setup, ask yourself these questions:

  • Do I have many machines on which I can run Hadoop, or am I willing to spend money on something like EC2?

  • Is my job parallelizable? (If your 5-hour batch process consists of thirty 10-minute tasks that have to be run in sequence, Hadoop will not help you.)

  • Does my data require random access? (This is actually pretty significant - Hadoop is great at sequential access and terrible at random access. In the latter case, you won't see enough speedup to justify the extra work / cost).
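To make the parallelizability question concrete, here is a rough back-of-the-envelope sketch in plain Java. It assumes an idealized cluster with no scheduling or data-transfer overhead, and the class name, method names, and the worker count of 10 are hypothetical, chosen only for illustration. It compares thirty 10-minute tasks that must run in sequence with the same tasks when they are independent of each other.

```java
/** Idealized elapsed-time estimate; ignores startup, shuffle, and I/O overhead. */
final class BatchEstimate {

    /** Each task waits for the previous one, so extra machines do not help. */
    static int sequentialMinutes(int tasks, int minutesPerTask) {
        return tasks * minutesPerTask;
    }

    /** Independent tasks run in "waves" of up to `workers` tasks at a time. */
    static int parallelMinutes(int tasks, int minutesPerTask, int workers) {
        int waves = (tasks + workers - 1) / workers; // ceiling division
        return waves * minutesPerTask;
    }

    public static void main(String[] args) {
        // 30 tasks of 10 minutes each, as in the example above
        System.out.println("In sequence: " + sequentialMinutes(30, 10) + " min");
        // 10 workers is an assumed cluster size, not a figure from the question
        System.out.println("Independent, 10 workers: " + parallelMinutes(30, 10, 10) + " min");
    }
}
```

If the tasks depend on each other, the first figure stays at 300 minutes no matter how many machines you add, which is why this question matters before anything else.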

As far as where it "fits in" - you give Hadoop a bunch of data, and it gives you back output. One way to think of it is like a giant Unix process - data goes in, data comes out. What you do with it is your business. (This is of course an overly simplified view, but you get the idea.) So yes, you will still be able to write data to your Oracle database.


The Hadoop Distributed File System (HDFS) supports highly parallel batch processing of data using MapReduce.

So your current process takes 5 hours to summarize the data. Off the bat, general summarization tasks are one of the 'types' of job MapReduce excels at. However, you need to understand whether your processing requirements will translate into a MapReduce job. By this I mean, can you achieve the summaries you need using the key/value pairs MapReduce limits you to?
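As a way to test whether a summary fits the key/value model, it can help to write it in plain Java first. The sketch below imitates the map, shuffle, and reduce phases without using any Hadoop classes. The "region,amount" input format, the class name, and the method names are all assumptions made up for this example, since the question does not describe the actual data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java imitation of map / shuffle / reduce; not the Hadoop API. */
final class KeyValueSummary {

    /** Map phase: turn one raw line "region,amount" into a (key, value) pair. */
    static Map.Entry<String, Long> map(String line) {
        String[] fields = line.split(",");
        return Map.entry(fields[0].trim(), Long.parseLong(fields[1].trim()));
    }

    /** Reduce phase: collapse all values that share a key into one total. */
    static long reduce(List<Long> values) {
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Long> summarize(List<String> lines) {
        // Shuffle phase: group the mapped values by key
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Long> kv = map(line);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        // Reduce each group independently; this is the part a cluster can spread out
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("north,10", "south,5", "north,7");
        System.out.println(summarize(lines));
    }
}
```

If your nightly summary can be phrased this way, with each input record mapped on its own and each key reduced on its own, it is likely a reasonable MapReduce candidate. If a record needs to look up other records at random, it probably is not.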

Hadoop requires a cluster of machines to run. Do you have the hardware to support a cluster? This usually comes down to how much data you are storing on HDFS and how fast you want to process it. Generally, when running MapReduce on a Hadoop cluster, the more machines you have, the more data you can store or the faster a job runs. Having an idea of the amount of data you process each night would help a lot here.

You can still use Oracle. You can use Hadoop/MapReduce to do the data crunching and then use custom code to insert the summary data into an Oracle DB.
