Want to compare two consecutive jobs on Hadoop

Developer https://www.devze.com 2023-02-28 16:15 Source: web
I want to know whether I can compare two consecutive jobs in Hadoop. If that is not possible, I would appreciate it if anyone could tell me how to proceed. To be precise, I want to compare the jobs in terms of what exactly each job did. The reason is that I want to build statistics on how many of the jobs executed on Hadoop behaved similarly, for example how many times the same sorting function was executed on the same input.

For example, suppose the first job did something like SortList(A) and another job did SortList(A) + Group(result(SortList(A))). I am wondering whether Hadoop stores a mapping somewhere such as JobID X -> SortList(A).

So far, I have approached this problem by looking for the entry point in Hadoop and trying to understand how a job is created, what information is kept with a job ID, and in what form (as code or as some description), but I have not been able to figure it out.


Hadoop's Counters might be a good place to start. You can define your own counter names (for example, one counter name per data set you are working on) and increment that counter each time you perform a sort on that data set. Finding out which data set you are working on, however, may be the more difficult task.

Here's a tutorial I found: http://philippeadjiman.com/blog/2010/01/07/hadoop-tutorial-series-issue-3-counters-in-action/
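
The counter idea above could look roughly like this with Hadoop Streaming, where a task increments a counter by writing a line of the form `reporter:counter:<group>,<counter>,<amount>` to stderr. This is only a sketch: the group name `SortStats`, the `sort_<dataset>` naming scheme, and the helper functions are made up for illustration, and the data set name is assumed to be passed in by hand, which sidesteps the hard part mentioned above.

```python
import sys

# Hypothetical counter group; not something Hadoop defines.
COUNTER_GROUP = "SortStats"


def counter_line(group, name, amount=1):
    """Format a Hadoop Streaming counter update line.

    Streaming tasks update counters by writing
    reporter:counter:<group>,<counter>,<amount> to stderr.
    Group and counter names must not contain commas.
    """
    return f"reporter:counter:{group},{name},{amount}"


def run_mapper(lines, dataset, out=sys.stdout, err=sys.stderr):
    """Pass records through unchanged and bump one counter for this data set.

    The counter is incremented once per call, i.e. once per map task,
    so the job-level total is the number of map tasks, not the number of jobs.
    """
    print(counter_line(COUNTER_GROUP, f"sort_{dataset}"), file=err)
    for line in lines:
        out.write(line)
```

A streaming mapper script would then call something like `run_mapper(sys.stdin, "A")`. The totals show up alongside the job's built-in counters, so comparing two jobs still means reading the counters of each job and comparing them yourself.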


No. Hadoop jobs are just programs, and they can have any side effects: they can write ordinary files, HDFS files, or to a database. Nothing in Hadoop records all of their activities. All Hadoop does is manage the scheduling and the flow of data.
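
Since nothing records this for you, one workaround is to have each job write down its own description when it is submitted (for instance to a log file) and compare those descriptions afterwards. The sketch below is an assumption-laden illustration, not a Hadoop feature: `job_signature`, `similar_job_counts`, the operation names, and the `/data/A` path are all hypothetical, and it only treats two jobs as "similar" when they ran the same operations in the same order on the same inputs.

```python
import hashlib
from collections import Counter


def job_signature(operations, input_paths):
    """Return a stable fingerprint for an ordered list of operations and a set of inputs."""
    # Operation order matters (sort-then-group differs from group-then-sort);
    # input order does not, so the paths are sorted before hashing.
    text = "|".join(operations) + "#" + "|".join(sorted(input_paths))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similar_job_counts(jobs):
    """Count how many recorded jobs share each signature."""
    return Counter(job_signature(ops, paths) for ops, paths in jobs)


# Hypothetical self-reported job records: (operations, input paths).
jobs = [
    (["SortList"], ["/data/A"]),
    (["SortList"], ["/data/A"]),
    (["SortList", "Group"], ["/data/A"]),
]
counts = similar_job_counts(jobs)
```

Here the two plain `SortList` jobs on `/data/A` share a signature, while the `SortList` plus `Group` job gets its own. This only works for jobs that cooperate by reporting what they do; it cannot recover the behavior of arbitrary existing jobs.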
