Amazon MapReduce with cronjob + APIs

开发者 https://www.devze.com 2023-03-07 05:32 (source: web)
I have a website set up on an EC2 instance which lets users view info from 4 of their social networks.

Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day.

Initially we had a cron-job which went through each user and did the necessary calls to the APIs and then stored the data on the DB (amazon rds instance).

This operation takes between 2 and 30 seconds per user, which means doing it one by one would take days to update everyone.

I was looking at MapReduce and would like to know whether it would be a suitable option for what I'm trying to do, but at the moment I can't tell for sure.

Would I be able to give an .sql file to MapReduce, with all the records I want to update + a script that tells MapReduce what to do with each record and have it process them all simultaneously?

If not, what would be the best way to go about it?

Thanks for your help in advance.


I am assuming each user's data is independent of the other users' data, which seems logical to me. If that's not the case, please ignore this answer.

Since you have mutually independent data (that is, each user's data is independent of every other user's), there is no need to use MapReduce. MR is just a programming paradigm that simplifies data manipulation when records depend on one another (map prepares the data, then there is a sorting phase, then reduce pulls the results from the sorted records).

In your case, if you want to use more computers, just split the load between them: each computer should process roughly 10,000 users per hour (a very rough estimate). Users can either be distributed among the computers beforehand, or requested in chunks of about 1,000, so machines that finish sooner pick up more users.
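A minimal sketch of that chunked approach in Python, assuming hypothetical `fetch_social_data` and `store_in_db` helpers standing in for the real API calls and RDS writes (the thread pool suits this workload because the per-user time is dominated by network I/O):

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1000  # users per unit of work


def fetch_social_data(user_id):
    """Placeholder for the real social-network API calls (2-30 s each in production)."""
    return {"user": user_id}


def store_in_db(user_id, data):
    """Placeholder for the write back to the RDS instance."""
    pass


def chunks(seq, size):
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


def process_chunk(user_ids):
    for uid in user_ids:
        store_in_db(uid, fetch_social_data(uid))
    return len(user_ids)


def refresh_all(user_ids, workers=8):
    # Workers pull chunks as they finish, so faster workers take more load.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks(list(user_ids), CHUNK_SIZE)))
```

The same chunk-pulling idea extends across machines by replacing the in-process queue with, say, an SQS queue each instance reads from.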

BUT there is an added bonus to using an MR framework (such as Hadoop), even if you only use one phase (map only): it does the error handling for you (nodes failing, jobs failing, ...) and it takes care of distributing the input among the nodes.

I'm not sure if MR is worth all the trouble to set up; it depends on your previous experience. YMMV.


If my understanding is correct, were this application implemented as MapReduce, all the processing would happen in the map phase and reduce would simply pass the map output through. So if I were to implement this, I would just divide the job across multiple EC2 instances, with each instance processing a given range of records in your SQL data. This assumes you have a good idea of how to divide the data among the instances. The advantage is that you don't pay for Elastic MapReduce and you avoid any possible MapReduce overhead.
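The range split described above can be sketched as follows; this is only an illustration, and the `users` table and `LIMIT`/`OFFSET` query in the comment are assumptions about the schema:

```python
def ranges(total_rows, instances):
    """Split [0, total_rows) into near-equal contiguous (offset, limit) ranges,
    one per EC2 instance."""
    base, extra = divmod(total_rows, instances)
    out, offset = [], 0
    for i in range(instances):
        limit = base + (1 if i < extra else 0)  # spread the remainder evenly
        out.append((offset, limit))
        offset += limit
    return out

# Each instance i would then select its slice, e.g.:
#   SELECT id FROM users ORDER BY id LIMIT {limit} OFFSET {offset}
```

For large tables, keyset pagination on the primary key (splitting by id boundaries rather than OFFSET) avoids the cost of deep offsets, but the partitioning idea is the same.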
