I am a newbie to Hadoop. I have managed to develop a simple MapReduce application that works fine in pseudo-distributed mode. Now I want to test it in fully distributed mode. I have a few questions regarding that:
- How many machines (nodes) do I need (minimum and recommended) to process a file of 1-10 GB?
- What are the hardware requirements (mainly, I want to know the number of cores, memory, and disk space)?
I'd check out Cloudera's hardware recommendations: http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
A snippet from that page:
Various hardware configurations for different workloads, including our original “base” recommendation:
- Light Processing Configuration (1U/machine): Two quad core CPUs, 8GB memory, and 4 disk drives (1TB or 2TB). Note that CPU-intensive work such as natural language processing involves loading large models into RAM before processing data and should be configured with 2GB RAM/core instead of 1GB RAM/core.
- Balanced Compute Configuration (1U/machine): Two quad core CPUs, 16 to 24GB memory, and 4 disk drives (1TB or 2TB) directly attached using the motherboard controller. These are often available as twins with two motherboards and 8 drives in a single 2U cabinet.
- Storage Heavy Configuration (2U/machine): Two quad core CPUs, 16 to 24GB memory, and 12 disk drives (1TB or 2TB). The power consumption for this type of machine starts around ~200W in idle state and can go as high as ~350W when active.
- Compute Intensive Configuration (2U/machine): Two quad core CPUs, 48-72GB memory, and 8 disk drives (1TB or 2TB). These are often used when a combination of large in-memory models and heavy reference data caching is required.
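For a small test cluster, the roughly 1GB-RAM-per-core guidance above translates into how many concurrent task slots you allow per worker and how much heap you give each task JVM. Below is a minimal sketch of a mapred-site.xml for a classic MRv1 (Hadoop 1.x) setup, assuming a node with two quad-core CPUs and 16GB of RAM; the slot counts and heap size are illustrative assumptions to tune for your own workload, not fixed recommendations.

```xml
<!-- mapred-site.xml: illustrative slot/heap sizing for one worker node
     (2 x quad-core CPU, 16GB RAM), following the ~1GB RAM per core rule.
     All values below are assumptions; adjust for your own workload. -->
<configuration>
  <!-- Concurrent map tasks per TaskTracker: roughly one per core,
       leaving headroom for reducers and the DataNode/TaskTracker daemons. -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>6</value>
  </property>
  <!-- Concurrent reduce tasks per TaskTracker. -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
  <!-- Heap for each task JVM: 6 maps + 2 reduces at ~1GB each is about 8GB,
       which fits within 16GB alongside the OS and Hadoop daemons. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>
```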