I'm working with a number of large files that contain matrices of data corresponding to NASA's MODIS grid. The grid splits the earth's surface up into a 21,600 x 43,200 pixel array, and this particular dataset gives one integer value per pixel.
I have about 200 files, one file per month, and need to create a time series for each pixel.
My question is: for a map task that takes one of these files, should I cut the grid up into chunks of, say, 24,000 pixels and emit those as values (with location and time period as keys), or simply emit a key-value pair for every single pixel, treating a pixel like a word in the canonical word count example?
The chunking will work fine; it just introduces an arbitrary "chunk size" variable into my program. My feeling is that this will save quite a bit of time on IO, but it's just a feeling, and I look forward to actual informed opinions!
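For a sense of scale, here is a rough count of the pairs each option would emit. It is only a back-of-the-envelope sketch using the numbers above (200 files, 24,000-pixel chunks); the variable names are made up for illustration.

```python
# Grid dimensions and file count taken from the question.
ROWS, COLS = 21_600, 43_200
FILES = 200
CHUNK_SIZE = 24_000  # the "say, 24,000 pixels" chunk size, an arbitrary choice

pixels_per_file = ROWS * COLS                       # 933,120,000
chunks_per_file = -(-pixels_per_file // CHUNK_SIZE)  # ceiling division: 38,880

per_pixel_pairs = pixels_per_file * FILES  # 186,624,000,000 pairs in total
chunked_pairs = chunks_per_file * FILES    # 7,776,000 pairs in total

print(f"per-pixel pairs: {per_pixel_pairs:,}")
print(f"chunked pairs:   {chunked_pairs:,}")
print(f"reduction:       {per_pixel_pairs // chunked_pairs:,}x")
```

So the chunked version emits 24,000 times fewer records, each carrying a larger value.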
From a Hadoop project I worked on, I can confirm that the number of K,V pairs has a direct impact on load, CPU time, and IO. If you can limit the number of chunks and still retain enough scalability for your situation, I would certainly try to go there.
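To make the chunked approach concrete, here is a minimal sketch in plain Python rather than the actual Hadoop API. It assumes each file has already been read into a flat, row-major sequence of integers, and the names `map_chunks` and `reduce_chunk` are hypothetical. The map step emits one pair per chunk, keyed by chunk index and time period; the reduce step takes all periods for one chunk and turns them into one time series per pixel.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

ChunkKey = Tuple[int, str]  # (chunk index, time period), both illustrative


def map_chunks(period: str, pixels: Sequence[int],
               chunk_size: int) -> Iterator[Tuple[ChunkKey, List[int]]]:
    """Emit one (key, value) pair per chunk instead of one per pixel."""
    for start in range(0, len(pixels), chunk_size):
        chunk_index = start // chunk_size
        yield (chunk_index, period), list(pixels[start:start + chunk_size])


def reduce_chunk(records: Iterable[Tuple[str, List[int]]]) -> List[List[int]]:
    """Given (period, values) records for ONE chunk, return a time series
    per pixel, ordered by period."""
    ordered = sorted(records, key=lambda record: record[0])
    # zip(*...) transposes "one list per period" into "one list per pixel".
    return [list(series) for series in zip(*(values for _, values in ordered))]


if __name__ == "__main__":
    january = list(map_chunks("2000-01", [1, 2, 3, 4], chunk_size=2))
    february = list(map_chunks("2000-02", [5, 6, 7, 8], chunk_size=2))
    # Collect the records belonging to chunk 0 across both periods.
    chunk0 = [(key[1], values) for key, values in january + february if key[0] == 0]
    print(reduce_chunk(chunk0))
```

The chunk size stays a tunable parameter, so you can measure a few values against your cluster instead of guessing.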