I'm doing some very simple data mining (actually, just a word count) as my research project for my undergraduate program.
I'm going to use Amazon Elastic MapReduce.
I need to upload a 4 GB .xml file.
What is the best way to do it?
Should I upload small zip files and somehow unzip them in the bucket? Or should I split the file, upload the pieces, and then use all the small files as input to a streaming MapReduce job?
You should either put the XML into a SequenceFile and bzip2 it, or bzip2 the file as-is and decompress it in the cloud. (bzip2 is a splittable codec in Hadoop, so even a single compressed file can still be processed by multiple mappers.)
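For the second option, here is a minimal sketch of stream-compressing the file with Python's bz2 module before upload, so the whole 4 GB never sits in memory. File names are placeholders:

```python
import bz2

# Stream-compress the raw XML with bzip2 before uploading.
# bzip2 is splittable in Hadoop, so one .bz2 file can still be
# read by multiple mappers. "data.xml" is a placeholder name.
CHUNK = 64 * 1024 * 1024  # 64 MB read buffer

with open("data.xml", "rb") as src, bz2.open("data.xml.bz2", "wb") as dst:
    while True:
        block = src.read(CHUNK)
        if not block:
            break
        dst.write(block)
```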
If you want to upload one big file, S3 supports multipart uploads. For more details, start at the documentation page.
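As a sketch using boto3 (bucket and key names are placeholders): `upload_file` switches to a multipart upload automatically once the file crosses the configured threshold, uploading parts in parallel and retrying them individually.

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Above the threshold, upload_file performs a multipart upload:
# the file is sent as 64 MB parts, up to 8 in parallel.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)

s3 = boto3.client("s3")
# "my-bucket" and the key are hypothetical names.
s3.upload_file("data.xml.bz2", "my-bucket", "input/data.xml.bz2", Config=config)
```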
If the goal is to feed this data into EMR (Spark, Flink, etc.), multiple small compressed files are preferable: they let the cluster parallelize the load, and EMR Spark can read gzip- or bzip2-compressed files from S3 out of the box. A splitting sketch follows below.
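A sketch of pre-splitting the XML into bzip2-compressed parts before upload. Sizes and file names are illustrative; parts break on line boundaries, which assumes a record spanning a split is acceptable, as it usually is for a word count:

```python
import bz2

PART_SIZE = 256 * 1024 * 1024  # target ~256 MB of raw XML per part

part_idx, written = 0, 0
dst = bz2.open(f"part-{part_idx:04d}.xml.bz2", "wb")
with open("data.xml", "rb") as src:  # "data.xml" is a placeholder name
    for line in src:
        # Start a new part once the current one reaches the target size,
        # so each split falls on a line boundary rather than mid-line.
        if written >= PART_SIZE:
            dst.close()
            part_idx += 1
            written = 0
            dst = bz2.open(f"part-{part_idx:04d}.xml.bz2", "wb")
        dst.write(line)
        written += len(line)
dst.close()
```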