Huge files in Hadoop: how to store metadata?

I have a use case to upload some terabytes of text files as sequence files on HDFS.

These text files have several layouts, ranging from 32 to 62 columns (this layout information is the metadata).

What would be a good way to upload these files along with their metadata:

  1. Creating a key/value class per text-file layout and using it to create and upload the sequence files?

  2. Creating a SequenceFile.Metadata header in each file being uploaded as a sequence file individually? (A sketch of this option follows below.)
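For option 2, this is roughly what I have in mind (the path, layout name, and column values below are placeholders, not my real data):

    // Rough sketch of option 2: attach the column layout as a per-file
    // header using SequenceFile.Metadata. The path and metadata keys such
    // as "layout.columns" are placeholders.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteWithMetadataHeader {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Describe this file's layout in the SequenceFile header.
            SequenceFile.Metadata meta = new SequenceFile.Metadata();
            meta.set(new Text("layout.name"), new Text("layoutA"));
            meta.set(new Text("layout.columns"), new Text("62"));

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/layoutA/part-00000.seq")),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.metadata(meta))) {
                writer.append(new LongWritable(1L), new Text("col1|col2|...|col62"));
            }
        }
    }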

Any input is appreciated!

Thanks


I prefer storing metadata with the data and then designing your application to be metadata-driven, as opposed to embedding the metadata in the design or implementation of your application, which means that updates to the metadata require updates to your app. Of course there are limits to how far you can take a metadata-driven application.

You can embed the metadata with the data, for example by using an encoding scheme like JSON, or you can keep the metadata alongside the data, for example as records in the SeqFile dedicated to describing the metadata. You might use reserved tags for those keys so that the metadata gets its own namespace, separate from the namespace used by the keys of the actual data.
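For example, a minimal sketch of the reserved-tag idea (the "__meta__" prefix and the JSON layout description are assumptions of mine, not a fixed convention):

    // Metadata rows live in the same SequenceFile as the data, but their
    // keys start with a reserved prefix ("__meta__" - a placeholder) so
    // they never collide with real data keys.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReservedKeyMetadata {
        private static final String META_PREFIX = "__meta__";

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(new Path("/data/input-with-meta.seq")),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(Text.class))) {

                // Metadata record first, under the reserved namespace.
                writer.append(new Text(META_PREFIX + "columns"),
                              new Text("{\"count\":62,\"names\":[\"id\",\"ts\",\"...\"]}"));

                // Ordinary data records use un-prefixed keys.
                writer.append(new Text("record-000001"), new Text("val1|val2|...|val62"));
            }
            // Map tasks can then check key.toString().startsWith(META_PREFIX)
            // and route metadata records separately from data records.
        }
    }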

As for the recommendation of whether this should be packaged into separate Hadoop files, bear in mind that Hadoop can be instructed to split a file into splits (the input for the map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single HDFS file is that it more closely resembles the unit of containment of your original data.
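As an illustration of that point (the 128 MB figure below is only an example value, not a recommendation tuned to your data):

    // Even one large SequenceFile is processed by many map tasks, because
    // the split size, not the file count, drives map-phase parallelism.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "process-large-seqfile");

            // Cap each input split at ~128 MB; a 1 TB SequenceFile then
            // yields roughly 8000 splits, i.e. up to 8000 parallel map tasks.
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
            // Equivalent configuration property:
            // conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
            //              128L * 1024 * 1024);
        }
    }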

As for the recommendation about key types (i.e. whether to use Text vs. binary), consider that the key will be compared against other keys. The more compact the key, the faster the comparison. Thus if you can store a dense version of the key, that is preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same, it will also help performance. So, for instance, serializing a Java class as the key would not be recommended, because the serialized stream begins with the package name of your class, which is likely to be the same as that of every other key in the file.
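A small sketch of that key-design advice, with made-up field names (assuming a numeric record id is the most discriminating field):

    // A compact, fixed-width binary key whose leading bytes are the most
    // discriminating field, so byte-wise key comparisons can stop early.
    import java.nio.ByteBuffer;
    import org.apache.hadoop.io.BytesWritable;

    public class CompactKeyExample {
        // recordId varies per record, so it goes first; layoutId is shared
        // by many records and goes last.
        static BytesWritable makeKey(long recordId, short layoutId) {
            ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + Short.BYTES);
            buf.putLong(recordId);   // discriminating bytes first
            buf.putShort(layoutId);  // low-entropy bytes last
            return new BytesWritable(buf.array());
        }
    }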


If you want the data and its metadata bundled together, then the Avro format is the appropriate choice. It also allows schema evolution.
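A minimal sketch of what that could look like with the Avro generic API (the field names and file path are made up for illustration); the schema itself is written into the file header alongside the records, and the nullable field with a default is the kind of thing that permits later schema evolution:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroBundleExample {
        public static void main(String[] args) throws Exception {
            // Illustrative schema: one required column, one optional column
            // added with a default (enables schema evolution for readers).
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Row\",\"fields\":["
              + "{\"name\":\"col1\",\"type\":\"string\"},"
              + "{\"name\":\"col2\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

            GenericRecord row = new GenericData.Record(schema);
            row.put("col1", "value1");
            row.put("col2", "value2");

            try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
                writer.create(schema, new File("rows.avro")); // schema stored in the file header
                writer.append(row);
            }
        }
    }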


The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data to be the key; the data itself is the value, as Text. SequenceFiles are designed for storing key/value pairs; if that's not what your data is, then don't use a SequenceFile. You could just upload the unprocessed text files and feed those to Hadoop.
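A rough sketch of that approach, assuming the first tab-separated column happens to be the meaningful field to use as the key (adjust to your actual layouts):

    // Convert a local text file into a Text/Text SequenceFile, one record
    // per input line, keyed by an assumed first column.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class TextToSequenceFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]));
                 SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                     SequenceFile.Writer.file(new Path(args[1])),
                     SequenceFile.Writer.keyClass(Text.class),
                     SequenceFile.Writer.valueClass(Text.class))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String key = line.split("\t", 2)[0]; // assumed: first column is meaningful
                    writer.append(new Text(key), new Text(line));
                }
            }
        }
    }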

For best performance, do not make each file terabytes in size. Hadoop's map stage runs one map task per input split (for a small or unsplittable file, that means one task per file). You want more splits than you have CPU cores in your Hadoop cluster; otherwise you will have one CPU doing 1 TB of work and a lot of idle CPUs. A good target size is probably 64-128 MB, on the order of an HDFS block, but for best results you should measure this yourself.
