I am new to the Hadoop MapReduce framework, and I am thinking of using Hadoop MapReduce to parse my data. I have thousands of big delimited files, and I am thinking of writing a MapReduce job to parse those files and load them into a Hive data warehouse. I have written a parser in Perl which can parse those files, but I am stuck on doing the same with Hadoop MapReduce.
For example, I have a file like: x=a y=b z=c..... x=p y=q z=s..... x=1 z=2 .... and so on.
Now I have to load this file as columns (x, y, z) into a Hive table, but I am not able to figure out how I can proceed with it. Any guidance would be really helpful.
Another problem is that in some files the field y is missing, and I have to handle that condition in the MapReduce job. So far, I have tried using streaming.jar and passing my parser.pl as the mapper to that jar file. I think that is not the way to do it :), but I was just trying to see if it would work. I also thought of using Hive's LOAD function, but the missing column will create problems if I specify a RegexSerDe in the Hive table.
I am lost in this now; if anyone could guide me with this I would be thankful :)
Regards, Atul
I posted something about this a while ago to my blog. (Google "hive parse_url"; it should be in the top few results.)
I was parsing URLs, but in this case you will want to use str_to_map:
str_to_map(arg1, arg2, arg3)

arg1 => String to process
arg2 => Key-value pair separator
arg3 => Key-value separator
str = "a=1 b=42 x=abc"
str_to_map(str, " ", "=")
The result of str_to_map
will give you a map<str, str>
of 3 key-value pairs.
str_to_map(str, " ", "=")["a"] --will return "1"
str_to_map(str, " ", "=")["b"] --will return "42"
We can load this into a Hive table via:
INSERT OVERWRITE TABLE new_table_with_cols_x_y_z
SELECT params["x"], params["y"], params["z"]
FROM (
  SELECT str_to_map(raw_line, " ", "=") AS params
  FROM data
) raw_line_from_data;
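For completeness, here is a minimal sketch of the staging table that the query above reads from. The table and column names (data, raw_line) match the query; the LOCATION path is just a placeholder for wherever your delimited files sit in HDFS, and this assumes Hive's default text format, which will put each whole line into the single string column:

CREATE EXTERNAL TABLE data (raw_line STRING)
LOCATION '/path/to/your/files';  -- placeholder: replace with your HDFS path

After that, one INSERT OVERWRITE as above parses everything in one pass, with no Perl or custom MapReduce job needed.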