
Apache Pig: Extract query parameters from web log

Source: https://www.devze.com, 2023-02-16 16:13 (reposted from the web)

I am working on analyzing AWS CloudFront access logs.

I have the code to load the lines of the file

    raw_logs2 = LOAD 'file:///home/ec2-user/ENWRZAC68E00M.2011-02-28-18.72jA8eGh'
      USING PigStorage('\t')
      AS (
        date: chararray, time: chararray, x_edge_location: chararray, sc_bytes: int,
        c_ip: chararray, cs_method: chararray, cs_host: chararray, cs_uri_stem: chararray,
        sc_status: chararray, cs_referer: chararray, cs_user_agent: chararray, cs_uri_query: chararray
      );

Now I am trying to parse the query string parameters (name/value pairs):

p=searchresults&s=homesforsale&gad=&gci=FOUNTAIN%2520VALLEY&gst=CA&gzi=&k=fountainvalleyca&ts=1298918206&

How can I add additional columns to my raw_logs2 relation for the values of p, s, and gci in the query string?


One quick way to do it is to use REGEX_EXTRACT_ALL:

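    -- The pattern has to match the whole query string, so it assumes
    -- the parameters appear in the order p, s, ..., gci.
    raw_logs =
      FOREACH raw_logs2 GENERATE
        *,
        FLATTEN(REGEX_EXTRACT_ALL(cs_uri_query, 'p=(.+?)&s=(.+?)&.+?gci=(.+?)&.+?'))
          AS (p:CHARARRAY, s:CHARARRAY, gci:CHARARRAY);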
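
If the parameters can appear in a different order, or some of them can be missing, a single pattern becomes fragile. One possible alternative, shown here only as a sketch, is to call REGEX_EXTRACT once per parameter. The relation name logs_with_params is made up for illustration. Each pattern is anchored at both ends so that it should behave the same whether the Pig version in use applies the pattern to the whole string or searches within it. A parameter that is absent should come back as null.

```pig
-- Sketch: extract each parameter independently of its position
-- in the query string. logs_with_params is an illustrative name.
logs_with_params =
  FOREACH raw_logs2 GENERATE
    *,
    -- '^(?:.*&)?' lets the name start the string or follow an '&';
    -- '([^&]*)' captures the value up to the next '&'.
    REGEX_EXTRACT(cs_uri_query, '^(?:.*&)?p=([^&]*).*$', 1)   AS p,
    REGEX_EXTRACT(cs_uri_query, '^(?:.*&)?s=([^&]*).*$', 1)   AS s,
    REGEX_EXTRACT(cs_uri_query, '^(?:.*&)?gci=([^&]*).*$', 1) AS gci;
```

Note that the values are returned exactly as logged, so gci in the sample line would still be percent-encoded (FOUNTAIN%2520VALLEY) and would need decoding in a separate step.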