
Extract values from a text body in Ruby

Developer https://www.devze.com 2023-02-06 23:13 Source: Internet
I need to extract some values from a multi-line string (which I read from the text body of emails). I want to be able to feed patterns to my parser so I can customize different emails later. I came up with the following:

#!/usr/bin/env ruby

text1 = <<-eos
Lorem ipsum dolor sit amet, 

Name: Pepe Manuel Periquita

Email: pepe@manuel.net

Sisters: 1
Brothers: 3
Children: 2

Lorem ipsum dolor sit amet
eos

pattern1 = {
  :exp => /Name:[\s]*(.*?)$\s*
          Email:[\s]*(.*?)$\s*
          Sisters:[\s]*(.*?)$\s*
          Brothers:[\s]*(.*?)$\s*
          Children:[\s]*(.*?)$/mx,
  :blk => lambda do |m|
    m.flatten!
    {:name => m[0],
     :email => m[1],
     :total => m.drop(2).inject(0){|sum,item| sum + item.to_i}}
  end
}

# Scan on text returns 
#[["Pepe Manuel Periquita", "pepe@manuel.net", "1", "3", "2"]]

def do_parse(text, pattern)
  data = pattern[:blk].call(text.scan(pattern[:exp]))

  puts data.inspect
end


do_parse text1, pattern1

# ./text_parser.rb
# {:email=>"pepe@manuel.net", :total=>6, :name=>"Pepe Manuel Periquita"}

So, I define the pattern as a regular expression paired with a block that builds a hash from the matches. The "parser" simply takes the text and applies the rules: it matches the regular expression against the text with scan and executes the block on the result.

At the moment I have to parse emails with the format shown in text1, but later I would like to add patterns as easily as possible to extract data from different emails (the format will be fixed for each type of email). Therefore I would like to simplify the pattern by moving as much as possible into the "parser". The code above works and extracts the data, but most of the work is still done in the pattern...

Is this the right way to go?

Could it be simplified, or would you suggest a different / better solution for this problem?
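
To make "moving as much as possible into the parser" concrete, here is a rough sketch of one direction, assuming the fields always appear in a fixed order: the pattern shrinks to a list of labels and the parser builds the expression itself. The names build_regex and parse_fields are made up for illustration and are not part of the code above.

```ruby
# Hypothetical helper: turn a list of labels into one regex, so that the
# pattern no longer has to spell out the expression by hand.
def build_regex(labels)
  parts = labels.map { |label| "#{Regexp.escape(label)}:[ \\t]*(.*?)$" }
  Regexp.new(parts.join('\s*'))
end

# Hypothetical parser: returns a hash keyed by the downcased labels,
# or nil when the labels do not appear in the text in the given order.
def parse_fields(text, labels)
  match = build_regex(labels).match(text)
  return nil unless match

  keys = labels.map { |label| label.downcase.to_sym }
  Hash[keys.zip(match.captures)]
end

sample = "Lorem ipsum\n\nName: Pepe Manuel Periquita\n\nEmail: pepe@manuel.net\n\n" \
         "Sisters: 1\nBrothers: 3\nChildren: 2\n\nLorem ipsum\n"

fields = parse_fields(sample, %w[Name Email Sisters Brothers Children])
total  = %i[sisters brothers children].sum { |key| fields[key].to_i }
```

The values stay strings, so any arithmetic such as the total would still live in a small block or in the caller.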

Update

I updated the parser following Tonttu's solution, so the pattern hash is now:

pattern2 = {
  :exp => /^(.+?):\s*(.+)$/,
  :blk => lambda do |m|
    r = Hash[m.map{|x| [x[0].downcase.to_sym, x[1]]}]

    {:name => r[:name],
     :email => r[:email],
     :total => r[:children].to_i + r[:brothers].to_i + r[:sisters].to_i}
  end
}
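
One thing to watch with the expression in pattern2: \s* also matches newlines, so a label with an empty value can swallow the following line. A small sketch of the effect, where "Phone:" is a made-up field that is not in my emails, and [ \t]* is just one possible way to keep each match on a single line:

```ruby
sample = "Phone:\nSisters: 1\nBrothers: 3\n"

# \s* also matches the newline, so the empty "Phone:" swallows the next line.
greedy = sample.scan(/^(.+?):\s*(.+)$/)
# => [["Phone", "Sisters: 1"], ["Brothers", "3"]]

# Restricting the gap to spaces and tabs keeps every match on one line.
strict = sample.scan(/^(.+?):[ \t]*(.+)$/)
# => [["Sisters", "1"], ["Brothers", "3"]]
```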


Maybe something like this is generic enough?

require 'pp'  # only needed on Ruby older than 2.5

pp Hash[*text1.scan(/^(.+?):\s(.+)$/).map{|x|
     [x[0].downcase.to_sym, x[1]]
   }.flatten]

=>
{:sisters=>"1",
 :brothers=>"3",
 :children=>"2",
 :name=>"Pepe Manuel Periquita",
 :email=>"pepe@manuel.net"}
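
If that is generic enough, the extraction could probably live in the parser, leaving each pattern as just a block over the resulting hash. A minimal sketch, assuming every email uses "Key: value" lines; extract_fields and summary are illustrative names, and Array#to_h needs Ruby 2.1 or later:

```ruby
# Illustrative helper: every "Key: value" line becomes a symbol-keyed entry.
# Values are left as strings.
def extract_fields(text)
  text.scan(/^(.+?):\s(.+)$/).map { |key, value| [key.downcase.to_sym, value] }.to_h
end

# With extraction done by the parser, a pattern is only a block over the hash.
summary = lambda do |fields|
  { :name  => fields[:name],
    :email => fields[:email],
    :total => [:sisters, :brothers, :children].sum { |key| fields[key].to_i } }
end

sample = "Name: Pepe Manuel Periquita\n\nEmail: pepe@manuel.net\n\n" \
         "Sisters: 1\nBrothers: 3\nChildren: 2\n"

result = summary.call(extract_fields(sample))
```

Here result holds the same name, email and total of 6 as the original pattern1 output.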