I need to extract some values from a multi-line string (which I read from the text body of emails). I want to be able to feed patterns to my parser so I can support different email formats later. I came up with the following:
#!/usr/bin/env ruby
text1 = <<-eos
Lorem ipsum dolor sit amet,
Name: Pepe Manuel Periquita
Email: pepe@manuel.net
Sisters: 1
Brothers: 3
Children: 2
Lorem ipsum dolor sit amet
eos
pattern1 = {
:exp => /Name:[\s]*(.*?)$\s*
Email:[\s]*(.*?)$\s*
Sisters:[\s]*(.*?)$\s*
Brothers:[\s]*(.*?)$\s*
Children:[\s]*(.*?)$/mx,
:blk => lambda do |m|
m.flatten!
{:name => m[0],
:email => m[1],
:total => m.drop(2).inject(0){|sum,item| sum + item.to_i}}
end
}
# Scan on text returns
#[["Pepe Manuel Periquita", "pepe@manuel.net", "1", "3", "2"]]
def do_parse text, pattern
data = pattern[:blk].call(text.scan(pattern[:exp]))
puts data.inspect
end
do_parse text1, pattern1
# ./text_parser.rb
# {:email=>"pepe@manuel.net", :total=>6, :name=>"Pepe Manuel Periquita"}
So, I define the pattern as a regular expression paired with a block that builds a hash from the matches. The "parser" simply takes the text and applies the rules: it matches the regular expression against the text with scan and executes the block on the result.
At the moment I have to parse emails with the format shown in text1, but later I would like to add patterns as easily as possible to extract data from different emails (the format will be fixed for each type of email). Therefore I would like to simplify the pattern by moving as much as possible into the "parser". The code above works and extracts the data, but most of the work is done in the pattern.
Is this the right way to go?
Could it be simplified, or can you think of a different or better solution for this problem?
Update
I updated the parser following Tonttu's solution, so the pattern hash is now:
pattern2 = {
:exp => /^(.+?):\s*(.+)$/,
:blk => lambda do |m|
r = Hash[m.map{|x| [x[0].downcase.to_sym, x[1]]}]
{:name => r[:name],
:email => r[:email],
:total => r[:children].to_i + r[:brothers].to_i + r[:sisters].to_i}
end
}
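For reference, here is a minimal sketch of how this pattern could be run through the same parser. It assumes do_parse is changed to return the hash instead of printing it; the sample text is rebuilt with join only to keep the snippet self-contained.

```ruby
# Sample email body, same content as text1 in the question.
text = [
  "Lorem ipsum dolor sit amet,",
  "Name: Pepe Manuel Periquita",
  "Email: pepe@manuel.net",
  "Sisters: 1",
  "Brothers: 3",
  "Children: 2",
  "Lorem ipsum dolor sit amet"
].join("\n")

# The updated pattern: a generic "Key: value" regex plus a block
# that picks the fields it needs from the resulting hash.
pattern2 = {
  :exp => /^(.+?):\s*(.+)$/,
  :blk => lambda do |m|
    r = Hash[m.map { |x| [x[0].downcase.to_sym, x[1]] }]
    { :name  => r[:name],
      :email => r[:email],
      :total => r[:children].to_i + r[:brothers].to_i + r[:sisters].to_i }
  end
}

# Assumption: return the data rather than printing it, so callers can use it.
def do_parse(text, pattern)
  pattern[:blk].call(text.scan(pattern[:exp]))
end

result = do_parse(text, pattern2)
puts result.inspect
```

Lines without a colon (the Lorem ipsum lines here) are skipped by the regex, but any other line containing a colon would also end up in the intermediate hash.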
Maybe something like this is generic enough?
require 'pp'
pp Hash[*text1.scan(/^(.+?):\s(.+)$/).map{|x|
[x[0].downcase.to_sym, x[1]]
}.flatten]
=>
{:sisters=>"1",
:brothers=>"3",
:children=>"2",
:name=>"Pepe Manuel Periquita",
:email=>"pepe@manuel.net"}
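If the goal is to keep the per-email patterns as small as possible, the generic scan could live in the parser itself, leaving each pattern as just a block over the extracted hash. The following is only a sketch of that idea: extract_fields, parse_email and family_pattern are hypothetical names, not part of the original code.

```ruby
# Hypothetical helper: turn every "Key: value" line into a symbol-keyed hash.
def extract_fields(text)
  pairs = text.scan(/^(.+?):\s*(.+)$/)
  Hash[pairs.map { |key, value| [key.downcase.to_sym, value] }]
end

# Hypothetical parser: the pattern is just a lambda that
# post-processes the hash of extracted fields.
def parse_email(text, pattern)
  pattern.call(extract_fields(text))
end

# A pattern for the email format from the question.
family_pattern = lambda do |f|
  { :name  => f[:name],
    :email => f[:email],
    # Missing fields are nil, and nil.to_i is 0, so they do not break the sum.
    :total => [:sisters, :brothers, :children].inject(0) { |sum, k| sum + f[k].to_i } }
end

sample = "Hello,\nName: Pepe Manuel Periquita\nEmail: pepe@manuel.net\n" \
         "Sisters: 1\nBrothers: 3\nChildren: 2\nBye"

parsed = parse_email(sample, family_pattern)
puts parsed.inspect
```

Adding support for another email type then means writing one more lambda, as long as that email also uses "Key: value" lines.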