Problem with TXT file extraction in Ruby

https://www.devze.com 2023-03-01 13:42 Source: Internet
I have a data file in TXT format, and I would like to parse the URL field from it using the Ruby code below:

f = File.open(txt_file, "r")
f.each_line { |line|
  rows = line.split(',')
  rows[3].each do |url|
    next if url=="URL"
    puts url
  end
}

TXT contains:

name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"

output:

0

Why does the output come from the option field "0,0,0,0,0,0"? How do I skip this and get the URL field?

Environment: Ruby 1.8.7, Rails 2.3.8, RubyGems 1.3.7


I'd check out a CSV parsing tool to make this easier:

require 'rubygems'
require 'faster_csv'

# Print the fourth column (URL) of every row, skipping the header row.
FasterCSV.foreach(txt_file, :quote_char => '"',
        :col_sep => ',', :row_sep => :auto) do |row|
  puts row[3] if row[3] != "URL"
end

Also, I think you're misunderstanding how split() works. If you run split() against one row from your file, you get back a flat array of columns for that single row, not a multidimensional array as rows[3].each would suggest.
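
To see why rows[3] ends up being "0", it can help to split one data line by hand. This is a minimal sketch using a data line from the question; the variable names are only illustrative:

```ruby
# One data line from the question's TXT file.
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# split(',') knows nothing about quotes, so the quoted option field
# is cut into six pieces and the result is one flat array of strings.
fields = line.split(',')

p fields.length  # 9 pieces instead of the expected 4 columns
p fields[3]      # "0", a fragment of the option field
p fields.last    # the URL, still wrapped in double quotes
```

In Ruby 1.8.7, String#each iterates over the lines of a string, which is why rows[3].each printed "0" instead of raising an error. Ruby 1.9 removed that method.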


EDIT: Before you read on, note that I completely agree with the answer by Jeff Swensen. I'll leave my answer here regardless.

I'm not entirely sure what your inner loop (rows[3].each) is for, because a single line contains only a single URL, so there is nothing to iterate over. You could split on the ** characters and get back an Array of URLs, but you would still need to remove the extra double quotes. Alternatively, you could use a regular expression, like so:

#!/usr/bin/env ruby

f = DATA
urls = f.readlines.map do |line|
  line[/([^"]+)"\*\*/, 1] 
end
urls.compact!

p urls

__END__
name ,option,price, **URL**
"x", "0,0,0,0,0,0", "123.40",**"http://domain.com/xym.jpg"**
"x", "0,0,0,0,0,0", "111.34",**"http://domain.com/yum.jpg"**

The call to compact! is needed because map inserts nil whenever a line doesn't match the expression. For details on indexing a string with a regex, see the documentation for String#[].
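
The same idea can be sketched against the data without the ** markers. This variant is an assumption on my part, not the code above: it anchors the regex on the last quoted field of each line, so it only works if the URL is always the final column.

```ruby
lines = [
  'name,option,price,URL',
  '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"',
  '"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"'
]

# String#[] with a regex and a capture index returns the captured
# text, or nil when the line does not match (here: the header line).
urls = lines.map { |line| line[/"([^"]+)"\s*\z/, 1] }

# Drop the nil entries produced by non-matching lines.
urls.compact!

p urls
```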


The reason "0" is the result is that your code blindly splits on the comma character, while you seem to expect CSV-style parsing (where a column value may contain delimiter characters if the entire value is enclosed in quotes). I highly suggest using a CSV parser. If you are using Ruby 1.9.2, you already have access to FasterCSV, because it ships as the standard csv library.
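
Here is a minimal sketch of that approach with the standard csv library (Ruby 1.9 and later), using the sample data from the question. The gsub step is an assumption of mine: the sample has spaces between a comma and the opening quote, which a strict CSV parser may reject as illegal quoting, so the sketch strips them first. That shortcut is only safe if no quoted value itself contains a comma followed by whitespace and a quote.

```ruby
require 'csv'

raw = <<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

# Remove the whitespace between a comma and an opening quote so that
# every quoted field starts directly after its delimiter.
normalized = raw.gsub(/,[ \t]+"/, ',"')

# headers: true treats the first line as column names, so the URL
# column can be read by name and the header row is skipped.
urls = CSV.parse(normalized, headers: true).map { |row| row['URL'] }

puts urls
```

Reading the column by name rather than by index keeps the code working if the column order in the file changes.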


If you are sure that the fields you want are always surrounded by double quotes, you can use the quotes rather than the commas as the basis for extraction.

File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?<!\\)"(.*?)(?<!\\)"/)
    cols[3].tap{|url| puts url if url}
  end
end
  • In your code, the opened IO is never closed, which is bad practice. Passing a block to File.open closes the file for you when the block finishes.
  • The two (?<!\\)" parts of the regex each match a double quote that is not preceded by a backslash, that is, a non-escaped double quote. They use negative lookbehind.
  • .*? is a non-greedy match, which keeps a match from running past the next non-escaped double quote.
  • tap avoids writing cols[3] twice, once in puts and once in if.
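
To make the shape of the scan result concrete, here is a small sketch that runs the same regex on one data line and on the header line from the question. It needs a regex engine with lookbehind support (Ruby 1.9 or later); the variable names are illustrative.

```ruby
line = '"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"'

# With one capture group, scan returns an array of one-element arrays,
# one per quoted field.
cols = line.scan(/(?<!\\)"(.*?)(?<!\\)"/)
p cols

# flatten turns that into plain strings; the URL is the fourth one.
url = cols.flatten[3]
puts url

# The header line has no quoted fields, so scan finds nothing and
# indexing the result with [3] gives nil.
header_cols = 'name,option,price,URL'.scan(/(?<!\\)"(.*?)(?<!\\)"/)
p header_cols
```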

Edit again

If you use Ruby 1.8.7, you can either

  • switch your regex engine to Oniguruma by following the steps at http://oniguruma.rubyforge.org/

or

  • replace the regex with one that does not need lookbehind. The version below also avoids tap:

File.open(txt_file) do |f|
  f.each_line do |l|
    cols = l.scan(/(?:\A|[^\\])"(.*?[^\\]|)"/)
    url = cols[3]
    puts url if url
  end
end

I would recommend using Oniguruma. It is the regex engine used since Ruby 1.9, and it is more powerful and faster than the one used in Ruby 1.8. It can be installed easily on Ruby 1.8.


The data is in CSV format, but if all you want to do is grab the last field in the string, then do just that:

text =<<EOT
name,option,price,URL
"x", "0,0,0,0,0,0", "123.40","http://domain.com/xym.jpg"
"x", "0,0,0,0,0,0", "111.34","http://domain.com/yum.jpg"
EOT

require 'pp'
text.lines.map{ |l| l.split(',').last }

If you want to clean up the double-quotes and trailing line-breaks:

text.lines.map{ |l| l.split(',').last.gsub('"', '').chomp }
# => ["URL", "http://domain.com/xym.jpg", "http://domain.com/yum.jpg"]