Building a hash of hashes to determine the largest numerical value in Ruby

Developer https://www.devze.com 2023-01-25 08:09 Source: Internet
I have a data file that looks like this:

FBpp0070000 acyr193594273 acyr 866
FBpp0070000 acyr193577824 acyr 536
FBpp0070000 acyr193693009 acyr 445
FBpp0070000 bomb193605819 bomb 503
FBpp0070000 bomb193676398 bomb 101
FBpp0070001 acyr193618043 acyr 316
FBpp0070001 acyr193617997 acyr 313
FBpp0070001 bomb193638865 bomb 482
FBpp0070001 locu193695159 locu 220
FBpp0070001 locu193638863 locu 220

The data file is about 45,000 lines long.

My goal is to have this:

FBpp0070000 acyr193594273 acyr 866
FBpp0070000 bomb193605819 bomb 503
FBpp0070001 acyr193618043 acyr 316
FBpp0070001 bomb193638865 bomb 482
FBpp0070001 locu193695159 locu 220

That is to say, keep only those lines with the highest score in column 4, for each different value in column 3, for each value in column 1.

There are two complications: 1) column 1 contains many duplicate "keys", and 2) some "scores" in column 4 are equal. When scores tie, I want to keep only one of the tied lines.

I have, in the past, built a hash in Perl that can handle multiple duplicate keys.

Here is what I have in Ruby so far.

hash = Hash.new{|h,k| h[k]=Hash.new(&h.default_proc) }  
title = ''

File.open('test1.txt', 'r').each do |line|
  line.chomp!

  query, hit, taxa, score = line.split(/\s/)
  hash[query][hit][taxa] = score

  # end

  #p "#{query}: #{taxa}: #{score}"

end
p hash

So, I am hoping someone could help me to determine 1) if I am, indeed, going about this correctly, and 2) if so, how to extract the lines I need.

Thanks.


The following seems to do what you want, given the example input above. You'll need to re-sort the data at the end to get the output format you want.

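#!/usr/bin/env ruby

require 'pp'

data = {}
# File.foreach reads line by line and closes the file when done.
File.foreach("input.txt") do |l|
  query, hit, taxa, score = l.split   # split on any run of whitespace
  next if score.nil?                  # skip blank or malformed lines
  score = score.to_i
  data[query] ||= {}
  data[query][taxa] ||= [0, nil]
  # Strict > keeps the first hit seen when scores tie.
  data[query][taxa] = [score, hit] if score > data[query][taxa].first
end

pp data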

This gives:

dj2@Magnus:~/Development/test $ ./out.rb 
{"FBpp0070000"=>
  {"bomb"=>[503, "bomb193605819"], "acyr"=>[866, "acyr193594273"]},
 "FBpp0070001"=>
  {"bomb"=>[482, "bomb193638865"],
   "locu"=>[220, "locu193695159"],
   "acyr"=>[316, "acyr193618043"]}}
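To get back to the four-column layout from the question, one option is to key on the (query, taxa) pair and rebuild each line at the end. This is only a sketch: the best_hits helper and the inline sample are made up for illustration, and it assumes whitespace-separated columns with integer scores.

```ruby
# Hypothetical helper (not from the original answer): keep the best-scoring
# hit for each (query, taxa) pair and return lines in the input's layout.
def best_hits(lines)
  best = {}
  lines.each do |line|
    query, hit, taxa, score = line.split
    next if score.nil?                 # skip blank or short lines
    score = score.to_i
    key = [query, taxa]
    # Strict > means the first line seen wins when scores tie.
    best[key] = [hit, score] if best[key].nil? || score > best[key].last
  end
  # Sort by query, then taxa, and rebuild "query hit taxa score".
  best.sort.map { |(query, taxa), (hit, score)| "#{query} #{hit} #{taxa} #{score}" }
end

# Sample rows copied from the question.
sample = <<~DATA
  FBpp0070000 acyr193594273 acyr 866
  FBpp0070000 acyr193577824 acyr 536
  FBpp0070000 acyr193693009 acyr 445
  FBpp0070000 bomb193605819 bomb 503
  FBpp0070000 bomb193676398 bomb 101
  FBpp0070001 acyr193618043 acyr 316
  FBpp0070001 acyr193617997 acyr 313
  FBpp0070001 bomb193638865 bomb 482
  FBpp0070001 locu193695159 locu 220
  FBpp0070001 locu193638863 locu 220
DATA

puts best_hits(sample.lines)
```

For the real file, something like best_hits(File.foreach("input.txt")) should work the same way, since the helper only needs an object that responds to each.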