开发者

Using heuristics to understand table columns

开发者 https://www.devze.com 2023-02-08 06:16 出处:网络
I have a set of semicolon separated values extracted from a table and I need to process this data. Unfortunately, I don\'t know which column has what information.

I have a set of semicolon separated values extracted from a table and I need to process this data.

Unfortunately, I don't know which column has what information.

Here's some sample data:

5036;MORAN;68681J;FBOP;40;872,7;OUT
5037;MORAN;68624J;GFBOP;40;872,7;OUT
5038;ITAKHOOLI;12124J;TGFOP;40;712,7;245
5039;ITAKHOOLI;12183J;GBOP;40;872,7;125
5040;ITAKHOOLI;12294J;GBOP;40;832,7;125
5041;ITAKHOOLI;12421J;GBOP;32;792,7;125
5042;NYA GOGRA;87230K;TGFOP;20;732,7;OUT

I know a few things about the data like:

  • The column with the longest strings are the company name
  • The columns with words GBOP, FBOP, etc are the category ( i have a list of them)
  • The column with the sequential numbers is the lot number
  • etc.

If I knew which column would conta开发者_如何转开发in what, writing the script would be easy but I don't. The other problem is that the order of the data can change — the one i gave above was just an example. Is there a tool/gem which would allow me to write some rules to make sense of this data and organise it?

Thanks.

(Please change the title of the post if you think it could be more appropriate)


Probably Ruby's CSV module could handle this, but it's a simple task to break things into something a bit more organized.

From some of the fields' content and some quick searches, I'd say the data has to do with tea exporting. There's no way to know what some of the fields are if your data-source doesn't specify them, so you'll have to work them out. As a guess I'd say:

  • field3 might be an inventory number.
  • field6 might be unit-price in a non-Dollar currency because of the comma, which is often used as a decimal outside of the US.
  • field7 is current inventory level.

#!/usr/bin/env ruby

data = <<END_DATA.split("\n")
5036;MORAN;68681J;FBOP;40;872,7;OUT
5037;MORAN;68624J;GFBOP;40;872,7;OUT
5038;ITAKHOOLI;12124J;TGFOP;40;712,7;245
5039;ITAKHOOLI;12183J;GBOP;40;872,7;125
5040;ITAKHOOLI;12294J;GBOP;40;832,7;125
5041;ITAKHOOLI;12421J;GBOP;32;792,7;125
5042;NYA GOGRA;87230K;TGFOP;20;732,7;OUT
END_DATA

Create an array of hashes:

records = data.inject([]){ |ary, row| 
  fields = row.split(';')
  record = Hash[ *[:id, :company_name, :field3, :category, :field5, :field6, :field7 ].zip(fields).flatten ]
  ary << record
}

require 'pp'
pp records

Which outputs something similar to what you'd get if you requested the data using a SQL lookup:

[{:id=>"5036",
  :company_name=>"MORAN",
  :field3=>"68681J",
  :category=>"FBOP",
  :field5=>"40",
  :field6=>"872,7",
  :field7=>"OUT"},
{:id=>"5037",
  :company_name=>"MORAN",
  :field3=>"68624J",
  :category=>"GFBOP",
  :field5=>"40",
  :field6=>"872,7",
  :field7=>"OUT"},
{:id=>"5038",
  :company_name=>"ITAKHOOLI",
  :field3=>"12124J",
  :category=>"TGFOP",
  :field5=>"40",
  :field6=>"712,7",
  :field7=>"245"},
{:id=>"5039",
  :company_name=>"ITAKHOOLI",
  :field3=>"12183J",
  :category=>"GBOP",
  :field5=>"40",
  :field6=>"872,7",
  :field7=>"125"},
{:id=>"5040",
  :company_name=>"ITAKHOOLI",
  :field3=>"12294J",
  :category=>"GBOP",
  :field5=>"40",
  :field6=>"832,7",
  :field7=>"125"},
{:id=>"5041",
  :company_name=>"ITAKHOOLI",
  :field3=>"12421J",
  :category=>"GBOP",
  :field5=>"32",
  :field6=>"792,7",
  :field7=>"125"},
{:id=>"5042",
  :company_name=>"NYA GOGRA",
  :field3=>"87230K",
  :category=>"TGFOP",
  :field5=>"20",
  :field6=>"732,7",
  :field7=>"OUT"}]

An alternate way to look at the data is as a hash of hashes:

records = data.inject({}){ |h, row| 
  fields = row.split(';')
  record = Hash[ *[:id, :company_name, :field3, :category, :field5, :field6, :field7 ].zip(fields).flatten ]
  h[fields.first] = record
  h
}

pp records

Which outputs

{"5036"=>
  {:id=>"5036",
  :company_name=>"MORAN",
  :field3=>"68681J",
  :category=>"FBOP",
  :field5=>"40",
  :field6=>"872,7",
  :field7=>"OUT"},
"5037"=>
  {:id=>"5037",
  :company_name=>"MORAN",
  :field3=>"68624J",
  :category=>"GFBOP",
  :field5=>"40",
  :field6=>"872,7",
  :field7=>"OUT"},
"5038"=>
  {:id=>"5038",
  :company_name=>"ITAKHOOLI",
  :field3=>"12124J",
  :category=>"TGFOP",
  :field5=>"40",
  :field6=>"712,7",
  :field7=>"245"},
"5039"=>
  {:id=>"5039",
  :company_name=>"ITAKHOOLI",
  :field3=>"12183J",
  :category=>"GBOP",
  :field5=>"40",
  :field6=>"872,7",
  :field7=>"125"},
"5040"=>
  {:id=>"5040",
  :company_name=>"ITAKHOOLI",
  :field3=>"12294J",
  :category=>"GBOP",
  :field5=>"40",
  :field6=>"832,7",
  :field7=>"125"},
"5041"=>
  {:id=>"5041",
  :company_name=>"ITAKHOOLI",
  :field3=>"12421J",
  :category=>"GBOP",
  :field5=>"32",
  :field6=>"792,7",
  :field7=>"125"},
"5042"=>
  {:id=>"5042",
  :company_name=>"NYA GOGRA",
  :field3=>"87230K",
  :category=>"TGFOP",
  :field5=>"20",
  :field6=>"732,7",
  :field7=>"OUT"}}
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号