Tool to extract data structures from unclean data

Developer https://www.devze.com 2023-02-20 14:51 Source: Internet
I have unstructured, generally unclean data in a database field. There are common structures that are consistent in the data, namely:

field:

name:value 

fieldset: 

nombre <FieldSet>
field,
  .
  .
  .
field(n)

table

nombre <table>
head(1)... head(n)
val(1)...  val(n)
      .
      .
      .

I was wondering if there is a tool (preferably in Java) that could learn or understand these data structures, parse the file, and convert it to a Map or object that I could run validation checks on.

I am aware of ANTLR, but I understand it is more geared towards tree construction and not independent bits of data (am I wrong about this?)

Does anyone have any suggestions for the problem as a whole?


I recommend Talend. It is a very versatile, open source data integration tool based on Java. You can use its built-in tools and components to extract data from unstructured data sources, and you can also write complex custom Java code to do what you want.

I used Talend in a couple of my scientific proof-of-concept projects, and it worked for me. The good part is that it is free!


We ended up using ANTLR for this. It required us to write multiple lexers, where one lexer would manipulate the input for the next lexer.
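To give a rough idea of the "one stage cleans up the input for the next" approach, here is a minimal sketch in plain Java. It is not ANTLR and not our actual grammar: the class name `TwoPassFieldParser`, the method names, and the sample input are all made up for illustration, and it only handles the simple `name:value` field case from the question.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a two-pass pipeline; not ANTLR code.
class TwoPassFieldParser {

    // Pass 1: normalize the raw text. Collapse runs of whitespace,
    // trim each line, and drop lines that end up empty.
    static List<String> normalize(String raw) {
        List<String> lines = new ArrayList<>();
        for (String line : raw.split("\\R")) {
            String cleaned = line.replaceAll("\\s+", " ").trim();
            if (!cleaned.isEmpty()) {
                lines.add(cleaned);
            }
        }
        return lines;
    }

    // Pass 2: keep only lines shaped like name:value and put them in a Map.
    // Lines without a colon, or with an empty name, are silently skipped.
    static Map<String, String> parseFields(List<String> lines) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : lines) {
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // no colon, or nothing before it
            }
            String name = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (!name.isEmpty()) {
                fields.put(name, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample input: two fields with a junk line in between.
        String raw = "  name :  Alice \n\n###junk###\nage:42\n";
        Map<String, String> fields = parseFields(normalize(raw));
        System.out.println(fields); // prints {name=Alice, age=42}
    }
}
```

The resulting Map can then be handed to whatever validation checks you need. Fieldsets and tables would need extra passes (or a real grammar), which is where a lexer/parser generator starts to pay off.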

Another project is PADS, which is written in C.


You should use "bnflite": https://github.com/r35382/bnflite. With this template library, you develop a BNF-like grammar for your text by means of classes and overloaded operators directly in C++ code. The benefit is that such a grammar is easily adjustable to your source.
