开发者

How can I build an incremental directed acyclic word graph to store and search strings?

开发者 https://www.devze.com 2022-12-20 17:01 出处:网络
I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.

I am trying to store a large list of strings in a concise manner so that they can be very quickly analyzed/searched through.

A directed acyclic word graph (DAWG) suits this purpose wonderfully. However, I do not have a list of the strings to include in the first place, so it must be incrementally buildable. Additionally, when I search through it for a string, I need to bring back data associated with the result (not just a boolean saying if it was present).

I have found information on a modification of the DAWG for string data tracking here: http://www.pathcom.com/~vadco/adtdawg.html It looks extremely, extremely complex and I am not sure I am capable of writing it.

I have also found a few research papers describing incremental building algorithms, though I've found that research papers in general are not ver开发者_运维知识库y helpful.

I don't think I am advanced enough to be able to combine both of these algorithms myself. Is there documentation of an algorithm already that features these, or an alternative algorithm with good memory use & speed?


I wrote the ADTDAWG web page. Adding words after construction is not an option. The structure is nothing more than 4 arrays of unsigned integer types. It was designed to be immutable for total CPU cache inclusion, and minimal multi-thread access complexity.

The structure is an automaton that forms a minimal and perfect hash function. It was built for speed while traversing recursively using an explicit stack.

As published, it supports up to 18 characters. Including all 26 English chars will require further augmentation.

My advice is to use a standard Trie, with an array index stored in each node. Ya, it is going to seem infantile, but each END_OF_WORD node represents only one word. The ADTDAWG is a solution to each END_OF_WORD node in a traditional DAWG representing many, many words.

Minimal and perfect hash tables are not the sort of thing that you can just put together on the fly.

I am looking for something else to work on, or a job, so contact me, and I'll do what I can. For now, all I can say is that it is unrealistic to use heavy optimization on a structure that is subject to being changed frequently.


Java

For graph problems which require persistence, I'd take a look at the Neo4j graph DB project. Neo4j is designed to store large graphs and allow incremental building and modification of the data, which seems to meet the criteria you describe.

They have some good examples to get you going quickly and there's usually example code to get you started with most problems.

They have a DAG example with a link at the bottom to the full source code.

C++

If you're using C++, a common solution to graph building/analysis is to use the Boost graph library. To persist your graph you could maintain a file based version of the graph in GraphML (for example) and read and write to that file as your graph changes.


You may also want to look at a trie structure for this (potentially building a radix-tree). It seems like a decent 'simple' alternative structure.

I'm suggesting this for a few reasons:

  1. I really don't have a full understanding of your result.
  2. Definitely incremental to build.
  3. Leaf nodes can contain any data you wish.
  4. Subjectively, a simple algorithm.
0

精彩评论

暂无评论...
验证码 换一张
取 消