开发者

Most popular substrings

开发者 https://www.devze.com 2023-01-19 20:21 出处:网络
I\'m trying to parse a large number of short strings into some logical parts. It seems like an interesting problem that someone could\'ve already solved, but I cannot find any papers / solutions (or m

I'm trying to parse a large number of short strings into some logical parts. It seems like an interesting problem that someone could've already solved, but I cannot find any papers / solutions (or maybe I'm trying wrong keywords).

The strings are have 2-5 parts. If I substitute each word for a letter saying which "part" / "section" it belongs to, here would be a sample of them:

AAABB
AABBBBCC
AABBBBDD
AAACCDD
...

Most of the "sections" are only 2-3 words long and there's ~100-500 occurrences of the exact same section in ~10k strings. That means, there's AAA == "some text here" in 100 strings and AAA == "some other text" in other 100. In one string there can be only one section of each type (and they usually go in order). There is no limited set of values for any section and new values might appear in the future.

The problem is: how do I detect such sections if I ha开发者_运维问答ve enough samples and don't want to mark them manually? This can be supervised / confirmed, not fully automatic, so a probability list is ok.

I was thinking about simply making a list of 2-5 long word n-grams and finding the probability, but that doesn't take order into account (which could be helpful). It will also detect that some text is common, but if I have some specific 2 sections with same values used often, this method will not work nicely. Let's say I have only strings that consist of ABCD with the same values in every line:

ABC
ABD
ACD

Doing only ngram analysis, I'll high probability of A being a section, as well as for AB, C and D. I'd like to eliminate AB from the results in this case, but in a way that doesn't assign own section to words like "the" and eliminate all larger sections which happen to contain "the".

Are there any known solutions for similar problems?


The Lempel-Ziv-Welch algorithm is very efficient at identifying common substrings, but it doesn't try to rank them. It also doesn't pay attention to word or line boundaries. It still might be possible to use it as a starting point to get what you need.

0

精彩评论

暂无评论...
验证码 换一张
取 消