String recurring subsequences and compression

I'd like to do some kind of "search and replace" algorithm which will, in an efficient manner if possible, identify a substring of a string which occurs more than once and replace all occurrences of that substring with a token.

For example, given the string "AbcAdAefgAbijkAblmnAbAb", notice that "A" recurs, so in pass one reduce it to "#1bc#1d#1efg#1bijk#1blmn#1b#1b", where #_ is an indexed pattern (we record the patterns in an indexed table). Then notice that "#1b" recurs, so reduce to "#2c#1d#1efg#2ijk#2lmn#2#2". No more patterns recur in the string, so we're done.
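To make the passes concrete, here is a rough sketch (in Python) of the kind of thing I have in mind. The names are arbitrary, and picking the longest repeated substring on each pass is just one possible selection rule; my example above tokenizes "A" first, so its trace comes out differently.

    # Rough sketch only: per pass, find a repeated substring and replace
    # every occurrence with a numbered token.  Roughly O(n^3) per pass,
    # so fine for toy strings, far too slow for a genome.
    def tokenize_repeats(s):
        table = {}                     # token -> substring it stands for
        index = 0
        while True:
            best = None
            n = len(s)
            # Look for the longest substring (length >= 2, at most n // 2,
            # ignoring overlapping repeats) that occurs at least twice.
            for length in range(n // 2, 1, -1):
                seen = set()
                for i in range(n - length + 1):
                    sub = s[i:i + length]
                    if sub in table:   # don't re-wrap an existing token
                        continue
                    if sub in seen:
                        best = sub
                        break
                    seen.add(sub)
                if best:
                    break
            if best is None:           # nothing recurs any more: done
                break
            index += 1
            token = "#%d" % index
            table[token] = best
            s = s.replace(best, token)
        return s, table

    # Caveat: the tokens live in the same character space as the text, so
    # input that already contains '#' or digits can collide with them; a
    # real version would work on a list of symbols rather than a flat string.
    # tokenize_repeats("AbcAdAefgAbijkAblmnAbAb")
    # -> ('#1cAdAefg#1ijk#1lmn#1#1', {'#1': 'Ab'})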

I have found some information on "longest common subsequences" and compression algorithms, but nothing that seems to do this. They are either for comparing two strings or for producing some kind of storage-optimal result.

My objective, on the other hand, is to reduce the genome to its "words" instead of its "letters"; i.e., instead of gatcatcgatc I want to see 2c1c2c. I could run some regex afterwards to find things like "#42*#42"; it would be cool to see recurring brackets in DNA.

If I could just find this online I would skip writing it myself, but I can't find this question answered anywhere in terms I could uncover. Many thanks to anyone who can point me in the right direction.


Byte pair encoding does something pretty close to what you want. Rather than searching directly for the longest repeated string (top-down), each pass of byte pair encoding searches for repeated byte pairs (bottom-up). But eventually it discovers the longest repeated string(*).

  • gatcatcgatc
  • 1=at → g1c1cg1c
  • 2=atc → g22g2
  • 3=gatc → 323

As you can see, it has found the longest repeated string "gatc".

(*) Byte pair encoding either eventually finds the longest repeated string, or else it stops early after making (2^8 - uniquechars(source)) substitutions. I suspect it may be possible to tweak byte pair encoding so that the early-stop condition is relaxed a little -- perhaps (2^9 - uniquechars(source)), or 2^12, or 2^16. Even if that hurts compression performance, perhaps it will give interesting results for applications like yours.
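For illustration, here is a rough sketch of the pair-replacement loop in Python. It is not the classic byte-oriented implementation: it keeps a list of string symbols rather than bytes, so the number of substitutions is not capped at 2^8, and the function and variable names are just placeholders.

    from collections import Counter

    def bpe_passes(text):
        # Repeatedly replace the most frequent adjacent pair of symbols
        # with a fresh numbered symbol, until no pair occurs twice.
        # Returns the final symbol sequence and a table mapping each new
        # symbol to the original substring it expands to.
        symbols = list(text)
        expansions = {}                      # new symbol -> expanded substring
        next_id = 1
        while True:
            pairs = Counter(zip(symbols, symbols[1:]))
            if not pairs:
                break
            (a, b), count = pairs.most_common(1)[0]
            if count < 2:
                break
            new_sym = str(next_id)
            next_id += 1
            expansions[new_sym] = expansions.get(a, a) + expansions.get(b, b)
            # Rewrite the sequence, merging every occurrence of (a, b).
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                    merged.append(new_sym)
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return symbols, expansions

    # Reproduces the trace above:
    # bpe_passes("gatcatcgatc") -> (['3', '2', '3'],
    #                               {'1': 'at', '2': 'atc', '3': 'gatc'})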

  • Wikipedia: byte pair encoding
  • Stack Overflow: optimizing byte-pair encoding