开发者

Is it possible to calucate the edit distance between a regexp and a string?

开发者 https://www.devze.com 2023-01-21 01:20 出处:网络
If so, please explain how. Re: what is distance -- \"The distance between two strings is defined as the minimal number of edits required to convert one into the other.\"

If so, please explain how.

Re: what is distance -- "The distance between two strings is defined as the minimal number of edits required to convert one into the other."

For example, xyz to XYZ would take 3 edits, so the string xYZ is closer to XYZ and xyz.开发者_如何转开发

If the pattern is [0-9]{3} or for instance 123, then a23 would be closer to the pattern than ab3.

How can you find the shortest distance between a regexp and a non-matching string?

The above is the Damerau–Levenshtein distance algorithm.


You can use Finite State Machines to do this efficiently (that is, linear in time). If you use a transducer, you can even write the specification of the transformation fairly compactly and do far more nuanced transformations than simply inserts or deletes - see wikipedia for Finite State Transducer as a starting point, and software such as the FSA toolkit or FSA6 (which has a not entirely stable web-demo) too. There are lots of libraries for FSA manipulation; I don't want to suggest the previous two are your only or best options, just two I've heard of.

If, however, you merely want the efficient, approximate searching, a less flexibly but already-implemented-for-you option exists: TRE, which has an approximate matching function that returns the cost of the match - i.e., the distance to the match, from your perspective.


If you mean the string with the smallest levenshtein distance between the closest matched string and a sample, then I'm pretty sure it can be done, but you'd have to convert the Regex to a DFA yourself, then try to match and whenever something fails, non-deterministically continue as if it had passed and keep track of the number differences. you could use A* search or something similar for this, it would be quite inefficient though (O(2^n) worst case)

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号