Analyze text (lemmatization, edit distance)_问答_开发者

Analyze text (lemmatization, edit distance)

开发者 https://www.devze.com 2023-02-21 13:16 出处：网络

I need to analyze the text to exist in it banned words. Suppose the black list is the word: \"Forbid\". The word has many forms. In the text the word can be, for example: \"forbidding\", \"forbidden\"

I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?

What about typos?

For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?

And what if the text is written as follows:

"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, witho开发者_开发技巧ut whitespace)

How to solve this problem?

Preferably fast algorithm, because text are processed in real time.

And maybe what some tips to improve performance (how to store, etc)?

there're two possible solutions as far as I know algorithms.

You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

So now you could compare benefits of these two and yours suggestion.

You could also use RegEx Matches to check for words. http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx