I need to filter a stream of text articles by checking every entry for fuzzy matches of predefined strings (I am searching for misspelled product names; sometimes they have a different word order and extra non-letter characters like ":" or ",").
I get excellent results by putting these articles in a Sphinx index and searching on it, but unfortunately I receive hundreds of articles every second, and rebuilding the index after every new article is too slow (and I understand it is not designed for such a task). I need a library that can build an in-memory index of a small (~100 KB) text and perform fuzzy search on it. Does anything like this exist?
This problem is almost identical to Bayesian spam filtering, and the tools already written for that can simply be trained to recognize matches according to your criteria.
Added in response to a comment:
So how are you partitioning the stream into bins now? If you already have a corpus of separated articles, just feed that into the classifier. Bayesian classifiers are the way to do fuzzy content matching in context and can classify everything from spam to nucleotides to astronomical spectral categories.
You could use less stochastic methods (e.g. Levenshtein distance), but at some point you have to describe the difference between hits and misses. The beauty of Bayesian methods, especially if you already have a segregated corpus in hand, is that you don't need to explicitly spell out how you are classifying; see the sketch below.
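A minimal sketch of that idea, assuming scikit-learn is available (the training texts, labels, and the is_hit helper are illustrative placeholders, not part of the original answer). Character n-grams are used so misspelled or reordered product names still count toward a match:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical hand-labelled corpus: 1 = mentions a product, 0 = unrelated.
train_texts = ["Acme Widget Pro 3000 now on sale", "weather report for Tuesday"]
train_labels = [1, 0]

# Character n-grams (3-5 chars) tolerate typos like "widgt pro,3000".
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5))
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

def is_hit(article: str) -> bool:
    # Classify one incoming article from the stream.
    return clf.predict(vectorizer.transform([article]))[0] == 1

With only two training documents this is just a toy; in practice you would feed it the separated corpus mentioned above.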
How about using the SQLite FTS3 extension?
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);
(You may create any number of columns -- all of them will be indexed)
After that you insert whatever you like and can search it without rebuilding the index, matching either a specific column or the whole row.
(http://www.sqlite.org/fts3.html)
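A minimal sketch in Python, assuming your SQLite build includes FTS3 (the table name and sample strings are illustrative):

import sqlite3

conn = sqlite3.connect(":memory:")  # keep the whole index in memory
conn.execute("CREATE VIRTUAL TABLE articles USING fts3(content TEXT)")

# Insert articles as they arrive; the FTS index is updated incrementally.
conn.execute("INSERT INTO articles(content) VALUES (?)",
             ("Acme Widget, Pro: 3000 now shipping",))

# Full-text match against the column (or the whole row via "articles MATCH").
hits = conn.execute("SELECT content FROM articles WHERE content MATCH ?",
                    ("widget pro",)).fetchall()
print(hits)

Note that FTS matches whole tokens rather than doing true fuzzy matching, so heavily misspelled names may still need a prefix query (e.g. "widg*") or a separate fuzzy step.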