I need my iPhone开发者_开发百科 / iPad app to be able to quickly search through about 10,000 records (about a paragraph worth of text, each), for any substring contained within the record. So if the record contains the word "Flame", querying for "lame" should match.
I'm currently using SQLite, but "LIKE %term%" searches are too slow for this many records. Enabling Full-Text Search doesn't seem like it will fully meet my needs, since SQLite only supports prefix wildcards (e.g. "Flam*", not "*lame").
I've experimented with using a giant blob of text (~350K), and doing [NSString rangeOfString:...], which I think uses a Boyer-Moore algorithm. This is faster than "LIKE %term%" searches, but still not the kind of speed I'm hoping for.
Any suggestions for approaches, or libraries that would achieve this kind of scalable substring search, and which would work on an iPhone?
Here are a number of different options. I am not aware of the bechmarks for each, so you will have to do some testing.
First is the FTS3 extension to SQLite. This should give you fast, indexed full text search: http://regularrateandrhythm.com/regular-rate-rhythm-blog/sqlite3-fts-in-IOS4.html
Then, how about regular expressions which were introduced in iOS 4:
http://developer.apple.com/library/ios/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html
For pre-iOS 4, you can use RegexKitLite:
http://regexkit.sourceforge.net/RegexKitLite/index.html
If you decide to use regular expressions, then take a look at this entry on how to optimize those:
How to speed up iPhone regular expressions with NSRegularExpression?
Perhaps consider combining your second approach with the asynchronous approach. Divide your large block of text into 5,10,whatever size and search them separately with the same number of threads. Then combine results by using a coordinate system that knows how to position the matches correctly (e.g. thread 5 searched region 5 and found a match at position 337 which correlates to document x, position y). You will find that there is a limit where adding more threads does no good so that would be the first thing to figure out.
If you can't tokenize the text (split it into words) you can't index it. That's why LIKE is a sequential search. Unless your substring can be constrained somehow (always drop the first letter or a fixed length for the substring, for instance) your text can't be stored as a list of all possible tokens and those tokens can't be indexed. The key (pun intended) is to find an algorithm that produces a small enough list of tokens that the cost of indexing them is less than the cost of a linear search.
精彩评论