I got interested in this small example of an algorithm in Python for looping through a large word list. I am writing a few "tools" that will allow my to slice a Objective-C string or array in a similar fashion as Python.
Specifically, this elegant solution caught my attention for executing very quickly and it uses a string slice as a key element of the algorithm. Try and solve this without a slice!
I have reproduced my local version using the Moby word list below. You can use /usr/share/dict/words
if you do not feel like downloading Moby. The source is just a large dictionary-like list of unique words.
#!/usr/bin/env python
count=0
words = set(line.strip() for line in
open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl"))
for w in words:
even, odd = w[::2], w[1::2]
if even in words and odd in words:
count+=1
print count
This script will a) be interpreted by Python; b) read the 4.1 MB, 354,983 word Moby dictionary file; c) strip the lines; d) place the lines into a set, and; e) and find all the combinations where the evens and the odds of a given word are also words. This executes in about 0.73 seconds on a MacBook Pro.
I tried to rewrite the same program in Objective-C. I am a beginner at this language, so go easy please, but please do point out the errors.
#import <Foundation/Foundation.h>
NSString *sliceString(NSString *inString, NSUInteger start, NSUInteger stop,
NSUInteger step){
NSUInteger strLength = [inString length];
if(stop > strLength) {
stop = strLength;
}
if(start > strLength) {
start = strLength;
}
NSUInteger capacity = (stop-start)/step;
NSMutableString *rtr=[NSMutableString stringWithCapacity:capacity];
for(NSUInteger i=start; i < stop; i+=step){
[rtr appendFormat:@"%c",[inString characterAtIndex:i]];
}
return rtr;
}
NSSet * getDictWords(NSString *path){
NSError *error = nil;
NSString *words = [[NSString alloc] initWithContentsOfFil开发者_JAVA技巧e:path
encoding:NSUTF8StringEncoding error:&error];
NSCharacterSet *sep=[NSCharacterSet newlineCharacterSet];
NSPredicate *noEmptyStrings =
[NSPredicate predicateWithFormat:@"SELF != ''"];
if (words == nil) {
// deal with error ...
}
// ...
NSArray *temp=[words componentsSeparatedByCharactersInSet:sep];
NSArray *lines =
[temp filteredArrayUsingPredicate:noEmptyStrings];
NSSet *rtr=[NSSet setWithArray:lines];
NSLog(@"lines: %lul, word set: %lul",[lines count],[rtr count]);
[words release];
return rtr;
}
int main (int argc, const char * argv[])
{
NSAutoreleasePool * pool = [[NSAutoreleasePool alloc] init];
int count=0;
NSSet *dict =
getDictWords(@"/Users/andrew/Downloads/Moby/mwords/354984si.ngl");
NSLog(@"Start");
for(NSString *element in dict){
NSString *odd_char=sliceString(element, 1,[element length], 2);
NSString *even_char=sliceString(element, 0, [element length], 2);
if([dict member:even_char] && [dict member:odd_char]){
count++;
}
}
NSLog(@"count=%i",count);
[pool drain];
return 0;
}
The Objective-C version produces the same result, (13,341 words), but takes almost 3 seconds to do it. I must be doing something atrociously wrong for a compiled language to be more than 3X slower than a scripted language, but I'll be darned if I can see why.
The basic algorithm is the same: read the lines, strip them, and put them in a set.
My guess of what is slow is the processing of the NSString elements, but I do not know an alternative.
Edit
I edited the Python to be this:
#!/usr/bin/env python
import codecs
count=0
words = set(line.strip() for line in
codecs.open("/Users/andrew/Downloads/Moby/mwords/354984si.ngl",
encoding='utf-8'))
for w in words:
if w[::2] in words and w[1::2] in words:
count+=1
print count
For the utf-8 to be on the same plane as the utf-8 NSString. This slowed the Python down to 1.9 secs.
I also switch the slice test to short-circuit type as suggested for both the Python and obj-c version. Now they are close to the same speed. I also tried using C arrays rather than NSStrings, and this was much faster, but not as easy. You also loose utf-8 support doing that.
Python is really cool...
Edit 2
I found a bottleneck that sped things up considerably. Instead of using the [rtr appendFormat:@"%c",[inString characterAtIndex:i]];
method to append a character to the return string, I used this:
for(NSUInteger i=start; i < stop; i+=step){
buf[0]=[inString characterAtIndex:i];
[rtr appendString:[NSString stringWithCharacters:buf length:1]];
}
Now I can finally claim that the Objective-C version is faster than the Python version -- but not by much.
Keep in mind that the Python version has been written to move a lot of the heavy lifting down into highly optimised C code when executed on CPython (especially the file input buffering, string slicing and the hash table lookups to check whether even
and odd
are in words
).
That said, you seem to be decoding the file as UTF-8 in your Objective-C code, but leaving the file in binary in your Python code. Using Unicode NSString in the Objective-C version, but 8-bit byte strings in the Python version isn't really a fair comparison - I would expect the performance of the Python version to drop noticeably if you used codecs.open()
to open the file with a declared encoding of "utf-8"
.
You're also making a full second pass to strip the empty lines in your Objective-C, while no such step is present in the Python code.
In BOTH codes you build the even and odd slices then test them against the words. It would be better if you built the odd slice only after the even slice succeeds.
Current Python code:
even, odd = w[::2], w[1::2]
if even in words and odd in words:
Better:
# even, odd = w[::2], w[1::2]
if w[::2] in words and w[1::2] in words:
By the way, one metric you didn't mention: How long did it take you to WRITE each code?
http://developer.apple.com/library/mac/#documentation/Cocoa/Reference/Foundation/Classes/NSSet_Class/Reference/Reference.html suggests that you may want to substitute NSSet with CFSet which may improve the performances.
I could not find with a quick Google search the implementation used for NSSet / CFSet : if they use a Tree based implementation (same as stdc++ set type), then lookup and check are O(log(N)) whereas Python's set lookup O(1), this could account for the speed difference.
[edit] ncoghlan pointed in a remark below that the sets in objective C use a hash table so you get a constant time lookup also. So this boils down to how efficient the implementation of sets is in Python vs in Objective C. As others pointed out, Python's sets and dictionaries are heavily optimized, esp. when strings are used as keys (the Python dictionaries are used to implement the namespaces and need to be very fast).
Your python code is running mainly in build in data structures, which are implemented in C. Python contains incredibly sophisticated optimizations for those data types. Look for talks of Raymond Hettinger for more details. It is mainly about very efficient hashing of objects, use of those hashes for lookup, special memory allocation strategies, ...
I had implemented a network search in python just for testing and we were never able to speed it up in C++, C# or a similar language. Not being beginners in C++ or C#! ;-)
First of all, CPython's library is written in C, and is highly optimized so it is not surprising that a program which leverages the library runs faster than unoptimized Objective C.
The result would be different if you translated the Objective C program line by line into Python.
I suspect that much of the result comes from the fact that the counter is not incremented very often and that it is very efficient for Python to decide that an object is NOT in the set. After all, if you take every second letter of a word, it seems rather unlikely that you will end up with a valid English word, let alone one in the same source text.
精彩评论