I'm using NLTK to search for n-grams in a corpus but it's taking a very long time in some cases. I've noticed calculating n-grams isn't an uncommon feature in other packages (apparently Haystack has some functionality for it). Does this mean there's a开发者_运维百科 potentially faster way of finding n-grams in my corpus if I abandon NLTK? If so, what can I use to speed things up?
Since you didn't indicate whether you want word or character-level n-grams, I'm just going to assume the former, without loss of generality.
I also assume you start with a list of tokens, represented by strings. What you can easily do is write n-gram extraction yourself.
def ngrams(tokens, MIN_N, MAX_N):
n_tokens = len(tokens)
for i in xrange(n_tokens):
for j in xrange(i+MIN_N, min(n_tokens, i+MAX_N)+1):
yield tokens[i:j]
Then replace the yield
with the actual action you want to take on each n-gram (add it to a dict
, store it in a database, whatever) to get rid of the generator overhead.
Finally, if it's really not fast enough, convert the above to Cython and compile it. Example using a defaultdict
instead of yield
:
def ngrams(tokens, int MIN_N, int MAX_N):
cdef Py_ssize_t i, j, n_tokens
count = defaultdict(int)
join_spaces = " ".join
n_tokens = len(tokens)
for i in xrange(n_tokens):
for j in xrange(i+MIN_N, min(n_tokens, i+MAX_N)+1):
count[join_spaces(tokens[i:j])] += 1
return count
You might find a pythonic, elegant and fast ngram generation function using zip
and splat (*) operator here :
def find_ngrams(input_list, n):
return zip(*[input_list[i:] for i in range(n)])
For character-level n-grams you could use the following function
def ngrams(text, n):
n-=1
return [text[i-n:i+1] for i,char in enumerate(text)][n:]
def generate_ngrams(words, ngram=2):
return [words[i:i+ngram] for i in range(len(words)-ngram+1)]
sentence = "I really like python, it's pretty awesome."
words = sentence.split()
words
['I', 'really', 'like', 'python,', "it's", 'pretty', 'awesome.']
res = generate_ngrams(words, ngram=2)
res
[['I', 'really'],
['really', 'like'],
['like', 'python,'],
['python,', "it's"],
["it's", 'pretty'],
['pretty', 'awesome.']]
res = generate_ngrams(words, ngram=3)
res
[['I', 'really', 'like'],
['really', 'like', 'python,'],
['like', 'python,', "it's"],
['python,', "it's", 'pretty'],
["it's", 'pretty', 'awesome.']]
res = generate_ngrams(words, ngram=4)
res
[['I', 'really', 'like', 'python,'],
['really', 'like', 'python,', "it's"],
['like', 'python,', "it's", 'pretty'],
['python,', "it's", 'pretty', 'awesome.']]
精彩评论