Optimized searching in Python against a list_问答_开发者

Problem:

Given a list of n obj开发者_如何学编程ects (n's Order of magnitude is 10^5), search for a given item very fast with a minimum of spacetime tradeoff. Current, unoptimized & prototype-y solution takes too long and consumes too much RAM (the optimization is not premature, that is).

There is not a primary key to sort against in the object, but it can be sorted to a certain degree, such as the following example, where the first column is sorted.

o1 => f, g, h
o2 => f, g, i
o3 => f, j, k
o4 => k, j, m

To date, the solution has been nested filters:

filter(test1, filter(test2, filter(test3, the_list)))

But that has been slow, since it involves n * (n - 1) * (n - 2) operations, which approximates to O(n^3) speed, and at least n*2 extra lists of references.

As a note, it would be vastly preferably to have an in-place search.

I haven't found a standard library for handling this. What is the typical solution to this problem?

filter(test1, filter(test2, filter(test3, the_list)))

Firstly, this is O(n) time, not O(n^3) time. The time adds not multiply. The only this could be worse then that is if test3/test2/test1 are doing something odd, in which we should look at those.

If we suggest that each test? function takes 10 ms, then we have 10*3*10^5 ms = 50 minutes. If it was n^3, then we'd have (10*10^5)^3 = 31 million years. I'm pretty sure you are only one linear time, you just have a ton of data.

Replace filter with itertools.ifilter, it'll avoid generating the list. Instead, python will pull one item out of the list at a time, pass it through the three tests and give it to you if and only if it passes. It'll avoid the memory requirement and probably be faster as well.

You aren't going to be able to improve on O(n) time unless you use some indexing techniques. However, the applicability of indexing techniques depends on what you are doing inside the test1/test2/test3 functions. If you want help on that, show an example for those functions.

As other have noted, database were designed to solve these problems. You can make this faster only be reimplementing badly what databases already do for you.

Concatenate the attribute values for each object to make unique keys. You may have to pad the attributes out to the same length to guarantee uniqueness. Construct a hash table to return the object that matches a key.

10^5 is not really that big a number of objects, even in-memory. littletable is a little module I wrote as an experiment for simulating queries, pivots, etc. using just Python dicts. One nice thing about littletable queries is that the result of any query or join is itself a new littletable Table. Indexes are kept as dicts of keys->table objects, and index keys can be defined to be unique or not.

I created a table of 140K objects with 3 single letter keys, and then queried for a specific key. The time to build the table itself was the longest, the indexing and querying pretty fast.

from itertools import product
from littletable import Table,DataObject

objects = Table()
alphas = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
alphas += alphas.lower()
import time

print "building table", time.time()
objects.insert_many(
    DataObject(k1=k1, k2=k2, k3=k3, created=time.time())
        for k1,k2,k3 in product(alphas.upper(),alphas,alphas)
    )
print "table complete", time.time()
print len(objects)

print "indexing table", time.time()
for k in "k1 k2 k3".split():
    objects.create_index(k)
print "index complete", time.time()

print "get specific row", time.time()
matches = objects.query(k1="X", k2="k", k3="W")
for o in matches:
    print o
print time.time()

Prints:

building table 1309377011.63
table complete 1309377012.52
140608
indexing table 1309377012.52
index complete 1309377012.98
get specific row 1309377012.98
{'k3': 'W', 'k2': 'k', 'k1': 'X', 'created': 1309377011.9960001}
{'k3': 'W', 'k2': 'k', 'k1': 'X', 'created': 1309377012.4260001}
1309377013.0

It seems to me one typical solution would be to use a database query. Either SQL (raw or with some kind of ORM), or some kind of object database, maybe MongoDB?

If your data is in a CSV file, you could try sql2csv: https://sourceforge.net/projects/sql2csv/.

EDIT: Pardon my early-onset senility, I meant this project: https://github.com/ccoffey/sql4csv/wiki/Examples.