I am looking for a linked list and related algorithms implementation for Python. Everyone I ask just recommends using built in Python lists, but performance measurements indicate that list insertion and removal is a bottleneck for our application. It's trivial to implement a simple linked list, but I wonder if there is a mature library which includes some operations like sort, merge, splice, search, lower/upper bound, etc...
I know this is a dupe, but searching for python list on any search engine gives predictably poor results, with most people just saying that linked lists are unneeded in python (pfft!).
PS: I need to insert and remove from anywhere in the list, not just the ends.
OK, you asked for it: I need to maintain an ordered list of several hundred thousand entries. I will iterate over the list forwards (one by one), using a visitor on each entry, starting from the beginning or a position found by a binary search. When an entry matching a predicate is found it is removed from the list, and then, another binary search is performed on a subset of the list beginning from the removed entry's previous position, until a position determined statistically beforehand. Ignoring the error condition, the modified entry may be used to create another linked list which is spliced into the new position found through the second binary search. Iteration is continued from the position where the entry was removed. On occasion, several thousand contiguous ordered entries may be added to/removed from any place in the list. Sometimes several thousand non-contiguous entries must be searched for and removed incrementally.
python's list is unnacceptable as the cost of insertion/removal is prohibitive, and the minor gains in speed for the binary search are totally irrelevant to the total cost. Our in house tests confirm this.
if I have neglected any detail perhaps I can e-mail you a copy of my company's non-disclosure agreement and I can privately corresp开发者_如何学JAVAond with you on the matter. sarcasm.end().
Here's a blog post sharing your pain. It includes an implementation of a linked list and a performance comparison.
Perhaps blist
would be better, though (from here)?
The use cases where the BList is slightly slower than Python's list are as follows (O(log n) vs. O(1)):
- A large list that never changes length.
- A large lists where inserts and deletes are only at the end of the list (LIFO).
With that disclaimer out of the way, here are some of the use cases where the BLists is dramatically faster than the built-in list:
- Insertion into or removal from a large list (O(log n) vs. O(n))
- Taking large slices of large lists (O(log n) vs O(n))
- Making shallow copies of large lists (O(1) vs. O(n))
- Changing large slices of large lists (O(log n + log k) vs. O(n + k))
- Multiplying a list to make a large, sparse list (O(log k) vs. O(kn))
Note that it's actually implemented as a B+ tree, allowing great performance for all those operations.
Python lists are O(1) for operations at the end of the list. If you'll be doing all your inserting in a semi-sequential fashion--by analogy to C, only keeping a single pointer into the middle of the list as a "cursor" of sorts--you could save yourself a lot of effort by just using two Python lists. One list for what's before the cursor, one for what's after; moving the cursor involves pulling the next item from one list and appending it to the other. That gives you arbitrary O(1) inserting at the cursor location with far less effort and wheel reinventing than making an entire new data structure, letting you reuse a lot of the existing list functions.
For the fully general case allowing multiple references into the list, though, you're probably stuck making a linked list of some sort.
Edit: You're not seriously thinking of actually doing a "binary search" on a linked list, are you? Binary search doesn't even make sense on an intrinsically sequential data structure...
Anyway, if you're okay with linear-time searching and your insertions will always preserve the list order without re-sorting, then a simple linked list might be all you need. If you do as much searching as iterating you should consider something with fast indexing, and if resorting might be required something like a tree would be better.
It's puzzling that everyone demands justification to need a linked list. Linked lists are one of the most elementary data structures for a reason: they have properties that the other major data structures lack, and if you need those properties, you need a linked list or one of its close relatives. If you don't understand why linked lists are an important data structure that can't always be replaced with a deque or a binary tree, you should never have passed your "intro to data structures" class.
Here's a quick implementation, supporting the usual stuff: constant-time insertion at any point given a reference to the node, splitting the list into two lists, and inserting a list into the middle of another list (splice). Generic Python interfaces are supported: push, pop, pushleft, popleft, extend, ordinary iteration, iteration over a slice (getiter).
I just wrote this, so it's doctested but not production tested; there are probably still bugs.
def _ref(obj):
"""
weakref.ref has a bit of braindamage: you can't take a weakref to None.
This is a major hassle and a needless limitation; work around it.
"""
from weakref import ref
if obj is None:
class NullRef(object):
def __call__(self): return None
return NullRef()
else:
return ref(obj)
class _node(object):
def __init__(self, obj):
self.obj = obj
self._next = None
self._prev = _ref(None)
def __repr__(self):
return "node(%s)" % repr(self.obj)
def __call__(self):
return self.obj
@property
def next(self):
return self._next
@property
def prev(self):
return self._prev()
# Implementation note: all "_last" and "prev" links are weakrefs, to prevent circular references.
# This is important; if we don't do this, every list will be a big circular reference. This would
# affect collection of the actual objects in the list, not just our node objects.
#
# This means that _node objects can't exist on their own; they must be part of a list, or nodes
# in the list will be collected. We also have to pay attention to references when we move nodes
# from one list to another.
class llist(object):
"""
Implements a doubly-linked list.
"""
def __init__(self, init=None):
self._first = None
self._last = _ref(None)
if init is not None:
self.extend(init)
def insert(self, item, node=None):
"""
Insert item before node. If node is None, insert at the end of the list.
Return the node created for item.
>>> l = llist()
>>> a = l.insert(1)
>>> b = l.insert(2)
>>> d = l.insert(4)
>>> l._check()
[1, 2, 4]
>>> c = l.insert(3, d)
>>> l._check()
[1, 2, 3, 4]
"""
item = _node(item)
if node is None:
if self._last() is not None:
self._last()._next = item
item._prev = _ref(self._last())
self._last = _ref(item)
if self._first is None:
self._first = item
else:
assert self._first is not None, "insertion node must be None when the list is empty"
if node._prev() is not None:
node._prev()._next = item
item._prev = node._prev
item._next = node
node._prev = _ref(item)
if node is self._first:
self._first = item
return item
def remove(self, node):
"""
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> d = l.append(4)
>>> e = l.append(5)
>>> l.remove(c) # Check removing from the middle
3
>>> l._check()
[1, 2, 4, 5]
>>> l.remove(a) # Check removing from the start
1
>>> l._check()
[2, 4, 5]
>>> l.remove(e) # Check removing from the end
5
>>> l._check()
[2, 4]
"""
if self._first is node:
self._first = node._next
if self._last() is node:
self._last = node._prev
if node._next is not None:
node._next._prev = node._prev
if node._prev() is not None:
node._prev()._next = node._next
node._next = None
node._prev = _ref(None)
return node.obj
def __nonzero__(self):
"""
A list is true if it has any elements.
>>> l = llist()
>>> bool(l)
False
>>> l = llist([1])
>>> bool(l)
True
"""
return self._first is not None
def __iter__(self):
"""
>>> l = llist([1,2,3])
>>> [i() for i in l]
[1, 2, 3]
"""
return self.getiter(self._first, self._last())
def _check(self):
if self._last() is None:
assert self._last() is None
return []
node = self._first
ret = []
while node is not None:
if node._next is None:
assert node == self._last()
if node._prev() is None:
assert node == self._first
if node._next is not None:
assert node._next._prev() == node
if node._prev() is not None:
assert node._prev()._next == node
ret.append(node.obj)
node = node._next
return ret
def getiter(self, first, last):
"""
Return an iterator over [first,last].
>>> l = llist()
>>> l.append(1)
node(1)
>>> start = l.append(2)
>>> l.extend([3,4,5,6])
>>> end = l.append(7)
>>> l.extend([8,9])
>>> [i() for i in l.getiter(start, end)]
[2, 3, 4, 5, 6, 7]
"""
class listiter(object):
def __init__(self, first, last):
self.node = first
self.final_node = last
def __iter__(self): return self
def next(self):
ret = self.node
if ret is None:
raise StopIteration
if ret is self.final_node:
self.node = None
else:
self.node = self.node._next
return ret
return listiter(first, last)
def append(self, item):
"""
Add an item to the end of the list.
>>> l = llist()
>>> l.append(1)
node(1)
>>> l.append(2)
node(2)
>>> l._check()
[1, 2]
"""
return self.insert(item, None)
def appendleft(self, item):
"""
Add an item to the beginning of the list.
>>> l = llist()
>>> l.appendleft(1)
node(1)
>>> l.appendleft(2)
node(2)
>>> l._check()
[2, 1]
"""
return self.insert(item, self._first)
def pop(self):
"""
Remove an item from the end of the list and return it.
>>> l = llist([1,2,3])
>>> l.pop()
3
>>> l.pop()
2
>>> l.pop()
1
>>> l.pop()
Traceback (most recent call last):
...
IndexError: pop from empty llist
"""
if self._last() is None:
raise IndexError, "pop from empty llist"
return self.remove(self._last())
def popleft(self):
"""
Remove an item from the beginning of the list and return it.
>>> l = llist([1,2,3])
>>> l.popleft()
1
>>> l.popleft()
2
>>> l.popleft()
3
>>> l.popleft()
Traceback (most recent call last):
...
IndexError: popleft from empty llist
"""
if self._first is None:
raise IndexError, "popleft from empty llist"
return self.remove(self._first)
def splice(self, source, node=None):
"""
Splice the contents of source into this list before node; if node is None, insert at
the end. Empty source_list. Return the first and last nodes that were moved.
# Test inserting at the beginning.
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> l2 = llist([4,5,6])
>>> l.splice(l2, a)
(node(4), node(6))
>>> l._check()
[4, 5, 6, 1, 2, 3]
>>> l2._check()
[]
# Test inserting in the middle.
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> l2 = llist([4,5,6])
>>> l.splice(l2, b)
(node(4), node(6))
>>> l._check()
[1, 4, 5, 6, 2, 3]
>>> l2._check()
[]
# Test inserting at the end.
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> l2 = llist([4,5,6])
>>> l.splice(l2, None)
(node(4), node(6))
>>> l._check()
[1, 2, 3, 4, 5, 6]
>>> l2._check()
[]
# Test inserting a list with a single item.
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> l2 = llist([4])
>>> l.splice(l2, b)
(node(4), node(4))
>>> l._check()
[1, 4, 2, 3]
>>> l2._check()
[]
"""
if source._first is None:
return
first = source._first
last = source._last()
if node is None:
if self._last() is not None:
self._last()._next = source._first
source._first._prev = self._last
self._last = source._last
if self._first is None:
self._first = source._first
else:
source._first._prev = node._prev
source._last()._next = node
if node._prev() is not None:
node._prev()._next = source._first
node._prev = source._last
if node is self._first:
self._first = source._first
source._first = None
source._last = _ref(None)
return first, last
def split(self, start, end=None):
"""
Remove all items between [node, end] and return them in a new list. If end is None,
remove until the end of the list.
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> d = l.append(4)
>>> e = l.append(5)
>>> l._check()
[1, 2, 3, 4, 5]
>>> l2 = l.split(c, e)
>>> l._check()
[1, 2]
>>> l2._check()
[3, 4, 5]
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> d = l.append(4)
>>> e = l.append(5)
>>> l2 = l.split(a, c)
>>> l._check()
[4, 5]
>>> l2._check()
[1, 2, 3]
>>> l = llist()
>>> a = l.append(1)
>>> b = l.append(2)
>>> c = l.append(3)
>>> d = l.append(4)
>>> e = l.append(5)
>>> l2 = l.split(b, d)
>>> l._check()
[1, 5]
>>> l2._check()
[2, 3, 4]
"""
if end is None:
end = self._last()
ret = llist()
# First, move the region into the new list. It's important to do this first, or
# once we remove the nodes from the old list, they'll be held only by weakrefs and
# nodes could end up being collected before we put it into the new one.
ret._first = start
ret._last = _ref(end)
# Hook our own nodes back together.
if start is self._first:
self._first = end._next
if end is self._last():
self._last = start._prev
if start._prev() is not None:
start._prev()._next = end._next
if end._next is not None:
end._next._prev = start._prev
start._prev = _ref(None)
end._next = None
return ret
def extend(self, items):
"""
>>> l = llist()
>>> l.extend([1,2,3,4,5])
>>> l._check()
[1, 2, 3, 4, 5]
"""
for item in items:
self.append(item)
if __name__ == "__main__":
import doctest
doctest.testmod()
There's a single-linked list here (recipe 17.14 in the Python Cookbook 1st ed) but it's hardly "mature" or rich -- it's just doing a FIFO queue so it's pretty minimal.
This recipe is a very concise C implementation of (read-only) Lisp-like cons-boxes -- just car, cdr and cons; again, not a rich type, rather a minimal one (and to use it for mutable data, as opposed to pure functional approaches, you'd need to add setcar and setcdr at least). It may be a better starting point for you simply because cons-cells are so notoriously flexible and familiar.
Some of the operations you require will be likely best done by existing Python primitives. For example, for sorting, it's hard to see how rolling your own sort can beat the performance of Python's sorted(linkedlist)
(as long of course as you make the linkedlist
type a Python iterable so it plays well with the rest of the language and library;-), considering the power of the timsort
algorithm implemented in the Python runtime.
More generally I suggest you carefully timeit
things every step along the way to consider how much the C-coded approach is really buying you (when compared to the trivial C-coded one exemplified by the recipe in the printed Cookbook whose URL I give at the start of this answer) -- that will crucially depend on the size and nature of your application's lists, so you're the one best placed to organize these benchmarks of course.
Python's deque
class is 0(1) for insertion and deletion at the beginning and end of the list.
" I will iterate over the list forwards (one by one), using a visitor on each entry, starting from the beginning or a position found by a binary search. When an entry matching a predicate is found it is removed from the list, and then, another binary search is performed on a subset of the list beginning from the removed entry's previous position"
It sounds like a linked list is absolutely the wrong data structure for this - doing a binary search will require random access to the list, which will mean repeatedly iterating through the elements. This is likely to be slower on a linked list than inserting and deleting items in a python list.
It sounds like the data structure you want is a skip list. Google throws up several implementations, but I cannot comment on their completeness or quality.
edit:
Another data structure that may be suitable is a threaded binary tree. this is like a regular binary tree but each leaf node points to next/previous subtree, so it can be iterated through as efficiently as a linked list. Implementing it in Python is left as an exercise for the reader (or Google).
For large data, keep a sorted list is a trick. Don't insert but append new items at the end and then sort it. Don't delet item but replace with a special value, sort them to the end and then pop out. For finding, a sorted list also has very quick performance with bisection method. As for small data, iterate old list, filter and build a new one, like list comprehensions method, is always the fast way.
For me, what is large data? it should be over 1000000 items...
Here's an idea, that would require a little coding, but may give you hugely better performance. It may or may not be suitable for your use case.
You can splice a new list into your list by replacing a single element. To insert the list [6, 7, 8] into [1, 2, 3, 4, 5] at index 2, you would end up with
[1, 2, [3, 6, 7, 8], 4, 5]
By not changing the length of the large (here 5 element) list, you'll not have the speed problems you're currently having.
You can 'delete' an element from the list in the same way, by replacing it with an empty list.
[1, 2, [], 4, 5]
To iterate over this mixed list is straightforward.
def IterateNestedList(xs):
for x in xs:
if isinstance(x, list):
for y in IterateNestedList(x): yield y
else: yield x
How about using any of data structures which provide sorted data access? Binary (AVL trees, AVL, Red-black), for instance? These guarantee O(log(N)) insertion complexity. Not O(1), but better than what you have.
I recently had the need for a circular and doubly linked list. Since I am very familiar with the Linux kernel's linked list. I wrote a copycat linked list in Python. It provides O(1) random insertion and deletion. It is much faster than Python's list when you are doing random insertion and deletion on a big list. The code is here: https://github.com/ningke/Pylnklist. I also wrote a little bit of introduction here: http://710003.blogspot.com/2012/06/copycat-linked-list-in-python.html
Looks like there is an external library and we are just a pip install llist
away.
The documentation states:
All data types defined in this module support efficient O(1) insertion and removal of elements (except removal in sllist which is O(n)). Random access to elements using index is O(n).
In my opinion, the removal of elements from dllist should take O(1).
精彩评论