
Python groupby behaviour?

Developer https://www.devze.com 2023-03-10 21:46 Source: Internet
>>> from itertools import groupby
>>> keyfunc = lambda x: x > 500
>>> obj = dict(groupby(range(1000), keyfunc))
>>> list(obj[True])
[999]
>>> list(obj[False])
[]

range(1000) is obviously already sorted with respect to the condition (x > 500).

I was expecting the numbers from 0 to 999 to be grouped in a dict by the condition (x > 500). But the resulting dictionary had only 999.

Where are the other numbers? Can anyone explain what is happening here?


From the docs:

The returned group is itself an iterator that shares the underlying iterable with groupby(). Because the source is shared, when the groupby() object is advanced, the previous group is no longer visible. So, if that data is needed later, it should be stored as a list[.]

You are storing the group iterators in obj and only materializing them later, after groupby has already moved on. Materialize each group right away instead:

In [21]: dict((k, list(g)) for k, g in groupby(range(10), lambda x : x > 5))
Out[21]: {False: [0, 1, 2, 3, 4, 5], True: [6, 7, 8, 9]}


The groupby iterator yields tuples of the outcome of the grouping function and a new iterator that is tied to the same "outer" iterator that groupby itself is working on. When you apply dict() to the iterator returned by groupby without consuming this "inner" iterator, groupby has to advance the "outer" iterator for you. You have to realize that groupby does not act on a sequence; it turns any sequence you pass into an iterator for you.
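The sharing of the "outer" iterator can be seen in a small sketch. The names source and pairs are made up for this illustration; the point is only that walking groupby to the end, without reading any of the groups, drains the one iterator they all depend on.

```python
from itertools import groupby

# Hypothetical names for illustration: an explicit iterator plays the "outer" iterator.
source = iter(range(10))

# Walk groupby to the end without reading any of the "inner" group iterators.
pairs = list(groupby(source, lambda x: x > 5))
keys = [key for key, _group in pairs]

print(keys)          # the two outcomes of the grouping function
print(list(source))  # nothing is left: groupby already consumed the shared iterator
```

Both keys show up, but the values behind them have already been pulled out of source by groupby itself.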

Perhaps this is better explained with some metaphors and handwaving. Please follow along as we form a bucket line.

Imagine an iterator as a person drawing water in buckets from a well. He has an unlimited number of buckets to use, but the well may be finite. Every time you ask this person for a bucket of water, he'll draw a new bucket from the well and pass it to you.

In the groupby case, you insert another person into your budding bucket chain. This person doesn't pass buckets on directly at all. Every time you ask him for a bucket, he instead hands you the outcome of the instructions you gave him plus another person. That person will then pass on buckets, fetched via the groupby person, to whoever is asking, as long as they produce the same outcome for the instructions. The groupby bucket passer will stop passing these buckets if the outcome of the instructions changes. So the well gives buckets to groupby, who passes them on to a per-group person: group A, group B, and so on.

In your example, the water is numbered, but there can only be 1000 buckets drawn from the well. Here is what happens when you then pass the groupby person to the dict() call:

  1. Your dict() call asks groupby for a bucket. Now, groupby asks for one bucket from the person at the well and remembers the outcome of the instructions given, holding on to the bucket. To dict() he'll pass the outcome of the instructions (False) plus a new person, group A. The outcome is stored as the key, and the group A person, who wants to pull buckets, is stored as the value. This person is not yet asking for buckets, however, because no-one is asking him to.

  2. Your dict() call asks groupby for another bucket. groupby has his instructions, and goes looking for the next bucket where the outcome changes. He was still holding on to the first bucket, but no-one asked for it, so he throws this bucket away. Instead, he asks for the next bucket from the well and applies his instructions. The outcome is the same as before, so he throws this new bucket away too! More water goes over the floor, and so go the next 499 buckets. Only when the bucket with number 501 is passed does the outcome change, so now groupby finds another person to give instructions to (person group B), and passes him on to dict() together with the new outcome, True.

  3. Your dict() call stores True as a key, and person group B as the value. group B does nothing, no-one is asking it for water.

  4. Your dict() asks for another bucket. groupby spills more water, until it holds bucket with the number 999, and the person at the well shrugs his shoulders and states that now the well is empty. groupby tells dict() the well is empty, no more buckets are coming, could he please stop asking. It still holds the bucket with number 999, because it never has to make space for the next bucket from the well.

  5. Now you come along, asking dict() for the thing associated with the key True, which is person group B. You pass group B to list(), which will therefore ask group B for all the buckets group B can get. group B goes back to groupby, who holds one bucket only, the bucket with number 999, and the outcome of the instructions for this bucket matches what group B is looking for. So group B gives this one bucket to list(), then shrugs his shoulders because there are no more buckets, because groupby told him so.

  6. You then ask dict() for the person associated with the key False, which is person group A. By now, groupby has nothing to give any more, the well is dry and he's standing in a puddle of 999 buckets of water with numbers floating around. Your second list() gets nothing.
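The steps above can be traced with a rough sketch. The well generator and the drawn list are hypothetical helpers added for this illustration; they only record which buckets were handed out. Depending on the Python version, group B may come back with just the one bucket groupby was still holding, or with nothing at all, so the sketch does not rely on either.

```python
from itertools import groupby

drawn = []  # every bucket the well hands out is recorded here

def well(n):
    """Hypothetical 'person at the well': yields numbered buckets and logs each one."""
    for bucket in range(n):
        drawn.append(bucket)
        yield bucket

obj = dict(groupby(well(1000), lambda x: x > 500))

buckets_drawn = len(drawn)   # dict() alone has already emptied the well
group_a = list(obj[False])   # group A gets nothing
group_b = list(obj[True])    # group B gets at most the one bucket groupby still holds

print(buckets_drawn, group_a, group_b)
```

All 1000 buckets are drawn before either group is ever asked for water, which is why the lists come back (almost) empty.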

The moral of this story? Immediately ask for all buckets of water when talking to groupby, because he'll spill them all if you do not! Iterators are like the brooms in Fantasia, diligently moving water without understanding, and you had better hope you run out of water if you do not know how to control them.

Here is code that would do what you expect (with a little bit less water to prevent flooding):

>>> from itertools import groupby
>>> keyfunc = lambda x: x > 5
>>> obj = dict((k, list(v)) for k, v in groupby(range(10), keyfunc))
>>> obj[True]
[6, 7, 8, 9]
>>> obj[False]
[0, 1, 2, 3, 4, 5]


What you are missing is that groupby iterates lazily over your range(1000) and hands out each group as an iterator over that same range. By the time dict() is done, groupby has run through all 1000 values, and only the last one, 999 in your case, is still available. What you have to do is iterate over the returned pairs and save each group to your dictionary as a list:

from itertools import groupby

dictionary = {}
keyfunc = lambda x: x > 500
for k, g in groupby(range(1000), keyfunc):
    # Store the group as a list before groupby advances to the next one.
    dictionary[k] = list(g)

So you would get the expected output:

{False: [0, 1, 2, ...], True: [501, 502, 503, ...]}
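As a quick sanity check on that output, the sizes and end points of the two lists can be inspected. This is only a sketch; the name grouped is chosen here for illustration, and a dict comprehension stands in for the explicit loop.

```python
from itertools import groupby

keyfunc = lambda x: x > 500

# Turn each group into a list before groupby advances to the next one.
grouped = {key: list(group) for key, group in groupby(range(1000), keyfunc)}

# Size, first three values, and last value of each group.
print(len(grouped[False]), grouped[False][:3], grouped[False][-1])
print(len(grouped[True]), grouped[True][:3], grouped[True][-1])
```

The False list runs from 0 through 500 (501 numbers) and the True list from 501 through 999 (499 numbers), so together they account for all 1000 values.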

For more information, see the Python docs about itertools groupby.
