How to convert dumbo sequence file input to tab separated text

Developer https://www.devze.com 2022-12-09 23:46 Source: Internet
I have an input, which could be a single primitive or a list or tuple of primitives.

I'd like to flatten it to just a list, like so:

def flatten(values):
    return list(values)

The normal case would be flatten(someiterablethatisn'tastring)

But if values = '1234', I'd get ['1', '2', '3', '4'], whereas I want ['1234'].

And if values = 1, I'd get TypeError: 'int' object is not iterable, whereas I want [1].

Is there an elegant way to do this? What I really want to do in the end is just '\t'.join(flatten(values))

Edit: Let me explain this better...

I wish to convert a Hadoop binary sequence file to a flat tab-separated text file using dumbo, with the output format option -outputformat text.

Dumbo is a Python wrapper around Hadoop streaming. In short, I need to write a mapper function:

def mapper(key, values):
    # do some stuff
    yield k, v

where k is a string from the first part of the key, and v is a tab-separated string containing the rest of the key and the values as strings.

eg:

input: (123, [1,2,3])
output: ('123', '1\t2\t3')

or more complicated:

input: ([123, 'abc'], [1,2,3])
output: ('123', 'abc\t1\t2\t3')

The input key or value can be a primitive or a list/tuple of primitives. I'd like a "flatten" function that can deal with anything and return a list of values.

For the output value, I'll do something like this: v = '\t'.join(list(str(s) for s in flatten(seq)))


Sounds like you want itertools.chain(). You will need to special-case strings, though, since they are really just iterables of characters.
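
To make that concrete, here is a minimal sketch of the itertools.chain() approach with strings special-cased. It only flattens one level, and the names as_sequence and flatten_one_level are made up for illustration. The string_types shim is an assumption so the snippet runs on both Python 2 and Python 3.

```python
import itertools

try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3


def as_sequence(value):
    """Wrap a string or non-iterable in a one-item list; pass iterables through."""
    if isinstance(value, string_types):
        return [value]  # a string is one item, not a sequence of characters
    try:
        iter(value)
    except TypeError:
        return [value]  # scalar such as an int
    return value


def flatten_one_level(*values):
    """Chain the arguments together, treating strings and scalars as single items."""
    return list(itertools.chain.from_iterable(as_sequence(v) for v in values))


print(flatten_one_level([123, 'abc'], [1, 2, 3]))  # [123, 'abc', 1, 2, 3]
print(flatten_one_level('1234'))                   # ['1234']
print(flatten_one_level(1))                        # [1]
```

Nested lists inside the arguments are left as they are here, so this is only enough if the input is never deeper than a list of primitives.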

Update:

This is a much simpler problem if you do it as a recursive generator. Try this:

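def flatten(*seq):
    for item in seq:
        if isinstance(item, basestring):   # In Python 3: isinstance(item, str)
            yield item
        else:
            try:
                it = iter(item)
            except TypeError:
                yield item
                it = None
            if it is not None:
                # Unpack the iterator: passing it whole would recurse forever,
                # because iter() of an iterator returns the iterator itself.
                for obj in flatten(*it):
                    yield obj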

This returns an iterator instead of a list, but it's lazily evaluated, which is probably what you want anyway. If you really need a list, just use list(flatten(seq)) instead.
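
For anyone who wants to try this outside Python 2, here is a self-contained sketch of the same recursive-generator idea. The string_types shim and the to_tsv helper are my own additions for illustration, not part of dumbo or the code above; to_tsv just produces the tab-joined string the question asks for.

```python
try:
    string_types = basestring  # Python 2
except NameError:
    string_types = str  # Python 3


def flatten(*seq):
    """Lazily yield the leaves of arbitrarily nested lists/tuples."""
    for item in seq:
        if isinstance(item, string_types):
            yield item  # strings are leaves, not iterables of characters
            continue
        try:
            children = iter(item)
        except TypeError:
            yield item  # non-iterable primitive such as an int
        else:
            # Unpack so that each child becomes its own argument.
            for leaf in flatten(*children):
                yield leaf


def to_tsv(*values):
    """Hypothetical helper: flatten, convert each leaf to str, join with tabs."""
    return "\t".join(str(leaf) for leaf in flatten(*values))


print(list(flatten([123, 'abc'], [1, 2, 3])))  # [123, 'abc', 1, 2, 3]
print(list(flatten('1234')))                   # ['1234']
print(list(flatten(1)))                        # [1]
```

Nothing is computed until the generator is consumed, so wrapping it in list() or passing it to str.join() is what actually walks the input.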

Update 2:

As others have pointed out, if what you really want is to pass this into str.join(), then you will need to convert all the elements to strings. To do that, you can either replace yield foo with yield str(foo) throughout my example above, or just use code like the following:

"\t".join(str(o) for o in flatten(seq))


Based on your restated question, this mapper function might do what you want:

def mapper(key, values):
    r"""Specification: do some stuff yield k, v where k is a string from the
    first part in the key, and value is a tab separated string containing the
    rest of the key and the values as strings.

    >>> mapper(123, [1,2,3])
    ('123', '1\t2\t3')

    >>> mapper([123, 'abc'], [1,2,3])
    ('123', 'abc\t1\t2\t3')
    """
    if not isinstance(key, list):
        key = [key]
    k, v = key[0], key[1:]
    v.extend(values)
    return str(k), '\t'.join(map(str, v))

if __name__ == '__main__':
    import doctest
    doctest.testmod()

It looks like you'll probably want to change that return to a yield. This also assumes that the input key will always be a single item or a list of items (not a list of lists) and that the input values will always be a list of items (again, not a list of lists).
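
If it helps, here is a rough sketch of the yield version that also accepts tuples and a bare value on either side. It is only an illustration under those assumptions: it still expects flat input (no lists of lists), and it has not been run under dumbo itself.

```python
def mapper(key, values):
    """Yield (k, v): k is the first key part, v the tab-joined remainder."""
    # Normalise both arguments to lists so scalars and tuples are accepted.
    key_parts = list(key) if isinstance(key, (list, tuple)) else [key]
    value_parts = list(values) if isinstance(values, (list, tuple)) else [values]
    rest = key_parts[1:] + value_parts
    yield str(key_parts[0]), "\t".join(str(part) for part in rest)


for pair in mapper([123, 'abc'], [1, 2, 3]):
    print(pair)  # ('123', 'abc\t1\t2\t3')
```

Because it is now a generator, callers have to iterate over the result (or wrap it in list()) rather than use the return value directly.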

Does that meet your requirements?


I must say that the stated requirements are odd, and I don't think flatten is the right name for this kind of operation. But if you're really sure that this is what you want, then this is what I can distil from your question:

>>> import itertools 
>>> def to_list_of_strings(input):
...      if isinstance(input, basestring):   # In Py3k: isinstance(input, str)
...          return [input]
...      try:
...          return itertools.chain(*map(to_list_of_strings, input))
...      except TypeError:
...          return [str(input)]
... 
>>> '\t'.join(to_list_of_strings(8))
'8'
>>> '\t'.join(to_list_of_strings((1, 2)))
'1\t2'
>>> '\t'.join(to_list_of_strings("test"))
'test'
>>> '\t'.join(to_list_of_strings(["test", "test2"]))
'test\ttest2'
>>> '\t'.join(to_list_of_strings(range(4)))
'0\t1\t2\t3'
>>> '\t'.join(to_list_of_strings([1, 2, (3, 4)]))
'1\t2\t3\t4'