Splitting C string in Python

Developer https://www.devze.com 2023-04-09 04:49 Source: Internet

I would like to split a string similar to

'abc "defg hijk \\"l; mn\\" opqrs"; tuv'

into

(['abc', '"defg hijk \\"l; mn\\" opqrs"'], 33)

i.e. I don't want to break on semicolon inside (nested) quotes. What's the easiest way, tokenize? It doesn't hurt if it's fast, but short is better.

Edit: I forgot one more detail that makes it even more tricky. I need the position of the semicolon that is cutting off the string, or -1 if there is none. (I'm doing changes to legacy code that used to be recursive, but stackoverflowed when the string became very long.)

It's unlikely there is an easy way to solve this without a proper parser. You could probably get away with a hand-built parser that doesn't require tokenizing, though.
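To see why a plain split is not enough, here is a quick check using the string from the question. It is only an illustration of the problem, not part of any solution: `str.split` and `str.find` know nothing about quotes, so they stop at the semicolon inside the quoted section.

```python
s = 'abc "defg hijk \\"l; mn\\" opqrs"; tuv'

# A stateless split cuts at every ';', including the one inside the quotes,
# so the quoted section ends up broken across two pieces.
parts = s.split(';')
print(parts)

# Likewise, the first ';' that find() reports is the quoted one,
# not the one that actually terminates the statement.
first = s.find(';')
print(first, repr(s[:first]))
```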

Something like the following should be a good guide:

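def parse(s):
    cur_s = []      # characters of the word currently being built
    strings = []    # completed words

    def flush_string():
        # Emit the current word (if any) and start a new one.
        # Clearing in place avoids rebinding cur_s inside the closure.
        if cur_s:
            strings.append(''.join(cur_s))
            del cur_s[:]

    def handle_special_cases(c):
        # TODO: Fill this in (quote and backslash handling)
        pass

    for c in s:
        if c == ';':
            break
        elif c in ['\\', '"']:
            handle_special_cases(c)
        elif c == ' ':
            flush_string()
        else:
            cur_s.append(c)

    flush_string()
    return strings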

It's a stateful search, so simple stateless operations are not available. Here's a simple char-by-char stateful evaluator that might meet your "short" requirement without resorting to full tokenization/parsing:

#!/usr/bin/env python

inp="""abc "defg hijk \\"l; mn\\" opqrs"; tuv'`"""

def words_to_semi(inpstr):
    ret = ['']
    st8 = 1  # state: 1=reg, 2=in quotes, 3=escaped quote, 4=escaped reg, 0=end
    ops = { 1 : {' ': lambda c: (None,1),
                 '"': lambda c: (c,2),
                 ';': lambda c: ('',0),
                 '\\': lambda c: (c,4),
                 },
            2 : {'\\': lambda c: (c,3),
                 '"':  lambda c: (c,1),
                 },
            3 : {None: lambda c: (c,2)},
            4 : {None: lambda c: (c,1)},
            }
    pos = 0

    for C in inpstr:
        oc,st8 = ops[st8].get(C, ops[st8].get(None, lambda c:(c,st8)))(C)
        if not st8: break
        if oc is None:
            ret.append('')
        else:
            ret[-1] += oc
        pos = pos + 1
    return ret, pos

print(words_to_semi(inp))

Just modify the ops dict (and add new states) to handle other cases; everything else is generic code.
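As a rough sketch of what such a modification could look like, the version below passes the table in as a parameter and adds two things that are purely my assumptions, not requirements from the question: a tab as an extra word separator, and a hypothetical state 5 for single-quoted sections. The name `words_to_semi_with` is made up for this example; the loop itself is the same generic code as above.

```python
def words_to_semi_with(ops, inpstr):
    # Same generic loop as words_to_semi, but the transition table is passed in.
    ret = ['']
    st8 = 1
    pos = 0
    for C in inpstr:
        # Fallback: the state's None entry, else keep the char and the state.
        default = ops[st8].get(None, lambda c, s=st8: (c, s))
        oc, st8 = ops[st8].get(C, default)(C)
        if not st8:
            break
        if oc is None:
            ret.append('')
        else:
            ret[-1] += oc
        pos += 1
    return ret, pos

# States: 1=reg, 2=in double quotes, 3=escaped char in double quotes,
# 4=escaped reg, 5=in single quotes (new, assumed), 0=end
OPS = {
    1: {' ':  lambda c: (None, 1),
        '\t': lambda c: (None, 1),   # new: tab also separates words
        '"':  lambda c: (c, 2),
        "'":  lambda c: (c, 5),      # new: enter the single-quoted state
        ';':  lambda c: ('', 0),
        '\\': lambda c: (c, 4)},
    2: {'\\': lambda c: (c, 3),
        '"':  lambda c: (c, 1)},
    3: {None: lambda c: (c, 2)},
    4: {None: lambda c: (c, 1)},
    5: {"'":  lambda c: (c, 1)},     # new: leave the single-quoted state
}

print(words_to_semi_with(OPS, "ab\t'c; d' e; f"))
```

Only the table changed: the semicolon inside the single quotes is kept as part of the word, and the loop stops at the later, unquoted one.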

Here's the brute-force method I went with. Brrr...

def f(s):
    instr = False
    inescape = False
    a = ''
    rs = []
    cut_index = -1
    for idx,ch in enumerate(s):
        if instr:
            a += ch
            if inescape:
                inescape = False
            elif ch == '\\':
                inescape = True
            elif ch == '"':
                if a:
                    rs += [a]
                    a = ''
                instr = False
        elif ch == '"':
            if a:
                rs += [a]
            a = ch
            instr = True
        elif ch == ';':
            if a:
                rs += [a]
            cut_index = idx
            break
        elif ch == ' ' or ch == '\t' or ch == '\n':
            if a:
                rs += [a]
                a = ''
        else:
            a += ch
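    else:
        # for/else: runs only when no semicolon was found (no break),
        # so the last pending token is not silently dropped.
        if a:
            rs += [a]
    return rs, cut_index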

print(f('abc "defg hijk \\"l; mn\\" opqrs"; tuv'))
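Since the legacy code was recursive and overflowed the stack on long strings, a driver loop along these lines might be a useful complement. It is a minimal sketch that walks the whole string iteratively, one semicolon-terminated chunk at a time. `find_cut` and `split_statements` are hypothetical names, and the quoting rules (double quotes, backslash escapes inside them) are assumed to match the function above.

```python
def find_cut(s, start=0):
    """Return the index of the first ';' outside double quotes at or
    after start, or -1 if there is none."""
    in_str = False
    in_escape = False
    for idx in range(start, len(s)):
        ch = s[idx]
        if in_str:
            if in_escape:
                in_escape = False      # escaped char is taken literally
            elif ch == '\\':
                in_escape = True
            elif ch == '"':
                in_str = False         # closing quote
        elif ch == '"':
            in_str = True              # opening quote
        elif ch == ';':
            return idx
    return -1


def split_statements(s):
    """Yield each ';'-terminated chunk of s using a loop, not recursion."""
    start = 0
    while True:
        cut = find_cut(s, start)
        if cut == -1:
            if start < len(s):
                yield s[start:]        # trailing chunk without a ';'
            return
        yield s[start:cut]
        start = cut + 1                # resume right after the ';'


for chunk in split_statements('abc "x; y"; tuv; w'):
    print(repr(chunk))
```

Because the position is carried in `start` instead of in nested calls, the depth of the call stack no longer depends on the length of the input.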