
How to read a CSV line with "?

Developer https://www.devze.com 2022-12-18 07:18 Source: the web
A trivial CSV line can be split using a string split function. But some lines contain ", e.g.

"good,morning", 100, 300, "1998,5,3"

thus directly using string split would not solve the problem.
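
To make the failure concrete, here is a small sketch that applies Python's built-in str.split to the example line above. The variable names are only illustrative.

```python
line = '"good,morning", 100, 300, "1998,5,3"'

# A plain split treats every comma as a separator, including the ones
# inside the quoted fields.
parts = line.split(",")
print(parts)
# ['"good', 'morning"', ' 100', ' 300', ' "1998', '5', '3"']
```

The line has four fields, but the split produces seven pieces, and the quotes are left attached to the fragments.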

My solution is to first split the line on , and then recombine the pieces that have a " at the beginning or end of the string.

What's the best practice for this problem?

I am interested if there's a Python or F# code snippet for this.

EDIT: I am more interested in the implementation detail, rather than using a library.


There's a csv module in Python, which handles this.

Edit: This task falls into "build a lexer" category. The standard way to do such tasks is to build a state machine (or use a lexer library/framework that will do it for you.)

The state machine for this task would probably only need two states:

  • The initial state, in which every character except a comma or newline is read as part of the current field (exception: leading and trailing spaces are dropped), a comma is the field separator, and a newline is the record separator. When it encounters an opening quote, it switches to
  • the read-quoted-field state, in which every character other than a quote (including commas and newlines) is treated as part of the field. A quote not followed by another quote ends the quoted field (back to the initial state); a quote followed by a quote is treated as a single literal quote (an escaped quote).
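
The two states described above could be sketched in Python roughly as follows. This is a minimal illustration, not the csv module's actual implementation: the function name parse_csv, the state names, and the sep parameter are all assumptions made for the example, and only the double quote is treated as a quote character.

```python
def parse_csv(text, sep=","):
    """Parse CSV text into a list of records (each a list of field strings)."""
    INITIAL, QUOTED = "initial", "quoted"   # hypothetical state names
    state = INITIAL
    records, fields, buf = [], [], []
    quoted = False  # True once the current field has had a quoted part

    def end_field():
        nonlocal buf, quoted
        value = "".join(buf)
        # Unquoted fields lose leading/trailing spaces; quoted ones are kept.
        fields.append(value if quoted else value.strip())
        buf, quoted = [], False

    def end_record():
        nonlocal fields
        end_field()
        records.append(fields)
        fields = []

    i = 0
    while i < len(text):
        ch = text[i]
        if state == INITIAL:
            if ch == '"' and not "".join(buf).strip():
                # Opening quote at the start of a field: drop any leading
                # spaces collected so far and switch state.
                buf, quoted, state = [], True, QUOTED
            elif ch == sep:
                end_field()
            elif ch == "\n":
                end_record()
            elif not (quoted and ch.isspace()):
                # Whitespace after a closing quote is ignored.
                buf.append(ch)
        else:  # QUOTED
            if ch == '"':
                if text[i + 1:i + 2] == '"':
                    buf.append('"')   # doubled quote: one literal quote
                    i += 1
                else:
                    state = INITIAL   # closing quote
            else:
                buf.append(ch)        # commas and newlines are field data
        i += 1

    # Flush the last record if the text did not end with a newline.
    if buf or fields or quoted:
        end_record()
    return records


print(parse_csv('"good,morning", 100, 300, "1998,5,3"'))
# [['good,morning', '100', '300', '1998,5,3']]
```

An unterminated quote simply runs to the end of the input here; a stricter parser would probably want to report that as an error.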

By the way, your concatenating solution will break on "Field1","Field2" or "Field1"",""Field2".


From the documentation of Python's csv module:

Reading a normal CSV file:

import csv
with open("some.csv", newline="") as f:  # Python 3: text mode, newline=""
    for row in csv.reader(f):
        print(row)

Reading a file with an alternate format:

import csv
with open("passwd", newline="") as f:
    for row in csv.reader(f, delimiter=":", quoting=csv.QUOTE_NONE):
        print(row)
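
For the single line from the question, the same module can be used without a file, since csv.reader accepts any iterable of strings. The sketch below assumes the spaces after the commas should be ignored, which is what the skipinitialspace option does. Without it, the quote in the last field would not be at the start of the field and would be treated as ordinary text.

```python
import csv

line = '"good,morning", 100, 300, "1998,5,3"'

# Wrap the line in a list so the reader sees a one-line "file".
row = next(csv.reader([line], skipinitialspace=True))
print(row)  # ['good,morning', '100', '300', '1998,5,3']
```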

There are some nice usage examples on LinuxJournal.com.

If you're interested in the details, read "split string at commas respecting quotes when string not in csv format", which shows some nice regexen related to this problem, or simply read the csv module source.
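
As an illustration of the regex approach, here is one commonly used pattern; it is not necessarily one of the regexes from the linked question. It splits on a comma only when an even number of double quotes follows it, which means the comma is outside any quoted field. The names SPLIT_RE and split_csv_line are made up for this sketch.

```python
import re

# A comma is a separator only if the rest of the line contains an even
# number of double quotes (i.e. the comma is not inside a quoted field).
SPLIT_RE = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_csv_line(line):
    """Split one CSV line on commas that are outside double quotes."""
    fields = []
    for raw in SPLIT_RE.split(line):
        raw = raw.strip()
        if len(raw) >= 2 and raw[0] == raw[-1] == '"':
            # Remove the surrounding quotes and unescape doubled quotes.
            raw = raw[1:-1].replace('""', '"')
        fields.append(raw)
    return fields


print(split_csv_line('"good,morning", 100, 300, "1998,5,3"'))
# ['good,morning', '100', '300', '1998,5,3']
```

The lookahead rescans the rest of the line for every comma, so this is fine for short lines but slower than a single-pass state machine on long ones, and it cannot handle fields that span several lines.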


Chapter 4 of The Practice of Programming gives both C and C++ implementations of a CSV parser.


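A generic implementation would look something like this (untested; it does not handle escaped quotes):

SEPARATOR = ","

def csvline2fields(line):
    """Split one CSV line into fields; quoted fields may contain SEPARATOR."""
    fields = []
    line = line.strip()
    while line:
        if line[0] in ("'", '"'):
            # Find the matching closing quote, searching after the opening one.
            end = line.find(line[0], 1)
            if end == -1:
                end = len(line)  # unterminated quote: take the rest of the line
            fields.append(line[1:end])
            # Find the separator that follows the closing quote.
            sep = line.find(SEPARATOR, end)
        else:
            # Unquoted field: it runs up to the next separator (or end of line).
            sep = line.find(SEPARATOR)
            fields.append((line if sep == -1 else line[:sep]).strip())
        if sep == -1:
            break  # that was the last field
        line = line[sep + 1:].strip()
    return fields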