
Errors Writing CSVs with Python

https://www.devze.com 2023-04-05 06:21 Source: web

I'm encountering errors in the .csv files that I'm writing with Python (the necessary format, because I'm on a team that depends on .csvs). In a non-patterned way, errors are creeping up across hundreds of 1 GB files. For example: an extra 10 columns in only 1 row, an extra row with erroneous inputs, or certain rows missing ~10 columns. I have re-run the same script twice, and on the second run the errors are absent. I need a way to ensure that these files are being written properly. Here is the code I'm using (I know it isn't the most efficient, but I knew how to do it in this fashion, and I wanted to post it as I was doing it).

# Sample inputs, representative of the actual data I'm working with.
import csv
import numpy as np

output = np.zeros([40000, 1000])  # for example
iso3 = 'ALB'
sex = 'M'
post_year = np.ones(40000)
post_age = np.ones(40000)  # placeholder for the age array (definition omitted here)
post_env = np.repeat(10, 40000)
post_cause = np.repeat('a', 40000)
post_pop = np.repeat(100, 40000)

# One string table holding the 7 label columns plus all draw columns.
outsheet = np.zeros([output.shape[0], output.shape[1] + 7], dtype='|S20')
outsheet[:, 0] = iso3
outsheet[:, 1] = sex
outsheet[:, 2] = np.array(post_year, dtype='|S20')
outsheet[:, 3] = np.array(post_age, dtype='|S20')
outsheet[:, 4] = np.array(post_cause, dtype='|S20')
outsheet[:, 5] = np.array(post_env, dtype='|S20')
outsheet[:, 6] = np.array(post_pop, dtype='|S20')
outsheet[:, 7:] = np.array(output, dtype='|S20')

outsheet[outsheet=='nan'] = '.'
first_row = ['draw' + str(i) for i in range(output.shape[1])]
first_row.insert(0, 'population')
first_row.insert(0, 'envelope')
first_row.insert(0, 'cause')
first_row.insert(0, 'age')
first_row.insert(0, 'year')
first_row.insert(0, 'sex')
first_row.insert(0, 'iso3')
with open('filename', 'w') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(first_row)
    writer.writerows(outsheet)

Errors have even included random numeric values in the first column (which should all be 'ALB'), an extra set of rows for an observation, and an observation missing columns (post-writing).
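Since the immediate need is to confirm the files are written properly, one simple post-write check is to read each file back and verify that every row has the number of fields the header promises. This is a minimal sketch, assuming the layout above (a header row of 7 labels plus one column per draw); `validate_csv` is a hypothetical helper name, not part of the original script.

```python
import csv

def validate_csv(path):
    """Return a list of (row_number, field_count) for rows whose width
    doesn't match the header; an empty list means the file looks clean."""
    bad_rows = []
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)
        expected = len(header)
        # Data rows start at line 2 (line 1 is the header).
        for line_no, row in enumerate(reader, start=2):
            if len(row) != expected:
                bad_rows.append((line_no, len(row)))
    return bad_rows
```

Running this immediately after each write (and again before any downstream job consumes the file) would at least catch the extra-column and missing-column cases described above, so the script can rewrite a file instead of shipping a corrupt one.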


As an aside, using xrange instead of range is usually faster in Python 2.

Are you absolutely certain that the memory and disk on the machine that runs the job are good? Since your data ranges into the hundreds of gigabytes, it's not out of the question that you're seeing hardware-based corruption. Even if the machine seems to run stably without crashing, single-bit memory errors are pretty common at these data densities. If any of the hardware is marginal, this is the sort of behavior I'd expect.

Are your disks running a checksummed RAID format? (ZFS is my favorite.) Are you using ECC memory? Do you see more errors when it's hot during the day? Do you see these errors on the machine itself, or after transferring over a network?

How long does your operation take to run? Do you see more errors towards the end?
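One way to narrow down where the corruption happens is to hash the bytes as they are written, force them to disk, then re-read the file and compare digests. A mismatch right after writing points at the disk/filesystem path; a clean check at write time followed by errors later (or after a network copy) points elsewhere. This is a sketch under those assumptions; `write_and_verify` is a hypothetical helper, not an existing API.

```python
import hashlib
import os

def write_and_verify(path, rows):
    """Write CSV-style lines while hashing them, fsync, then re-read
    the file and return True if the on-disk bytes match what was written."""
    h_written = hashlib.sha256()
    with open(path, 'wb') as f:
        for row in rows:
            line = (','.join(row) + '\r\n').encode('ascii')
            h_written.update(line)
            f.write(line)
        f.flush()
        os.fsync(f.fileno())  # push the data past the OS cache to the disk
    h_read = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h_read.update(chunk)
    return h_written.hexdigest() == h_read.hexdigest()
```

Keeping the write-time digest around also lets you re-hash the file after a network transfer and confirm it still matches.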

