Importing csv into Numpy datetime64

开发者 https://www.devze.com 2023-04-10 00:54 Source: Internet
I am trying out the latest development version of NumPy 2.0:

np.__version__
Out[44]: '2.0.0.dev-aded70c'

I am trying to import CSV data that looks like this:

date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,

etc.

by using the following code:

convertdict = {0: lambda s: np.datetime64(s, 'm'), 1: lambda s: float(s or 0), 2: lambda s: float(s or 0), 3: lambda s: float(s or 0), 4: lambda s: float(s or 0), 5: lambda s: float(s or 0), 6: lambda s: float(s or 0), 7: lambda s: float(s or 0), 8: str, 9: str, 10: str}

dt = [('date', np.datetime64),('system', float), ('pumping', float),('rgt', 
float), ('agt', float), ('sps', float) ,('eskom_import', float),('temperature', float), ('wind', str), ('pressure', float), ('weather', str)]

a = np.recfromcsv(fp, dtype=dt, converters=convertdict, usecols=range(11),
names=True)

The dtype it generates for a.date is 'object':

array([2007-01-01T00:30+0200, 2007-01-01T01:00+0200, 2007-01-01T01:30+0200,
       ..., 2007-12-31T23:00+0200, 2007-12-31T23:30+0200,
       2008-01-01T00:00+0200], dtype=object)

But I need it to be datetime64, like in this example (but including hrs and minutes):

array(['2011-07-11', '2011-07-12', '2011-07-13', '2011-07-14',
       '2011-07-15', '2011-07-16', '2011-07-17'], dtype='datetime64[D]')

It seems that the CSV import stores 'date' as a generic object dtype rather than as datetime64. Any ideas on how to fix this?

Grové


I think the trick to avoiding the generic 'object' dtype is to avoid the recfromcsv function. Reading the data file manually and parsing each field yields the requested dtype='datetime64[m]':

import numpy as np

# 'U16' replaces np.str, which has been removed from NumPy and gave
# zero-length string fields; widen it if your strings are longer.
dt = np.dtype([ ('date',        '<M8[m]'),
                ('system',      '<f8'),
                ('pumping',     '<f8'),
                ('rgt',         '<f8'),
                ('agt',         '<f8'),
                ('sps',         '<f8'),
                ('eskom_import','<f8'),
                ('temperature', '<f8'),
                ('wind',        'U16'),
                ('pressure',    '<f8'),
                ('weather',     'U16') ])

def to_float(s):
    # Empty CSV fields become 0.0, as in the question's convertdict.
    return float(s or 0)

# Read all lines first so the record count is known before allocating.
with open('data.csv') as fid:
    fieldnames = fid.readline().strip().split(',')  # Header
    lines = [line.strip() for line in fid if line.strip()]

data = np.zeros(len(lines), dtype=dt)
for count, line in enumerate(lines):
    parsedline = line.split(',')
    data['date'][count]         = np.datetime64(parsedline[0], 'm')
    data['system'][count]       = to_float(parsedline[1])
    data['pumping'][count]      = to_float(parsedline[2])
    data['rgt'][count]          = to_float(parsedline[3])
    data['agt'][count]          = to_float(parsedline[4])
    data['sps'][count]          = to_float(parsedline[5])
    data['eskom_import'][count] = to_float(parsedline[6])
    data['temperature'][count]  = to_float(parsedline[7])
    data['wind'][count]         = parsedline[8]
    data['pressure'][count]     = to_float(parsedline[9])
    data['weather'][count]      = parsedline[10]

>>> data['date']
array(['2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500',
       '2007-01-01T00:30-0500', '2007-01-01T01:00-0500'], dtype='datetime64[m]')

You could certainly improve on this code by using your "convertdict" and iterating over parsedline, but the idea is the same.
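
As a rough sketch of that improvement, the version below keeps one converter per column and loops over the parsed fields instead of assigning each one by hand. The names COLUMNS, to_minutes, to_float and load_csv are made up for illustration, the 'U16' string width is an assumption, and the sample rows are copied from the question so the snippet runs without a file on disk.

```python
import io
import numpy as np

# Sample rows copied from the question; a real script would open the CSV file.
SAMPLE = """date,system,pumping,rgt,agt,sps,eskom_import,temperature,wind,pressure,weather
2007-01-01 00:30,481.9,,,,,481.9,15,SW,1040,Fine
2007-01-01 01:00,471.9,,,,,471.9,15,SW,1040,Fine
2007-01-01 01:30,455.9,,,,,455.9,,,,
"""

def to_minutes(s):
    # Use the 'T' separator so the string is plain ISO 8601.
    return np.datetime64(s.replace(' ', 'T'), 'm')

def to_float(s):
    # Empty fields become 0.0, mirroring the question's convertdict.
    return float(s or 0)

# (field name, dtype, converter) for each CSV column, in file order.
# The 'U16' width is an assumption; widen it for longer strings.
COLUMNS = [
    ('date',         'M8[m]', to_minutes),
    ('system',       'f8',    to_float),
    ('pumping',      'f8',    to_float),
    ('rgt',          'f8',    to_float),
    ('agt',          'f8',    to_float),
    ('sps',          'f8',    to_float),
    ('eskom_import', 'f8',    to_float),
    ('temperature',  'f8',    to_float),
    ('wind',         'U16',   str),
    ('pressure',     'f8',    to_float),
    ('weather',      'U16',   str),
]

def load_csv(fileobj):
    """Parse a CSV file object into a structured array with a datetime64[m] column."""
    next(fileobj)  # skip the header line
    rows = []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue
        fields = line.split(',')
        # Apply each column's converter to the matching field.
        rows.append(tuple(conv(f) for (_, _, conv), f in zip(COLUMNS, fields)))
    dt = np.dtype([(name, kind) for name, kind, _ in COLUMNS])
    return np.array(rows, dtype=dt)

data = load_csv(io.StringIO(SAMPLE))
print(data['date'].dtype)
```

Building the array from a list of tuples means the record count does not need to be known in advance, at the cost of holding the parsed rows in memory once before the structured array is created.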
