I am looking for a smarter and better solution.
I want to apply different scaling factors to a number field based on the label content. Hopefully the following code can illustrate what I am trying to achieve:
PS = [('A', 'LABEL1', 20),
('B', 'LABEL2', 15),
('C', 'LABEL3', 120),
('D', 'LABEL1', 3),]
FACTOR = [('LABEL1', 0.1), ('LABEL2', 0.5), ('LABEL3', 10)]
d_factor = dict(FACTOR)
for p in PS:
    # Scale the numeric field by the factor that matches the row's label
    newp = (p[0], p[1], p[2] * d_factor[p[1]])
    print(newp)
It is a very trivial operation, but I need to perform it on a dataset of at least one million rows.
So, of course, the faster the better.
The factors will be known in advance, and there will be no more than 20 to 30 of them.
Is there any matrix or linalg trick we can use?
Can an ndarray accept a text value in a cell?
If you want to mix data types you are going to want structured arrays.
If you want the index of matching values in a lookup array, you want np.searchsorted.
Your example goes like this:
>>> import numpy as np
>>> PS = np.array([
('A', 'LABEL1', 20),
('B', 'LABEL2', 15),
('C', 'LABEL3', 120),
('D', 'LABEL1', 3),], dtype=('a1,a6,i4'))
>>> FACTOR = np.array([
('LABEL1', 0.1),
('LABEL2', 0.5),
('LABEL3', 10)],dtype=('a6,f4'))
Your structured arrays:
>>> PS
array([('A', 'LABEL1', 20), ('B', 'LABEL2', 15), ('C', 'LABEL3', 120),
('D', 'LABEL1', 3)],
dtype=[('f0', '|S1'), ('f1', '|S6'), ('f2', '<i4')])
>>> FACTOR
array([('LABEL1', 0.10000000149011612), ('LABEL2', 0.5), ('LABEL3', 10.0)],
dtype=[('f0', '|S6'), ('f1', '<f4')])
And you can access individual fields like this (or you can give them names; see the docs):
>>> FACTOR['f0']
array(['LABEL1', 'LABEL2', 'LABEL3'],
dtype='|S6')
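As a minimal sketch of the "give them names" option mentioned above: the field names 'label' and 'factor' are illustrative choices, not something from the original example, and the factor is stored as float64 here rather than float32.

```python
import numpy as np

# Hypothetical field names; any valid identifiers would do.
factor_dtype = np.dtype([('label', 'S6'), ('factor', 'f8')])
FACTOR = np.array([(b'LABEL1', 0.1), (b'LABEL2', 0.5), (b'LABEL3', 10.0)],
                  dtype=factor_dtype)

labels = FACTOR['label']     # 1-D array of byte strings
factors = FACTOR['factor']   # 1-D float64 array
```

Named fields are accessed the same way as 'f0' and 'f1', just with more readable keys.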
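How to perform the lookup of FACTOR on PS (FACTOR must be sorted by label, and every label in PS must be present in FACTOR, because searchsorted returns insertion points rather than exact matches):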
>>> idx = np.searchsorted(FACTOR['f0'], PS['f1'])
>>> idx
array([0, 1, 2, 0])
>>> FACTOR['f1'][idx]
array([ 0.1, 0.5, 10. , 0.1], dtype=float32)
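Now simply create a new array and multiply. Note that 'f2' is an integer field, so the products are truncated (15 * 0.5 becomes 7 and 3 * 0.1 becomes 0); declare it as a float field if you need the fractional part. Recent NumPy versions also refuse to do this in-place multiply of an integer field by floats: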
>>> newp = PS.copy()
>>> newp['f2'] *= FACTOR['f1'][idx]
>>> newp
array([('A', 'LABEL1', 2), ('B', 'LABEL2', 7), ('C', 'LABEL3', 1200),
('D', 'LABEL1', 0)],
dtype=[('f0', '|S1'), ('f1', '|S6'), ('f2', '<i4')])
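The steps above can be pulled together in one function. This is only a sketch under some assumptions: the helper name scale_by_label is made up for illustration, it works on plain columns rather than a structured array, it sorts the factor table itself, and it returns floats so that fractional results are kept.

```python
import numpy as np

def scale_by_label(values, labels, factor_labels, factor_values):
    """Multiply each value by the factor whose label matches its row label."""
    factor_labels = np.asarray(factor_labels)
    factor_values = np.asarray(factor_values, dtype=float)
    labels = np.asarray(labels)

    # searchsorted needs a sorted lookup array, so sort the factor table first.
    order = np.argsort(factor_labels)
    sorted_labels = factor_labels[order]
    sorted_values = factor_values[order]

    idx = np.searchsorted(sorted_labels, labels)
    # searchsorted returns insertion points, so confirm each one is a real match.
    idx = np.clip(idx, 0, len(sorted_labels) - 1)
    if not np.all(sorted_labels[idx] == labels):
        raise KeyError("some labels have no factor")

    # Float result, so 15 * 0.5 stays 7.5 rather than being truncated to 7.
    return np.asarray(values, dtype=float) * sorted_values[idx]

scaled = scale_by_label(
    [20, 15, 120, 3],
    ['LABEL1', 'LABEL2', 'LABEL3', 'LABEL1'],
    ['LABEL3', 'LABEL1', 'LABEL2'],   # deliberately unsorted
    [10, 0.1, 0.5],
)
```

The whole lookup is a handful of vectorized calls, so the cost should grow roughly linearly with the number of rows; I have not timed it on a million-row dataset.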
If you compare a numpy array with a value, you get a boolean mask marking the matching positions. You can use that mask to do collective operations. This probably isn't the fastest approach, but it is simple and clear. If PS needs to have the structure you show, you can use a custom dtype and keep the three columns in one structured array.
import numpy as np
col1 = np.array(['a', 'b', 'c', 'd'])
col2 = np.array(['1', '2', '3', '1'])
col3 = np.array([20., 15., 120., 3.])
factors = {'1': 0.1, '2': 0.5, '3': 10, }
# One pass per label: scale every row whose label matches
for label, fac in factors.items():
    col3[col2 == label] *= fac
print(col3)
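The loop above makes one pass over the data per label. A possible variant, sketched here as an assumption rather than something from the answer, uses np.unique with return_inverse=True to map every row to its factor in a single vectorized step:

```python
import numpy as np

col2 = np.array(['1', '2', '3', '1'])
col3 = np.array([20., 15., 120., 3.])
factors = {'1': 0.1, '2': 0.5, '3': 10}

# uniq holds each distinct label once; inverse maps every row to its slot in uniq.
uniq, inverse = np.unique(col2, return_inverse=True)
# Look up each distinct label in the dict (raises KeyError for an unknown label).
per_label = np.array([factors[label] for label in uniq], dtype=float)
scaled = col3 * per_label[inverse]
```

With only 20 to 30 labels the mask loop may well be fast enough, so it is worth timing both on real data before choosing.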
I don't think numpy can help you with that. BTW, it is ndarray, not nparray...
Maybe you could do it with a generator. See http://www.dabeaz.com/generators/index.html
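A rough idea of what the generator approach might look like, using the data from the question. The function name scaled_rows is hypothetical. Note that a generator does not make the arithmetic itself faster; it only avoids holding a second full copy of the data in memory.

```python
PS = [('A', 'LABEL1', 20),
      ('B', 'LABEL2', 15),
      ('C', 'LABEL3', 120),
      ('D', 'LABEL1', 3)]
d_factor = {'LABEL1': 0.1, 'LABEL2': 0.5, 'LABEL3': 10}

def scaled_rows(rows, factors):
    """Yield one scaled row at a time without building a full list."""
    for name, label, value in rows:
        yield (name, label, value * factors[label])

for row in scaled_rows(PS, d_factor):
    print(row)
```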