I have a very large MySQL query in my web app that looks like this:
query =
SELECT video_tag.video_id, (SUM(user_rating.rating) * video.rating_norm) AS score
FROM video_tag
JOIN user_rating ON user_rating.item_id = video_tag.tag_id
JOIN video ON video.id = video_tag.video_id
WHERE item_type = 3 AND user_id = 1 AND rating != 0 AND video.website_id = 2
  AND rating_norm > 0 AND video_id NOT IN (1,2,3)
GROUP BY video_id
ORDER BY score DESC LIMIT 20
This query joins three tables (video, video_tag, and user_rating), groups the results, and does some basic arithmetic to compute a score for each video. It takes about 2 seconds to run because the tables are large.
Instead of making SQL do all this work, I suspect it would be faster to do this computation with NumPy arrays. The data in 'video' and 'video_tag' is constant, so I could just load those tables into memory once and not have to ping SQL each time.
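Loading a table into an array seems straightforward; here's roughly what I have in mind (a sketch, assuming a DB-API connection conn, e.g. from MySQLdb):

import numpy as NP

cur = conn.cursor()
cur.execute("SELECT id, rating_norm, website_id FROM video")
video = NP.array(cur.fetchall())   # built once at startup, reused thereafter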
However, while I can load these three tables into three separate arrays, I'm having a heck of a time replicating the above query (specifically the JOIN and GROUP BY parts). Does anyone have experience replicating SQL queries with NumPy arrays?
Thanks!
What makes this exercise awkward is NumPy arrays' single-data-type constraint. For instance, the GROUP BY operation implicitly requires (at least) one column of numeric values to aggregate/sum and one column to partition, or group, by.
Of course, NumPy recarrays can represent a 2D array (or SQL table) with a different data type per column (aka 'field'), but I find these composite arrays cumbersome to work with. So in the code snippets below, I just used the conventional ndarray class to replicate the two SQL operations highlighted in the OP's question.
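For completeness, here's a minimal sketch of the structured-array alternative I'm setting aside (the field names and values are just illustrative):

import numpy as NP

# one named, separately-typed field per SQL column
video = NP.array([(1, 0.9), (2, 1.1)],
                 dtype=[('id', 'i4'), ('rating_norm', 'f4')])
video['rating_norm']    # column access by field name, much like SQL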
To mimic SQL JOIN in NumPy:
First, create two NumPy arrays (A and B), each representing an SQL table. The first column of A holds the foreign key; the first column of B holds the matching primary key.
import numpy as NP

# A: 8 rows of 5 data columns, with a key column prepended below
A = NP.random.randint(10, 100, 40).reshape(8, 5)
a = NP.random.randint(1, 3, 8).reshape(8, -1)   # foreign-key column (values 1 or 2)
A = NP.column_stack((a, A))

# B: 2 rows of 2 data columns, with a primary-key column prepended below
B = NP.random.randint(0, 10, 4).reshape(2, 2)
b = NP.array([1, 2])                            # primary-key column
B = NP.column_stack((b, B))
Now (attempt to) replicate JOIN using NumPy array objects:
# prepare the array that will hold the 'result set':
AB = NP.column_stack((A, NP.zeros((A.shape[0], B.shape[1] - 1))))

def join(A, B):
    '''
    Returns None; the side effect is population of the 'result set' array AB.
    Pass in A and B, two 2D NumPy arrays representing the two SQL tables to join.
    '''
    # build a lookup table keyed on B's primary key -- in effect, a hash join
    k, v = B[:, 0], B[:, 1:]
    dx = dict(zip(k, v))
    # probe the lookup table once per row of A, writing B's data
    # columns into the last columns of AB
    for i in range(A.shape[0]):
        AB[i, -(B.shape[1] - 1):] = dx[A[i, 0]]
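Note that this is a strict inner join that assumes referential integrity: every key in A's first column must also appear in B, otherwise the dict lookup raises a KeyError rather than silently dropping the unmatched row.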
To mimic SQL GROUP BY in NumPy:
def group_by(AB, col_id):
    '''
    Returns a 2D NumPy array aggregated on the unique values in the column
    specified by col_id; pass in a 2D NumPy array and the col_id (integer)
    of the column holding the values to group by.
    '''
    uv = NP.unique(AB[:, col_id])               # the distinct group keys (sorted)
    val_cols = [j for j in range(AB.shape[1]) if j != col_id]
    temp = []
    for v in uv:
        ndx = AB[:, col_id] == v                # boolean row mask for this group
        temp.append(NP.sum(AB[ndx][:, val_cols], axis=0))  # per-column sums
    temp = NP.vstack(temp)
    return NP.column_stack((uv.reshape(-1, 1), temp))
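As an aside, when the group keys are small non-negative integers (as they are here), the Python loop can be replaced with a fully vectorized version built on NP.bincount; a sketch (group_by_bincount is a name I'm introducing for illustration):

def group_by_bincount(AB, col_id):
    # bincount(keys, weights=w) returns, for each key value, the sum of w
    # over the rows carrying that key -- i.e., a grouped sum in one call
    keys = AB[:, col_id].astype(int)
    uv = NP.unique(keys)
    sums = [NP.bincount(keys, weights=AB[:, j])[uv]
            for j in range(AB.shape[1]) if j != col_id]
    return NP.column_stack([uv] + sums)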
For a test case, both functions return the correct result:
>>> A
array([[ 1, 92, 50, 67, 51, 75],
[ 2, 64, 35, 38, 69, 11],
[ 1, 83, 62, 73, 24, 55],
[ 2, 54, 71, 38, 15, 73],
[ 2, 39, 28, 49, 47, 28],
[ 1, 68, 52, 28, 46, 69],
[ 2, 82, 98, 24, 97, 98],
[ 1, 98, 37, 32, 53, 29]])
>>> B
array([[1, 5, 4],
[2, 3, 7]])
>>> join(A, B)
array([[ 1., 92., 50., 67., 51., 75., 5., 4.],
[ 2., 64., 35., 38., 69., 11., 3., 7.],
[ 1., 83., 62., 73., 24., 55., 5., 4.],
[ 2., 54., 71., 38., 15., 73., 3., 7.],
[ 2., 39., 28., 49., 47., 28., 3., 7.],
[ 1., 68., 52., 28., 46., 69., 5., 4.],
[ 2., 82., 98., 24., 97., 98., 3., 7.],
[ 1., 98., 37., 32., 53., 29., 5., 4.]])
>>> group_by(AB, 0)
array([[ 1., 341., 201., 200., 174., 228., 20., 16.],
[ 2., 239., 232., 149., 228., 210., 12., 28.]])
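To get all the way back to the OP's query, the remaining pieces (the score arithmetic, ORDER BY score DESC, and LIMIT 20) are one-liners once the grouped array exists. A sketch, where the column positions are assumptions about how the real joined array would be laid out:

grouped = group_by(AB, 0)                       # video_id plus per-group sums
score = grouped[:, 1] * grouped[:, -1]          # e.g. sum(rating) * rating_norm
top20 = grouped[NP.argsort(score)[::-1][:20]]   # ORDER BY score DESC LIMIT 20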