Is there an efficient way of concatenating scipy.sparse matrices?_问答_开发者

Is there an efficient way of concatenating scipy.sparse matrices?

开发者 https://www.devze.com 2023-03-23 11:59 出处：网络

I\'m working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need to find an efficient way to concatenate matrices in a flexible way in order to construct a stochastic matri

I'm working with some rather large sparse matrices (from 5000x5000 to 20000x20000) and need to find an efficient way to concatenate matrices in a flexible way in order to construct a stochastic matrix from separate parts.

Right now I'm using the following way to concatenate four matrices, but it's horribly inefficient. Is there any better way to do this that doesn't involve converting to a dense matrix?

rmat[0:m1.shape[0],0:m1.shape[1]] = m1
rmat[m1.shape[0]:rmat.shape[0],m1.shape[1]:rmat.shape[1]] = m2
rmat[0:m1.shape[0],m1.shape[1]:rmat.shape[1]] = bridg开发者_如何学运维e
rmat[m1.shape[0]:rmat.shape[0],0:m1.shape[1]] = bridge.transpose()

The sparse library now has hstack and vstack for respectively concatenating matrices horizontally and vertically.

Amos's answer is no longer necessary. Scipy now does something similar to this internally if the input matrices are in csr or csc format and the desired output format is set to none or the same format as the input matrices. It's efficient to vertically stack matrices in csr format, or to horizontally stack matrices in csc format, using scipy.sparse.vstack or scipy.sparse.hstack, respectively.

Using hstack, vstack, or concatenate, is dramatically slower than concatenating the inner data objects themselves. The reason is that hstack/vstack converts the sparse matrix to coo format which can be very slow when the matrix is very large not and not in coo format. Here is the code for concatenating csc matrices, similar method can be used for csr matrices:

def concatenate_csc_matrices_by_columns(matrix1, matrix2):
    new_data = np.concatenate((matrix1.data, matrix2.data))
    new_indices = np.concatenate((matrix1.indices, matrix2.indices))
    new_ind_ptr = matrix2.indptr + len(matrix1.data)
    new_ind_ptr = new_ind_ptr[1:]
    new_ind_ptr = np.concatenate((matrix1.indptr, new_ind_ptr))

    return csc_matrix((new_data, new_indices, new_ind_ptr))

Okay, I found the answer. Using scipy.sparse.coo_matrix is much much faster than using lil_matrix. I converted the matrices to coo (painless and fast) and then just concatenated the data, rows and columns after adding the right padding.

data = scipy.concatenate((m1S.data,bridgeS.data,bridgeTS.data,m2S.data))
rows = scipy.concatenate((m1S.row,bridgeS.row,bridgeTS.row + m1S.shape[0],m2S.row + m1S.shape[0]))
cols = scipy.concatenate((m1S.col,bridgeS.col+ m1S.shape[1],bridgeTS.col ,m2S.col + m1S.shape[1])) 

scipy.sparse.coo_matrix((data,(rows,cols)),shape=(m1S.shape[0]+m2S.shape[0],m1S.shape[1]+m2S.shape[1]) )