Loading specific variable with indexing from a MAT-file

I have a framework on a machine with lots of RAM which produces MAT-files with one very large and specifically named matrix. The computation of this matrix is carried out only once and takes a lot of time. Finally, it is stored to a MAT-file on disk.

During the usage phase, this MAT-file should be loaded. The problem is that I don't need all the data, only a certain selection of columns from that matrix.

For example, I have a matrix 'sign' of size [500x250000] and type double in the file crfh.mat. I may be interested in loading only the columns indexed by 'ids' from that matrix:

sign( :, ids )

Is there a way to do that? I searched the web and no one seems to have expressed the need for such functionality. I am thinking of writing a MEX function select_mat() like:

sign_sub = select_mat( mat_file, var_name, ids );

If you have one really large matrix that you only want to load parts of, I would not save it as a .MAT file. It would be more efficient to write the matrix to its own binary file. Then you could use functions like FSEEK to skip to various indexed points in the file and read only what you need. For example, let's first save a smaller sample matrix to a binary file using the function FWRITE:

>> M = magic(5)  %# A sample matrix
M =
    17    24     1     8    15
    23     5     7    14    16
     4     6    13    20    22
    10    12    19    21     3
    11    18    25     2     9

>> fid = fopen('bigmatrix.dat','w');  %# Open the file for writing
>> fwrite(fid,size(M),'uint8','l');   %# Write the matrix size (needed later) as
                                      %#   2 unsigned 8-bit (1-byte) integers
>> fwrite(fid,M,'uint8','l');         %# Write the matrix data as unsigned 8-bit
                                      %#   (1-byte) integers
>> fclose(fid);                       %# Close the file

Now, we can read just the third column using the functions FREAD and FSEEK:

>> colIndex = 3;
>> fid = fopen('bigmatrix.dat','r');    %# Open the file for reading
>> sizeM = fread(fid,2,'uint8','l');    %# Read the first two bytes to get the
                                        %#   size of the matrix in the file
>> fseek(fid,sizeM(1)*(colIndex-1),0);  %# Seek forward by an amount of two
                                        %#   columns worth of bytes
>> colData = fread(fid,sizeM(1),'uint8','l');  %# Read column 3 data
>> fclose(fid);                         %# Close the file
>> disp(colData)                        %# Confirm that the right column was read
     1
     7
    13
    19
    25

This is just a simple example. You would probably want to write other information to the file (i.e. header information) such as the byte size or data type of each value in the matrix. This may seem like more work than just dumping things to a .MAT file, and it is, but if the efficiency of file IO operations is a big concern it's better to create your own file format to handle your data in this case.
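As a rough sketch of how the same idea might carry over to a matrix of doubles like the one in the question, the example below stores the dimensions in a small header and then reads an arbitrary list of columns with one fseek per column. The file name sign_demo.dat, the header layout (two uint32 values), and the small stand-in matrix S are assumptions made for illustration and are not part of the answer above.

```matlab
% Stand-in for the large matrix (hypothetical small size for the demo)
S = reshape(1:500*40, 500, 40);
ids = [3 17 5];                          % columns to load, in any order

% --- Write: header of 2 x uint32 (rows, cols), then the data as double ---
fid = fopen('sign_demo.dat', 'w', 'l');  % 'l' = little-endian byte order
fwrite(fid, size(S), 'uint32');
fwrite(fid, S, 'double');                % column-major, 8 bytes per value
fclose(fid);

% --- Read only the requested columns ---
headerBytes   = 2*4;                     % two uint32 values
bytesPerValue = 8;                       % size of one double
fid  = fopen('sign_demo.dat', 'r', 'l');
dims = fread(fid, 2, 'uint32');          % dims(1) = rows, dims(2) = cols
sub  = zeros(dims(1), numel(ids));
for k = 1:numel(ids)
    offset = headerBytes + (ids(k)-1)*dims(1)*bytesPerValue;
    fseek(fid, offset, 'bof');           % absolute offset from start of file
    sub(:, k) = fread(fid, dims(1), 'double');
end
fclose(fid);

assert(isequal(sub, S(:, ids)));         % same result as in-memory indexing
```

Because MATLAB writes matrices in column-major order, each column is one contiguous run of bytes, which is why selecting columns (rather than rows) is cheap with this layout.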

You can load specific variables from a .mat file which has several variables. However, I don't think you can load just a set of arbitrary indices from within a variable in MATLAB.

That said, if your problem is of the type where you need to access only specific rows/columns, then I might have a workaround for you.

You can create a struct from the matrix, with each column as a separate field and then save the .mat file with the -struct option so that each field gets saved as a separate variable. That way, you can pull out the one you want.

dummy=randn(100,200);%# this is a test matrix
[dim1,dim2]=size(dummy);

dummyCell=mat2cell(dummy,dim1,ones(dim2,1));%# create a cell from the matrix
fieldNames=strcat(repmat({'col'},1,dim2),cellfun(@num2str,mat2cell(1:dim2,1,ones(dim2,1)),'UniformOutput',false));%# generate fieldnames for the struct

dummyStruct=cell2struct(dummyCell,fieldNames,2);%# create the struct
save('myDummyFile','-struct','dummyStruct')

I'm not aware of a way to convert a matrix to a struct directly, so the matrix is first split into a cell array with one cell per column (dummyCell). Columns are used because you indicated that you need to access columns; if you need rows instead, you'll have to switch things around. To build the struct we also need field names, which are generated in the cell array of strings fieldNames in the form col1, col2, etc. You can use more meaningful names if you want. The cell array is then converted to a struct by assigning each cell to the corresponding field name. Lastly, the MAT-file is saved with the -struct option, which tells MATLAB to save each field as a separate variable. All of this should be done when your program saves the giant MAT-file. Now if you need to access, say, col52, all you need to do is load('myDummyFile','col52'). You can also load more than one variable if you need to.
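To make the loading side concrete, here is a small sketch that loads several columns from such a file and reassembles them into a matrix. It recreates a small myDummyFile.mat first so that it runs on its own; the loop with dynamic field names is just a compact way to build the struct for the demo, and the names ids, names, S and sub are illustrative only.

```matlab
% Recreate a small file in the same layout (one variable per column)
dummy = randn(100, 200);
dummyStruct = struct();
for k = 1:size(dummy, 2)
    dummyStruct.(sprintf('col%d', k)) = dummy(:, k);
end
save('myDummyFile', '-struct', 'dummyStruct');

% Load only the columns of interest
ids   = [52 7 130];
names = arrayfun(@(k) sprintf('col%d', k), ids, 'UniformOutput', false);
S     = load('myDummyFile', names{:});   % struct with fields col52, col7, col130

% Reassemble the selected columns in the requested order
sub = zeros(size(S.(names{1}), 1), numel(ids));
for k = 1:numel(ids)
    sub(:, k) = S.(names{k});
end

assert(isequal(sub, dummy(:, ids)));     % matches dummy(:, ids)
```

Calling load with an output argument keeps the loaded columns inside the struct S instead of creating one workspace variable per column.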

Remember, this works well if there is an order to your indexing requirements (i.e., whole rows or whole columns). If you need to access arbitrary indices in the matrix, this will not work. There might be some overhead in creating the cells and structs and saving them, but this will pay off if you save just once and access often.

If your matrix is huge (500x250000 isn't all that huge by today's standards), you'll have to watch out for memory issues with this approach, because the entire matrix is duplicated into a cell array and a struct. I wrote it step by step so that it is easier to follow, but you can reduce the duplication by creating the cell array from dummy and assigning it back to the same variable, and similarly for the struct. However, this only reduces the number of copies by one, as MATLAB still has to copy a variable in memory to assign it to itself after manipulation.