How should I store my large MATLAB data files during analysis?

I am having issues with 'data overload' while processing point cloud data in MATLAB. This is what I am currently doing:

  1. I begin with my raw data files, each on the order of ~30 MB.
  2. I then do initial processing on them to extract n individual objects and remove outlying points, which are all combined into a 1 x n structure, testset, saved into testset.mat (~100 MB).

    So far so good. Now things become complicated:

  3. For each point in each object in testset, I compute one of a number of features; the result for each point is a matrix. The size of the matrix, and some other properties of the computation, are parameters of the calculation. I save these computed features in a 1 x n cell array, each cell of which contains an array of the matrices for each point.

    I then save this cell array in a .mat file, whose name specifies the parameters, the name of the test data used, and the types of features extracted (a sketch follows this list). For example:

    testset_feature_type_A_5x5_0.2x0.2_alpha_3_beta_4.mat

  4. Now for each of these files, I then do some further processing (using a classification algorithm). Again there are more parameters to set.
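
Roughly, step 3 looks like this (computeFeatureA and the parameter values are hypothetical placeholders, not my actual code):

    % Sketch of step 3: compute per-point feature matrices for each
    % object, then encode the parameter settings in the file name.
    load('testset.mat', 'testset');      % the 1 x n structure from step 2

    params.window = [5 5];               % placeholder parameter values
    params.res    = [0.2 0.2];           % matching the example name above
    params.alpha  = 3;
    params.beta   = 4;

    n = numel(testset);
    features = cell(1, n);
    for k = 1:n
        % computeFeatureA stands in for the real feature computation;
        % it returns the array of per-point matrices for object k
        features{k} = computeFeatureA(testset(k), params);
    end

    % Build the file name from the parameter values and save
    fname = sprintf('testset_feature_type_A_%dx%d_%gx%g_alpha_%g_beta_%g.mat', ...
        params.window, params.res, params.alpha, params.beta);
    save(fname, 'features');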

So now I am in a tricky situation, where each final piece of the initial data has come through some path, but the path taken (and the parameters set along that path) are not intrinsically held with the data itself.

So my question is:

Is there a better way to do this? Can anyone who has experience in working with large datasets in MATLAB suggest a way to store the data and the parameter settings more efficiently, and more integrally?

Ideally, I would be able to look up a certain piece of data without having to use regex on the file strings—but there is also an incentive to keep individually processed files separate to save system memory when loading them in (and to help prevent corruption).

The time taken for each calculation (some ~2 hours) prohibits computing data 'on the fly'.


For a similar problem, I have created a class structure that does the following:

  • Each object is linked to a raw data file
  • For each processing step, there is a property
  • The set method of each such property saves the data to a file (in a directory with the same name as the raw data file), stores the file name, and updates a "status" property to indicate that this step is done.
  • The get method of each property loads the data if the file name has been stored and the status indicates "done".
  • Finally, the objects can be saved/loaded, so that I can do some processing now, save the object, later load it and I immediately know how far along the particular data set is in the processing pipeline.

Thus, the only data in memory is the data that is currently being worked on, and you can easily know which data set is at which processing stage. Furthermore, if you set up your methods to accept arrays of objects, you can do very convenient batch processing.
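
A minimal sketch of such a class, assuming a single processing step called features (the class name, property names, and file layout are all illustrative):

    classdef PipelineData < handle
        % Sketch of the lazy save/load class described above; in practice
        % there would be one dependent property per processing step.
        properties
            rawFile                              % path to the raw data file
            status = struct('features', false)   % which steps are done
        end
        properties (Access = private)
            featuresFile = ''                    % where the result is stored
        end
        properties (Dependent)
            features                             % loaded from disk on demand
        end
        methods
            function obj = PipelineData(rawFile)
                obj.rawFile = rawFile;
            end
            function set.features(obj, val)
                % Save the result in a directory named after the raw data
                % file, remember the file name, and mark the step as done.
                [p, name] = fileparts(obj.rawFile);
                outDir = fullfile(p, name);
                if ~exist(outDir, 'dir'), mkdir(outDir); end
                obj.featuresFile = fullfile(outDir, 'features.mat');
                save(obj.featuresFile, 'val');
                obj.status.features = true;
            end
            function val = get.features(obj)
                % Load from disk only if this step has been completed.
                if obj.status.features
                    s = load(obj.featuresFile, 'val');
                    val = s.val;
                else
                    val = [];
                end
            end
        end
    end

Because features is a Dependent property, saving the object itself stores only the raw-file path, the stored file name, and the status flags, not the bulky data.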


I'm not completely sure if this is what you need, but the save command allows you to store multiple variables inside a single .mat file. If your parameter settings are, for example, stored in an array, then you can save them together with the data set in a single .mat file. Upon loading the file, both the data set and the parameter array are restored.
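
For example (the variable and file names are illustrative):

    params = struct('window', [5 5], 'alpha', 3, 'beta', 4);
    save('testset_features.mat', 'features', 'params');   % both in one file

    s = load('testset_features.mat');   % restores s.features and s.params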

Or do you want to be able to load the parameters without loading the file? Then I would personally opt for the cheap solution of having a second set of files with just the parameters (but similar filenames).
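
A sketch of that cheap solution (file names again illustrative):

    save('testset_features.mat', 'features');        % the large data
    save('testset_features_params.mat', 'params');   % tiny companion file

Note that load also accepts variable names, so load('testset_features.mat', 'params') would pull just the parameters out of a combined file without loading the features.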
