I have 3000 binary files (each 40 MB) in a known format: 5,000,000 records of (int32, float32) each. They were created using numpy's tofile() method.
A method that I use, WhichShouldBeUpdated(), determines which file (out of the 3000) should be updated, and also which records in that file should be changed. The method's output is the following:
(1) path_to_file_name_to_update
(2) a numpy record array with N records (N is the number of records to update), in the following format: [(recordID1, newIntValue1, newFloatValue1), (recordID2, newIntValue2, newFloatValue2), ...] (see the sketch below)
As can be seen:
(1) the file to update is known only at run time
(2) the records to update are also known only at run time
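For illustration, here is a minimal sketch of such an update array; the field names are assumptions, not part of the original setup:

```python
import numpy as np

# Hypothetical dtype mirroring (recordID, newIntValue, newFloatValue) tuples.
update_dtype = np.dtype([('recordID', np.int64),
                         ('newIntValue', np.int32),
                         ('newFloatValue', np.float32)])

updates = np.array([(12, 7, 3.14), (9087, -1, 2.71)], dtype=update_dtype)
```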
What would be the most efficient approach to updating the file with the new values for these records?
Since the records are of fixed length, you can just open the file and seek to each record's position, which is the record index multiplied by the record size. To encode the ints and floats as binary you can use struct.pack.
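A minimal sketch of that approach, assuming little-endian int32/float32 records and 0-based record IDs (adjust the struct format string if your byte order differs):

```python
import struct

RECORD_SIZE = struct.calcsize('<if')  # int32 + float32 = 8 bytes, no padding

def update_records(path, updates):
    """updates: iterable of (recordID, newIntValue, newFloatValue) tuples."""
    with open(path, 'r+b') as f:  # read/write without truncating
        for record_id, new_int, new_float in updates:
            f.seek(record_id * RECORD_SIZE)
            f.write(struct.pack('<if', new_int, new_float))
```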
Update: given that the files were originally generated by numpy, the fastest way may be numpy.memmap.
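A sketch of the memmap variant under the same assumptions (the field names 'i' and 'f' are made up; use whatever matches your data):

```python
import numpy as np

record_dtype = np.dtype([('i', np.int32), ('f', np.float32)])

def update_records_memmap(path, updates):
    """updates: record array with recordID/newIntValue/newFloatValue fields."""
    mm = np.memmap(path, dtype=record_dtype, mode='r+')
    ids = updates['recordID']
    mm['i'][ids] = updates['newIntValue']  # fancy indexing writes in place
    mm['f'][ids] = updates['newFloatValue']
    mm.flush()  # make sure the changes hit the disk
```

Since the updates already arrive as a numpy record array, this also avoids a Python-level loop over the N records.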
You're probably not interested in data conversion, but I've had very good experiences with HDF5 and pytables for large binary files. HDF5 is designed for large scientific data sets, so it is quick and efficient.
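If conversion ever becomes an option, a rough sketch of the PyTables route might look like this (the file name, node name, and field names are all assumptions):

```python
import numpy as np
import tables

record_dtype = np.dtype([('i', np.int32), ('f', np.float32)])

# One-time conversion of a raw binary file into an HDF5 table.
with tables.open_file('data.h5', mode='w') as h5:
    raw = np.fromfile('raw_file.bin', dtype=record_dtype)
    table = h5.create_table('/', 'records', description=record_dtype,
                            expectedrows=len(raw))
    table.append(raw)

# Later: rewrite selected rows in place.
with tables.open_file('data.h5', mode='a') as h5:
    table = h5.root.records
    rows = np.array([(7, 1.5), (42, -2.0)], dtype=record_dtype)
    table.modify_coordinates([7, 42], rows)
```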