Splitting files fast in Python and getting their md5


I'm trying to split a file into smaller pieces of roughly 300 kilobytes each. This is quite slow for a 300-megabyte file (roughly 1000 pieces).

I'm not using any threading yet; I'm not sure if that would make it run any faster.

    maxsize = 300 * 1024   # target size of each piece (~300 KB)
    cs = 1
    pieces = 1000

    # Open the file
    f = open(self.file, 'rb')
    result = {}

    while cs <= pieces:

        #Filename
        filename = str(cs).zfill(5) + '.split'

        # Generate temporary filename
        tfile = filename

        # Open the temporary file
        w = open(tfile, 'wb')

        # Read the next chunk from the source file
        tdata = f.read(maxsize)

        # Write the data
        w.write(tdata)

        # Close the file
        w.close()

        # Get the hash of this chunk
        result[filename] = self.__md5(tfile)

        cs += 1

    # Close the source file once all pieces have been written
    f.close()

This is the md5 function:

def __md5(self, f, block_size=2**20):

    # hashlib is assumed to be imported at module level
    md5 = hashlib.md5()
    with open(f, 'rb') as fh:
        while True:
            data = fh.read(block_size)
            if not data:
                break
            md5.update(data)
    return md5.hexdigest()

So is there any way to speed things up?


You're reading the chunk, saving it to a temporary file, then reading the temporary file and computing its md5. That's unnecessary, though - you can compute the md5 while the chunk is still in memory. That means you won't have to open the temp file and read it, which should be faster.

Also, I'd recommend a smaller block size - maybe 2^11 or 2^12 bytes.
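
A minimal sketch of that idea (the function name split_and_hash, the path parameter, and the 300 KB default piece size are placeholders, not part of your class): write each chunk and hash the bytes that are already in memory, so the temporary file is never re-read.

    import hashlib

    def split_and_hash(path, piece_size=300 * 1024):
        """Split path into numbered .split pieces, return {piece filename: md5 hex digest}."""
        result = {}
        cs = 1
        with open(path, 'rb') as f:
            while True:
                tdata = f.read(piece_size)
                if not tdata:
                    break
                filename = str(cs).zfill(5) + '.split'
                with open(filename, 'wb') as w:
                    w.write(tdata)
                # Hash the chunk that is already in memory instead of
                # re-reading the temporary file from disk.
                result[filename] = hashlib.md5(tdata).hexdigest()
                cs += 1
        return result

Since each piece is only about 300 KB and is hashed directly from memory, the separate block-size tuning inside __md5 is no longer needed in this variant.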
