开发者

Max limit of bytes in method update of Hashlib Python module

开发者 https://www.devze.com 2023-02-10 02:03 出处:网络
I am trying to compute md5 hash of a file with the function hashlib.md5() from hashlib module. So that I writed this piece of code:

I am trying to compute md5 hash of a file with the function hashlib.md5() from hashlib module.

So that I writed this piece of code:

Buffer = 128
f = open("c:\\file.tct", "rb")
m = hashlib.md5()

while True:
   p = f.read(Buffer)
   if len(p) != 0:
      m.update(p)
   else:
      break
print m.hexdigest()
f.close()

I noted the function update is faster if I increase Buffer variable value wit开发者_开发百科h 64, 128, 256 and so on. There is a upper limit I cannot exceed? I suppose it might only a RAM memory problem but I don't know.


Big (≈2**40) chunk sizes lead to MemoryError i.e., there is no limit other than available RAM. On the other hand bufsize is limited by 2**31-1 on my machine:

import hashlib
from functools import partial

def md5(filename, chunksize=2**15, bufsize=-1):
    m = hashlib.md5()
    with open(filename, 'rb', bufsize) as f:
        for chunk in iter(partial(f.read, chunksize), b''):
            m.update(chunk)
    return m

Big chunksize can be as slow as a very small one. Measure it.

I find that for ≈10MB files the 2**15 chunksize is the fastest for the files I've tested.


To be able to handle arbitrarily large files you need to read them in blocks. The size of such blocks should preferably be a power of 2, and in the case of md5 the minimum possible block consists of 64 bytes (512 bits) as 512-bit blocks are the units on which the algorithm operates.

But if we go beyond that and try to establish an exact criterion whether, say 2048-byte block is better than 4096-byte block... we will likely fail. This needs to be very carefully tested and measured, and almost always the value is being chosen at will, judging from experience.


The buffer value is the number of bytes that is read and stored in memory at once, so yes, the only limit is your available memory.

However, bigger values are not automatically faster. At some point, you might run into memory paging issues or other slowdowns with memory allocation if the buffer is too large. You should experiment with larger and larger values until you hit diminishing returns in speed.

0

精彩评论

暂无评论...
验证码 换一张
取 消