If it is environment-independent, what is the theoretical maximum number of characters in a Python string?
With a 64-bit Python installation, and (say) 64 GB of memory, a Python string of around 63 GB should be quite feasible, if not maximally fast. If you can upgrade your memory beyond 64 GB, your maximum feasible strings should get proportionally longer. (I don't recommend relying on virtual memory to extend that by much, or your runtimes will get simply ridiculous;-).
With a typical 32-bit Python installation, the total memory you can use in your application is limited to something like 2 or 3 GB (depending on OS and configuration), so the longest strings you can use will be much smaller than in 64-bit installations with high amounts of RAM.
I ran this code on an x2iedn.16xlarge EC2 instance, which has 2048 GiB (2.2 TB) of RAM:
>>> one_gigabyte = 1_000_000_000
>>> my_str = 'A' * (2000 * one_gigabyte)
It took a couple minutes but I was able to allocate a 2TB string on Python 3.10 running on Ubuntu 22.04.
>>> import sys
>>> sys.getsizeof(my_str)
2000000000049
>>> my_str
'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...
The last line actually hangs, but it would print 2 trillion `A`s.
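The 49 extra bytes reported by `sys.getsizeof` are the object header of a compact ASCII string. The exact overhead is a CPython implementation detail that can vary between versions, but you can verify at small scale that a pure-ASCII string costs its length plus a fixed header:

```python
import sys

# Header overhead of a compact ASCII string
# (49 bytes on 64-bit CPython 3.10; treat the exact value as an
# implementation detail rather than a guarantee).
header = sys.getsizeof('')

# For pure-ASCII strings, total size grows by exactly 1 byte per character.
for n in (1, 100, 1_000_000):
    s = 'A' * n
    assert sys.getsizeof(s) == header + n
    print(n, sys.getsizeof(s))
```

This is the same arithmetic as the 2 TB experiment above: 2,000,000,000,000 characters plus a 49-byte header gives the reported 2000000000049 bytes.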
9 quintillion characters on a 64 bit system on CPython 3.10.
That's only if your string is made up of only ASCII characters. The max length can be smaller depending on what characters the string contains due to the way CPython implements strings:
- 9,223,372,036,854,775,758 characters if your string only has ASCII characters (`U+00` to `U+7F`), or
- 9,223,372,036,854,775,734 characters if your string only has ASCII characters and characters from the Latin-1 Supplement Unicode block (`U+80` to `U+FF`), or
- 4,611,686,018,427,387,866 characters if your string only contains characters in the Basic Multilingual Plane (for example if it contains Cyrillic letters but no emojis, i.e. `U+0100` to `U+FFFF`), or
- 2,305,843,009,213,693,932 characters if your string might contain at least one emoji (more formally, if it can contain a character outside the Basic Multilingual Plane, i.e. `U+10000` and above).
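These four limits are just `sys.maxsize` (2⁶³ - 1, the maximum byte length) minus the per-string header of each of CPython's internal representations, divided by the bytes per character. The header sizes below (49, 73, 74 and 76 bytes) are what `sys.getsizeof` reports on 64-bit CPython 3.10 and are implementation details, but under that assumption the arithmetic reproduces the numbers above exactly:

```python
import sys

max_bytes = sys.maxsize  # 2**63 - 1 on a 64-bit build

# (header bytes observed via sys.getsizeof on 64-bit CPython 3.10,
#  bytes per character for that representation)
kinds = {
    'ASCII (U+00..U+7F)':       (49, 1),
    'Latin-1 (U+80..U+FF)':     (73, 1),
    'BMP (U+0100..U+FFFF)':     (74, 2),
    'non-BMP (U+10000 and up)': (76, 4),
}

for name, (header, width) in kinds.items():
    max_chars = (max_bytes - header) // width
    print(f'{name}: {max_chars:,} characters')
```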
On a 32 bit system it's around 2 billion or 500 million characters. If you don't know whether you're using a 64 bit or a 32 bit system or what that means, you're probably using a 64 bit system.
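If you want to check programmatically, the size of a C pointer (or equivalently `sys.maxsize`) tells you whether your interpreter is a 64-bit build:

```python
import struct
import sys

bits = struct.calcsize('P') * 8  # size of a C pointer, in bits
print(f'{bits}-bit Python build')

# Equivalent check: sys.maxsize is 2**63 - 1 on 64-bit builds
# and 2**31 - 1 on 32-bit builds.
print('64-bit' if sys.maxsize > 2**32 else '32-bit')
```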
Python strings are length-prefixed, so their length is limited by the size of the integer holding their length and the amount of memory available on your system. Since PEP 353, Python uses `Py_ssize_t` as the data type for storing container length. `Py_ssize_t` is defined as the same size as the compiler's `size_t` but signed. On a 64 bit system, `size_t` is 64 bits. 1 bit for the sign means you have 63 bits for the actual quantity, so CPython strings cannot be larger than 2⁶³ - 1 bytes, or around 9 million TB (8 EiB). This much RAM would cost you around 40 billion dollars if we multiply today's price of around $4/GB by 9 billion. On 32-bit systems (which are rare these days), the limit is 2³¹ - 1 bytes, or 2 GiB.
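You don't need terabytes of RAM to see the `Py_ssize_t` limit in action: asking for a string even one byte longer than `sys.maxsize` fails immediately with an `OverflowError`, before CPython attempts any allocation:

```python
import sys

try:
    'A' * (sys.maxsize + 1)  # one byte past the Py_ssize_t limit
except OverflowError as exc:
    print(exc)  # e.g. "cannot fit 'int' into an index-sized integer"
```

Asking for exactly `sys.maxsize` characters, by contrast, is a legal length, so CPython would actually try to allocate it and you would most likely get a `MemoryError` instead.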
CPython will use 1, 2 or 4 bytes per character, depending on how many bytes it needs to encode the "longest" character in your string. So for example if you have a string like `'aaaaaaaaa'`, the `a`'s each take 1 byte to store, but if that same string also contains a single emoji, CPython stores every character in it using 4 bytes.
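You can watch that widening happen with `sys.getsizeof`: appending one emoji to a pure-ASCII string roughly quadruples its storage, because every existing character is re-stored at 4 bytes each (this sketches PEP 393's flexible string representation; exact header sizes vary by CPython version):

```python
import sys

ascii_str = 'a' * 1000
mixed_str = ascii_str + '\U0001F600'  # append a single emoji (U+1F600)

print(sys.getsizeof(ascii_str))  # ~1 byte per character, plus a small header
print(sys.getsizeof(mixed_str))  # ~4 bytes per character for ALL 1001 chars
```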