String identity comparison in CPython

I recently discovered a potential bug in a production system where two strings were compared using the identity operator, e.g.:

if val[2] is not 's':

I imagine this often works anyway because, as far as I know, CPython stores short immutable strings in the same location. I've replaced it with !=, but I need to confirm that the data that previously went through this code is correct, so I'd like to know whether this always worked or only sometimes worked.

The Python version has always been 2.6.6 as far as I know and the above code seems to be the only place where the is operator was used.

Does anyone know if this line will always work as the programmer intended?

edit: Because this is no doubt very specific and unhelpful to future readers, I'll ask a different question:

Where should I look to confirm with absolute certainty the behaviour of the Python implementation? Are the optimisations in CPython's source code easy to digest? Any tips?

You can look at the CPython code for 2.6.x: http://svn.python.org/projects/python/branches/release26-maint/Objects/stringobject.c

It looks like one-character strings are treated specially, and each distinct one exists only once, so your code is safe. Here's some key code (excerpted):

static PyStringObject *characters[UCHAR_MAX + 1];

PyObject *
PyString_FromStringAndSize(const char *str, Py_ssize_t size)
{
    register PyStringObject *op;
    if (size == 1 && str != NULL &&
        (op = characters[*str & UCHAR_MAX]) != NULL)
    {
        Py_INCREF(op);
        return (PyObject *)op;
    }

...
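As a rough model of what that lookup does, here is a minimal Python sketch of a single-character cache. It is not CPython's code: `FakeString` and `from_string_and_size` are hypothetical names used only for illustration, and the step that fills an empty slot is an assumption about the part of the C function that was elided above.

```python
UCHAR_MAX = 255

class FakeString(object):
    """Hypothetical stand-in for PyStringObject, for illustration only."""
    def __init__(self, data):
        self.data = data

# One slot per possible byte value, like `characters[UCHAR_MAX + 1]`.
characters = [None] * (UCHAR_MAX + 1)

def from_string_and_size(data):
    """Return a FakeString, sharing one object per single character."""
    if len(data) == 1 and ord(data) <= UCHAR_MAX:
        index = ord(data) & UCHAR_MAX
        op = characters[index]
        if op is None:
            # Assumption: the elided part of the C function fills the slot
            # the first time a given character is seen.
            op = characters[index] = FakeString(data)
        return op
    # Anything longer than one character gets a fresh object every time.
    return FakeString(data)

first = from_string_and_size('s')
second = from_string_and_size('s')
print(first is second)   # both calls return the cached object
print(from_string_and_size('ss') is from_string_and_size('ss'))
```

In this model an identity test on one-character values only succeeds because both objects came through the same caching constructor; anything built another way would not share the slot.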

You are certainly not supposed to use the is/is not operators when you only want to compare the values of two objects rather than check whether they are the same object.
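To make the distinction concrete, here is a small sketch written to run on both Python 2 and 3 (the variable names are only illustrative). It builds a string at run time so that it has the same value as a literal without being written as a second literal. Whether the two end up as the same object is an implementation detail, so only the == result should be relied on.

```python
# Build an equal string at run time instead of writing a second literal.
literal = 'spam and eggs'
built = ' '.join(['spam', 'and', 'eggs'])

values_equal = (literal == built)   # compares the contents
same_object = (literal is built)    # compares object identity

print(values_equal)
print(same_object)
```

The first result is guaranteed by the language; the second depends on how the interpreter happened to allocate the strings.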

It might seem sensible for Python never to create a new string object with the same contents as an existing one (strings are immutable, after all), which would make equality and identity equivalent, but I wouldn't rely on that, especially with the many Python implementations out there.

As people have already noted, it should always be true for strings created in Python (or CPython, anyway), but if you're using a C extension, it won't be the case.

As a quick counter-example:

import numpy as np

x = 's'
y = np.array(['s'], dtype='|S1')

print x
print y[0]

print 'x is y[0] -->', x is y[0]
print 'x == y[0] -->', x == y[0]

This yields:

s
s
x is y[0] --> False
x == y[0] --> True

Of course, if nothing ever used a C extension of any sort, you're probably safe... I wouldn't count on it, though...

Edit: As an even simpler example, it doesn't hold if things have been pickled or packed with struct in any way.

e.g.:

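import pickle

x = 's'
# Round-trip the string through a pickle file, closing each handle.
with open('test', 'wb') as f:
    pickle.dump(x, f)
with open('test', 'rb') as f:
    y = pickle.load(f)

print x is y   # identity
print x == y   # equality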

Also (using a different letter for clarity, as we need "s" for the format string):

import struct
x = 'a'
y = struct.pack('s', x)

print x is y
print x == y

This behavior always applies to empty and single-character Latin-1 strings. From unicodeobject.c:

PyObject *PyUnicode_FromUnicode(const Py_UNICODE *u,
                                Py_ssize_t size)
{
.....
    /* Single character Unicode objects in the Latin-1 range are
       shared when using this constructor */
    if (size == 1 && *u < 256) {
        unicode = unicode_latin1[*u];

This snippet is from Python 3, but it's likely that a similar optimization exists in earlier versions.

Granted, it works because of automatic short-string interning (the literal 's' is interned like other constants in Python source), but it's quite silly to use identity here.

Python is about duck typing: any object that looks like a string could be used. For example, the same code fails if val[2] is actually u"s".
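A version-independent way to see the same point is a str subclass. This is only a sketch: `Label` is a hypothetical class standing in for "any object that looks like a string", and the snippet is written to run on both Python 2 and 3.

```python
class Label(str):
    """Hypothetical str subclass; behaves like a string everywhere."""

expected = 's'
value = Label('s')

values_equal = (value == expected)        # True: same characters
different_object = (value is not expected)  # True: a Label is never the plain str object

print(values_equal)
print(different_object)
```

Code written with != treats `value` as equal to 's', while the original `is not` test would treat it as different.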