I have a question about inner-workings of Python sub-interpreter initialization (from Python/C API) and Python id()
function. More precisely, about handling of global module objects in a WSGI Python containers (like uWSGI used with nginx and mod_wsgi on Apa开发者_如何学运维che).
The following code works as expected (isolated) in both of the mentioned environments, but I can not explain to my self why the id()
function always returns the same value per variable, regardless of the process/sub-interpreter in which it is executed.
from __future__ import print_function
import os, sys
def log(*msg):
print(">>>", *msg, file=sys.stderr)
class A:
def __init__(self, x):
self.x = x
def __str__(self):
return self.x
def set(self, x):
self.x = x
a = A("one")
log("class instantiated.")
def application(environ, start_response):
output = "pid = %d\n" % os.getpid()
output += "id(A) = %d\n" % id(A)
output += "id(a) = %d\n" % id(a)
output += "str(a) = %s\n\n" % a
a.set("two")
status = "200 OK"
response_headers = [
('Content-type', 'text/plain'), ('Content-Length', str(len(output)))
]
start_response(status, response_headers)
return [output]
I have tested this code in uWSGI with one master process and 2 workers; and in mod_wsgi using a deamon mode with two processes and one thread per process. The typical output is:
pid = 15278
id(A) = 139748093678128 id(a) = 139748093962360 str(a) = one
on first load, then:
pid = 15282
id(A) = 139748093678128 id(a) = 139748093962360 str(a) = one
on second, and then
pid = 15278 | pid = 15282
id(A) = 139748093678128 id(a) = 139748093962360 str(a) = two
on every other. As you can see, id()
(memory location) of both the class and the class instance remains the same in both processes (first/second load above), while at the same time class instances live in a separate context (otherwise the second request would show "two" instead of "one")!
I suspect the answer might be hinted by Python docs:
id(object)
:Return the “identity” of an object. This is an integer (or long integer) which is guaranteed to be unique and constant for this object during its lifetime. Two objects with non-overlapping lifetimes may have the same
id()
value.
But if that indeed is the reason, I'm troubled by the next statement that claims the id()
value is object's address!
While I appreciate the fact this could very well be just a Python/C API "clever" feature that solves (or rather fixes) a problem of caching object references (pointers) in 3rd party extension modules, I still find this behavior to be inconsistent with... well, common sense. Could someone please explain this?
I've also noticed mod_wsgi imports the module in each process (i.e. twice), while uWSGI is importing the module only once for both processes. Since the uWSGI master process does the importing, I suppose it seeds the children with copies of that context. Both workers work independently afterwards (deep copy?), while at the same time using the same object addresses, seemingly. (Also, a worker gets reinitialized to the original context upon reload.)
I apologize for such a long post, but I wanted to give enough details. Thank you!
It's not entirely clear what you're asking; I'd give a more concise answer if the question was more specific.
First, the id of an object is, in fact--at least in CPython--its address in memory. That's perfectly normal: two objects in the same process at the same time can't share an address, and an object's address never changes in CPython, so the address works neatly as an id. I don't know how this violates common sense.
Next, note that a backend process may be spawned in two very distinct ways:
- A generic WSGI backend handler will fork processes, and then each of the processes will start a backend. This is simple and language-agnostic, but wastes a lot of memory and wastes time loading the backend code repeatedly.
- A more advanced backend will load the Python code once, and then fork copies of the server after it's loaded. This causes the code to be loaded only once, which is much faster and reduces memory waste significantly. This is how production-quality WSGI servers work.
However, the end result in both of these cases is identical: separate, forked processes.
So, why are you ending up with the same IDs? That depends on which of the above methods is in use.
- With a generic WSGI handler, it's happening simply because each process is doing essentially the same thing. So long as processes are doing the same thing, they'll tend to end up with the same IDs; at some point they'll diverge and this will no longer happen.
- With a pre-loading backend, it's happening because this initial code happens only once, before the server forks, so it's guaranteed to have the same ID.
However, either way, once the fork happens they're separate objects, in separate contexts. There's no significance to objects in separate processes having the same ID.
This is simple to explain by way of a demonstration. You see, when uwsgi creates a new process, it forks the interpreter. Now, forks have interesting memory properties:
import os, time
if os.fork() == 0:
print "child first " + str(hex(id(os)))
time.sleep(2)
os.attr = 'test'
print "child second " + str(hex(id(os)))
else:
time.sleep(1)
print "parent first " + str(hex(id(os)))
time.sleep(2)
print "parent second " + str(hex(id(os)))
print os.attr
Output:
child first 0xb782414cL
parent first 0xb782414cL
child second 0xb782414cL
parent second 0xb782414cL
Traceback (most recent call last):
File "test.py", line 13, in <module>
print os.attr
AttributeError: 'module' object has no attribute 'attr'
Although the objects seem to reside at the same memory addr, they are different objects, but this is not python, but the os.
edit: I suspect the reason that mod_wsgi imports twice is that it creates further processes via calling python rather than forking. uwsgi's approach is better because it can use less memory. fork's page sharing is COW (copy on write).
精彩评论