I'm using the Django database models from a process that's not called from an HTTP request. The process is supposed to poll for new data every few seconds and do some processing on it. I have a loop that sleeps for a few seconds and then gets all unhandled data from the database.
What I'm seeing is that after the first fetch, the process never sees any new data. I ran a few tests and it looks like Django is caching results, even though I'm building new QuerySets every time. To verify this, I did this from a Python shell:
>>> MyModel.objects.count()
885
# (Here I added some more data from another process.)
>>> MyModel.objec开发者_运维技巧ts.count()
885
>>> MyModel.objects.update()
0
>>> MyModel.objects.count()
1025
As you can see, adding new data doesn't change the result count. However, calling the manager's update() method seems to fix the problem.
I can't find any documentation on that update() method and have no idea what other bad things it might do.
My question is, why am I seeing this caching behavior, which contradicts what Django docs say? And how do I prevent it from happening?
Having had this problem and found two definitive solutions for it I thought it worth posting another answer.
This is a problem with MySQL's default transaction mode. Django opens a transaction at the start, which means that by default you won't see changes made in the database.
Demonstrate like this
Run a django shell in terminal 1
>>> MyModel.objects.get(id=1).my_field
u'old'
And another in terminal 2
>>> MyModel.objects.get(id=1).my_field
u'old'
>>> a = MyModel.objects.get(id=1)
>>> a.my_field = "NEW"
>>> a.save()
>>> MyModel.objects.get(id=1).my_field
u'NEW'
>>>
Back to terminal 1 to demonstrate the problem - we still read the old value from the database.
>>> MyModel.objects.get(id=1).my_field
u'old'
Now in terminal 1 demonstrate the solution
>>> from django.db import transaction
>>>
>>> @transaction.commit_manually
... def flush_transaction():
... transaction.commit()
...
>>> MyModel.objects.get(id=1).my_field
u'old'
>>> flush_transaction()
>>> MyModel.objects.get(id=1).my_field
u'NEW'
>>>
The new data is now read
Here is that code in an easy to paste block with docstring
from django.db import transaction
@transaction.commit_manually
def flush_transaction():
"""
Flush the current transaction so we don't read stale data
Use in long running processes to make sure fresh data is read from
the database. This is a problem with MySQL and the default
transaction mode. You can fix it by setting
"transaction-isolation = READ-COMMITTED" in my.cnf or by calling
this function at the appropriate moment
"""
transaction.commit()
The alternative solution is to change my.cnf for MySQL to change the default transaction mode
transaction-isolation = READ-COMMITTED
Note that that is a relatively new feature for Mysql and has some consequences for binary logging / slaving. You could also put this in the django connection preamble if you wanted.
Update 3 years later
Now that Django 1.6 has turned on autocommit in MySQL this is no longer a problem. The example above now works fine without the flush_transaction()
code whether your MySQL is in REPEATABLE-READ
(the default) or READ-COMMITTED
transaction isolation mode.
What was happening in previous versions of Django which ran in non autocommit mode was that the first select
statement opened a transaction. Since MySQL's default mode is REPEATABLE-READ
this means that no updates to the database will be read by subsequent select
statements - hence the need for the flush_transaction()
code above which stops the transaction and starts a new one.
There are still reasons why you might want to use READ-COMMITTED
transaction isolation though. If you were to put terminal 1 in a transaction and you wanted to see the writes from the terminal 2 you would need READ-COMMITTED
.
The flush_transaction()
code now produces a deprecation warning in Django 1.6 so I recommend you remove it.
We've struggled a fair bit with forcing django to refresh the "cache" - which it turns out wasn't really a cache at all but an artifact due to transactions. This might not apply to your example, but certainly in django views, by default, there's an implicit call to a transaction, which mysql then isolates from any changes that happen from other processes ater you start.
we used the @transaction.commit_manually
decorator and calls to transaction.commit()
just before every occasion where you need up-to-date info.
As I say, this definitely applies to views, not sure whether it would apply to django code not being run inside a view.
detailed info here:
http://devblog.resolversystems.com/?p=439
I'm not sure I'd recommend it...but you can just kill the cache yourself:
>>> qs = MyModel.objects.all()
>>> qs.count()
1
>>> MyModel().save()
>>> qs.count() # cached!
1
>>> qs._result_cache = None
>>> qs.count()
2
And here's a better technique that doesn't rely on fiddling with the innards of the QuerySet: Remember that the caching is happening within a QuerySet, but refreshing the data simply requires the underlying Query to be re-executed. The QuerySet is really just a high-level API wrapping a Query object, plus a container (with caching!) for Query results. Thus, given a queryset, here is a general-purpose way of forcing a refresh:
>>> MyModel().save()
>>> qs = MyModel.objects.all()
>>> qs.count()
1
>>> MyModel().save()
>>> qs.count() # cached!
1
>>> from django.db.models import QuerySet
>>> qs = QuerySet(model=MyModel, query=qs.query)
>>> qs.count() # refreshed!
2
>>> party_time()
Pretty easy! You can of course implement this as a helper function and use as needed.
If you append .all()
to a queryset, it'll force a reread from the DB. Try
MyModel.objects.all().count()
instead of MyModel.objects.count()
.
Seems like the count()
goes to cache after the first time. This is the django source for QuerySet.count:
def count(self):
"""
Performs a SELECT COUNT() and returns the number of records as an
integer.
If the QuerySet is already fully cached this simply returns the length
of the cached results set to avoid multiple SELECT COUNT(*) calls.
"""
if self._result_cache is not None and not self._iter:
return len(self._result_cache)
return self.query.get_count(using=self.db)
update
does seem to be doing quite a bit of extra work, besides what you need.
But I can't think of any better way to do this, short of writing your own SQL for the count.
If performance is not super important, I would just do what you're doing, calling update
before count
.
QuerySet.update:
def update(self, **kwargs):
"""
Updates all elements in the current QuerySet, setting all the given
fields to the appropriate values.
"""
assert self.query.can_filter(), \
"Cannot update a query once a slice has been taken."
self._for_write = True
query = self.query.clone(sql.UpdateQuery)
query.add_update_values(kwargs)
if not transaction.is_managed(using=self.db):
transaction.enter_transaction_management(using=self.db)
forced_managed = True
else:
forced_managed = False
try:
rows = query.get_compiler(self.db).execute_sql(None)
if forced_managed:
transaction.commit(using=self.db)
else:
transaction.commit_unless_managed(using=self.db)
finally:
if forced_managed:
transaction.leave_transaction_management(using=self.db)
self._result_cache = None
return rows
update.alters_data = True
You can also use MyModel.objects._clone().count().
All of the methods in the the QuerySet
call _clone()
prior to doing any work - that ensures that any internal caches are invalidated.
The root cause is that MyModel.objects
is the same instance each time. By cloning it you're creating a new instance without the cached value. Of course, you can always reach in and invalidate the cache if you'd prefer to use the same instance.
精彩评论