I have a web scraping script that gets new data once every minute, but over the course of a couple of days the script ends up using 200 MB or more of memory, and I found out it's because mechanize is keeping an infinite browser history for the .back() function to use.
I looked through the docstrings and found the browser class's clear_history() method, which I now call on every refresh, but memory usage still climbs by 2-3 MB per page refresh.
Edit: It kept doing the same thing after I called clear_history(), up until it reached about 30 MB of memory usage, then it dropped back down to 10 MB or so (the baseline my program starts with)... is there any way to force that behavior on a more regular basis?
How do I keep mechanize from storing all of this info? I don't need any of it, and I'd like to keep my Python script below 15 MB of memory usage.
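For context, a minimal sketch of the setup described above, assuming a simple fetch loop (the URL, parsing step, and interval are placeholders, not from the original script):

import time
import mechanize

br = mechanize.Browser()

while True:
    resp = br.open("http://example.com/data")  # placeholder URL for the page being scraped
    data = resp.read()                         # parse/store the new data here
    br.clear_history()                         # discard stored responses after each fetch
    time.sleep(60)                             # one fetch per minute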
You can pass an argument history=whatever when you instantiate the Browser; the default value is None, which means the browser actually instantiates the History class (to allow back and reload). The simplest approach (it will raise an AttributeError if you ever do call back or reload):
import mechanize

class NoHistory(object):
    def add(self, *a, **k): pass   # called by the browser for each visited page; store nothing
    def clear(self): pass          # nothing is stored, so nothing to clear

b = mechanize.Browser(history=NoHistory())
A cleaner approach would implement other methods in NoHistory to give clearer exceptions on erroneous use of the browser's back or reload, but this simple one should suffice otherwise.
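A hedged sketch of that cleaner variant; the extra method here is illustrative (its exact signature in mechanize's History interface is an assumption), but it turns a bare AttributeError into an explicit message:

import mechanize

class NoHistory(object):
    def add(self, *a, **k):
        pass  # discard every visited page instead of storing it
    def clear(self, *a, **k):
        pass  # nothing is stored, so there is nothing to clear
    def back(self, *a, **k):
        # fail loudly and clearly instead of with a bare AttributeError
        raise RuntimeError("history is disabled; Browser.back() is unavailable")

b = mechanize.Browser(history=NoHistory())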
Note that this is an elegant (though not well documented;-) use of the dependency injection design pattern: in a (bleah) "monkeypatching" world, the client code would be expected to overwrite b._history after the browser is instantiated, but with dependency injection you just pass in the "history" object you want to use. I've often maintained that Dependency Injection may be the most important DP that wasn't in the "gang of 4" book!-).
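To make the contrast concrete, a small sketch (NoHistory as defined above; _history is the private attribute named in the answer, so relying on it is fragile):

# monkeypatching: reach into a private attribute after construction
b = mechanize.Browser()
b._history = NoHistory()

# dependency injection: hand the browser the history object up front
b = mechanize.Browser(history=NoHistory())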