I have a Python script which goes off and makes a number of HTTP and urllib requests to various domains.
We have a huge number of domains to process and need to do this as quickly as possible. As HTTP requests are slow (i.e. they could time out if there is no website on the domain), I run a number of the scripts at any one time, feeding them from a domain list in the database.
The problem I see is over a period of time (a few hours to 24 hours) the scripts all start to slow down and ps -al shows they are sleeping.
The servers are very powerful (8 cores, 72GB RAM, 6TB RAID 6, 80MB 2:1 connection, etc.) and are never maxed out, e.g. free -m shows:
-/+ buffers/cache: 61157 11337
Swap: 4510 195 4315
top shows between 80-90% idle
sar -d shows average 5.3% util
and more interestingly iptraf starts off at around 50-60MB/s and ends up 8-10MB/s after about 4 hours.
I am currently running around 500 instances of the script on each server (2 servers), and they both show the same problem.
ps -al
shows that most of the Python scripts are sleeping, and I don't understand why,
for instance:
0 S 0 28668 2987 0 80 0 - 71003 sk_wai pts/2 00:00:03 python
0 S 0 28669 2987 0 80 0 - 71619 inet_s pts/2 00:00:31 python
0 S 0 28670 2987 0 80 0 - 70947 sk_wai pts/2 00:00:07 python
0 S 0 28671 2987 0 80 0 - 71609 poll_s pts/2 00:00:29 python
0 S 0 28672 2987 0 80 0 - 71944 poll_s pts/2 00:00:31 python
0 S 0 28673 2987 0 80 0 - 71606 poll_s pts/2 00:00:26 python
0 S 0 28674 2987 0 80 0 - 71425 poll_s pts/2 00:00:20 python
0 S 0 28675 2987 0 80 0 - 70964 sk_wai pts/2 00:00:01 python
0 S 0 28676 2987 0 80 0 - 71205 inet_s pts/2 00:00:19 python
0 S 0 28677 2987 0 80 0 - 71610 inet_s pts/2 00:00:21 python
0 S 0 28678 2987 0 80 0 - 71491 inet_s pts/2 00:00:22 python
There is no sleep call in the script that gets executed, so I can't understand why ps -al shows most of them asleep, or why they should get slower and slower, making fewer requests over time, when CPU, memory, disk access and bandwidth are all available in abundance.
If anyone could help I would be very grateful.
EDIT:
The code is massive, as I am using exceptions throughout it to capture diagnostics about each domain, i.e. reasons I can't connect. I will post the code somewhere if needed, but the fundamental calls via httplib and urllib are straight from the Python examples.
More info:
Both
quota -u mysql
quota -u root
come back with nothing.
ulimit -n comes back with 1024. I have changed limits.conf to allow the mysql user 16000 soft and hard open files, and I am able to run over 2000 scripts so far, but I still see the problem.
SOME PROGRESS
OK, so I have changed all the limits for the user and ensured all sockets are closed (they were not). Although things are better, I am still getting a slowdown, though not as bad.
Interestingly, I have also noticed a memory leak: the scripts use more and more memory the longer they run, but I am not sure what is causing it. I store output data in a string and then print it to the terminal after every iteration, and I do clear the string at the end, but could the ever-increasing memory be down to the terminal storing all the output?
Edit: no, it seems not. I ran 30 scripts without outputting to the terminal and still saw the same leak. I'm not using anything clever (just strings, httplib and urllib), so I wonder if there are any issues with the Python MySQL connector...?
Check the ulimit and quota for the box and the user running the scripts. /etc/security/limits.conf may also contain resource restrictions that you might want to modify.
ulimit -n will show the max number of open file descriptors allowed.
- Might this have been exceeded with all of the open sockets?
- Is the script closing each socket when it's done with it?
You can also check the fd's with ls -l /proc/[PID]/fd/, where [PID] is the process id of one of the scripts.
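If you'd rather watch the fd count from inside Python than run ls by hand, here's a minimal sketch (it assumes Linux and its /proc layout; the open_fd_count helper name is made up for illustration):
import os

def open_fd_count(pid='self'):
    # Count the file descriptors the given process currently has open
    # by listing /proc/<pid>/fd (Linux-specific).
    return len(os.listdir('/proc/%s/fd' % pid))

print open_fd_count()          # this process
# print open_fd_count(28668)   # or one of the script PIDs from ps -al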
Would need to see some code to tell what's really going on.
Edit (Importing comments and more troubleshooting ideas):
Can you show the code where you're opening and closing the connections?
When just a few script processes are running, do they too start to go idle after a while? Or does this only happen when there are several hundred or more running at once?
Is there a single parent process that starts all of these scripts?
If you're using s = urllib2.urlopen(someURL), make sure to s.close() when you're done with it. Python can often close things down for you (like when you do x = urllib2.urlopen(someURL).read()), but it will leave that to you if told to (such as when you assign the return value of .urlopen() to a variable). Double check your opening and closing of urllib calls (or all I/O code, to be safe). If each script is designed to have only 1 open socket at a time, and your /proc/[PID]/fd is showing multiple active/open sockets per script process, then there is definitely a code issue to fix.
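One way to make the close happen no matter what is contextlib.closing; a rough sketch (the fetch helper is made up; it assumes Python 2.5+, where urlopen's return value has a close() method but no with-statement support of its own):
import urllib2
from contextlib import closing

def fetch(url):
    # closing() guarantees s.close() runs even if read() raises.
    with closing(urllib2.urlopen(url)) as s:
        return s.read()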
ulimit -n showing 1024 means that 1024 is the limit of open sockets/fd's that the mysql user can have. You can change this with ulimit -S -n [LIMIT_#], but check out this article first:
Changing process.max-file-descriptor using 'ulimit -n' can cause MySQL to change table_open_cache value.
You may need to log out and shell back in afterwards. And/or add it to /etc/bashrc (don't forget to source /etc/bashrc if you change bashrc and don't want to log out/in).
Disk space is another thing that I have found out (the hard way) can cause very weird issues. I have had processes act like they are running (not zombied) but not doing what is expected because they had open handles to a log file on a partition with zero disk space left.
netstat -anpTee | grep -i mysql will also show whether these sockets are connected/established/waiting to be closed/waiting on timeout/etc.
Use watch -n 0.1 'netstat -anpTee | grep -i mysql' to see the sockets open/close/change state/etc. in real time in a nice table output (you may need to export GREP_OPTIONS= first if you have it set to something like --color=always).
lsof -u mysql or lsof -U will also show you open fd's (the output is quite verbose).
import socket
import urllib2
from urllib2 import HTTPError, URLError

socket.setdefaulttimeout(15)
# or setdefaulttimeout(0) for non-blocking:
# in non-blocking mode (blocking is the default), if a recv() call
# doesn't find any data, or if a send() call can't
# immediately dispose of the data,
# an error exception is raised.
#......

some_url = 'http://example.com/'  # whichever domain is being checked

s = None
try:
    s = urllib2.urlopen(some_url)
    # do stuff with s like s.read(), s.headers, etc..
except (HTTPError, URLError):
    # myLogger.exception("Error opening: %s!", some_url)
    pass
finally:
    try:
        s.close()
        # del s - although, I don't know if deleting s will help things any.
    except Exception:
        pass
Some man pages and reference links:
- ulimit
- quota
- limits.conf
- fork bomb
- Changing process.max-file-descriptor using 'ulimit -n' can cause MySQL to change table_open_cache value
- python socket module
- lsof
Solved! - with massive help from Chown - thank you very much!
The slowdown was because I was not setting a socket timeout, and as such, over a period of time the robots were hanging trying to read data that did not exist. Adding a simple
import socket

timeout = 5
socket.setdefaulttimeout(timeout)
solved it (shame on me, but in my defence I am still learning Python).
The memory leak is down to urllib and the version of Python I am using. After a lot of googling, it appears to be a problem with nested urlopens; there are lots of posts online about it once you work out how to ask Google the right question.
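For what it's worth, the difference is roughly the one sketched below (the nested line is only a hypothetical illustration, and the fetch helper is just for the sketch; whether the leak bites depends on the urllib2/Python version):
import urllib2

# Nested style: the inner response object is never bound to a name, so its
# socket can only be cleaned up whenever the garbage collector gets to it.
# html = urllib2.urlopen(urllib2.urlopen(url).geturl()).read()

# Flattened style: every response gets a name and an explicit close().
def fetch(url):
    s = urllib2.urlopen(url)
    try:
        return s.read()
    finally:
        s.close()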
Thanks all for your help.
EDIT:
Something that also helped the memory leak issue (although it did not solve it completely) was doing manual garbage collection:
import gc
gc.collect()
Hope it helps someone else.
Another system resource to take into account is ephemeral ports, /proc/sys/net/ipv4/ip_local_port_range (on Linux). Together with /proc/sys/net/ipv4/tcp_fin_timeout they limit the number of concurrent connections.
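As a rough sanity check, you can read both values and estimate the ceiling yourself; a sketch (Linux only, the read_proc helper is just for the sketch, and dividing ports by the FIN timeout is only an approximation of sustainable new connections per second):
def read_proc(path):
    # Read a whitespace-separated value from /proc.
    with open(path) as f:
        return f.read().split()

low, high = map(int, read_proc('/proc/sys/net/ipv4/ip_local_port_range'))
fin_timeout = int(read_proc('/proc/sys/net/ipv4/tcp_fin_timeout')[0])

ports = high - low + 1
print 'ephemeral ports: %d, fin timeout: %ds' % (ports, fin_timeout)
print 'rough ceiling: ~%d new outbound connections per second' % (ports // fin_timeout)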
From Benchmark of Python WSGI Servers:
This basically enables the server to open LOTS of concurrent connections.
echo "10152 65535" > /proc/sys/net/ipv4/ip_local_port_range
sysctl -w fs.file-max=128000
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.core.somaxconn=250000
sysctl -w net.ipv4.tcp_max_syn_backlog=2500
sysctl -w net.core.netdev_max_backlog=2500
ulimit -n 10240
It is probably some system resource you're starved of. A guess: could you be hitting the limit of the pool of sockets your system can handle? If so, you might see improved performance if you can close the sockets faster/sooner.
EDIT: depending on how much effort you want to invest, you could restructure your application so that one process does multiple requests. One socket can be reused within the same process, as can a lot of other resources. Twisted lends itself very well to this type of programming.
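Not Twisted, but just to illustrate the one-process-many-requests shape with the tools already in use here, a minimal sketch using a thread pool and urllib2 (the worker count and example domains are made up):
import Queue
import socket
import threading
import urllib2

socket.setdefaulttimeout(5)        # don't hang forever on dead hosts
work = Queue.Queue()

def worker():
    while True:
        url = work.get()
        try:
            s = urllib2.urlopen(url)
            try:
                s.read()
                # ... record diagnostics for this domain ...
            finally:
                s.close()
        except Exception:
            pass                   # log the reason the domain failed here
        finally:
            work.task_done()

for _ in range(50):                # 50 concurrent requests in one process
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

for domain in ('example.com', 'example.org'):
    work.put('http://%s/' % domain)
work.join()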