I'm querying a web service using urllib2 and receiving XML. If I violate the web service's rate limit (1 call/second), I receive HTML back saying I've violated the rate limit.
Even though I can time.sleep() for 2-3 seconds after each call, I still, for whatever reason, violate the rate limit.
To test whether my response is XML or HTML, I'm using xml.dom.minidom.parseString() and then testing for the presence of an html element:
try:
    dom = xml.dom.minidom.parseString(response_text)
except xml.parsers.expat.ExpatError:
    return False
if len(dom.getElementsByTagName('html')) == 0:
    return True
else:
    return False
This gets the job done but I've run into a case where one of the XML attributes contains XML. In that case, the parseString() command fails with
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/minidom.py", line 1918, in parse
    return expatbuilder.parse(file)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 924, in parse
    result = builder.parseFile(fp)
  File "/opt/python/default-2.6/lib/python2.6/xml/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 3125
In this case, column 3125 is part of some attribute value text that contains ampersand-pound-x-9 (Stackoverflow is hiding my unicode).
Should xml.dom.minidom be able to handle this? Could there be another issue with the XML besides this that's causing the parsing to fail?
Additionally, I'm open to other ways of handling this type of situation if the community has one.
If it helps, here is what the web service returns when I've violated their rate limit:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="eng">
<head>
<title>Service Temporarily Unavailable - Rate Limited</title>
</head>
<body style="text-align:center;background-color:white;">
<h1>Service Temporarily Unavailable</h1>
<hr />
<div>
You have used this service too often in a short time. Please wait before using this service again.
<br/><br/>
Please visit the <a href="http://wiki.xxxx.com/index.php?title=API_Usage">wiki</a> for more details.
</div>
</body>
</html>
I think that &#x9; is a tab. You could try http://docs.python.org/library/htmllib.html#module-htmlentitydefs to convert special HTML entities back to the characters they stand for. (That may cause problems with &lt; etc.) Or you could do a string substitution that replaces &#x9; with a space.
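A minimal sketch of the substitution idea, assuming the response has already been decoded to text. The pattern and the helper name replace_tab_references are mine, not part of any library, and this only helps if the tab reference really is what upsets the parser (a well-formed &#x9; is legal XML, so it may not be).

```python
import re
from xml.dom import minidom

# Matches a character reference for TAB in hex (&#x9;) or decimal (&#9;) form,
# allowing leading zeros. The pattern is an assumption about the payload.
TAB_REFERENCE = re.compile(r'&#(?:x0*9|0*9);', re.IGNORECASE)


def replace_tab_references(xml_text):
    """Return xml_text with tab character references replaced by a space."""
    return TAB_REFERENCE.sub(' ', xml_text)


sample = '<root><a href="&#x9;foo&lt;">link</a></root>'
cleaned = replace_tab_references(sample)

# The cleaned text still parses, and the attribute now starts with a space.
dom = minidom.parseString(cleaned)
href = dom.getElementsByTagName('a')[0].getAttribute('href')
```

Other references such as &lt; are left alone, so the markup stays well-formed.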
Just as a suggestion, when you're parsing stuff, and the parser runs into a problem, such as not fitting your pattern, instead of stopping the operation, you should allow the parser to continue, but spit out a warning. This way you can see what the problem is, and potentially correct it, or at least see that there's a problem.
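Something along these lines could surface the problem instead of silently returning False. It is only a sketch: parse_or_warn and its context parameter are made-up names, and it relies on the lineno and offset attributes that ExpatError carries. For non-ASCII input the reported offset may not line up exactly with character positions.

```python
import logging
import xml.parsers.expat
from xml.dom import minidom

log = logging.getLogger(__name__)


def parse_or_warn(response_text, context=40):
    """Parse response_text; on failure log the offending region and return None."""
    try:
        return minidom.parseString(response_text)
    except xml.parsers.expat.ExpatError as err:
        # ExpatError reports a 1-based line number and a 0-based column offset.
        lines = response_text.splitlines()
        line = lines[err.lineno - 1] if 0 < err.lineno <= len(lines) else ''
        start = max(err.offset - context, 0)
        snippet = line[start:err.offset + context]
        log.warning('XML parse failed (%s); near: %r', err, snippet)
        return None
```

The logged snippet shows the text around the reported column, which is usually enough to tell whether the culprit is a bad entity or something else.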
Also, as to your problem with the rate limit, why not cache the requested HTML once so you can perform processing locally?
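A rough sketch of the caching idea, assuming you already have some function that downloads a URL and returns the body. Here that function is passed in as fetch, and cached_fetch and _cache are names I made up. In practice you would want to avoid storing a rate-limit page in the cache.

```python
# In-memory cache keyed by URL; it lives only as long as the process does.
_cache = {}


def cached_fetch(url, fetch):
    """Return the body for url, calling fetch(url) only on the first request."""
    if url not in _cache:
        _cache[url] = fetch(url)
    return _cache[url]
```

Repeated processing of the same URL then never touches the service again, so it cannot trip the rate limit.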
You could also test the string for HTML before attempting to parse the result:
if response_text.lstrip().startswith('<!DOCTYPE html'):
    # we received an html response, sleep again
    ...
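Fleshed out a little, that check could drive a retry loop. This is a sketch under assumptions: fetch is whatever function you use to download the body as text, and looks_like_html, fetch_xml, retries and delay are hypothetical names and defaults, not anything the service documents.

```python
import time


def looks_like_html(response_text):
    """Heuristic: True if the body starts with an HTML doctype or <html> tag."""
    head = response_text.lstrip()[:100].lower()
    return head.startswith('<!doctype html') or head.startswith('<html')


def fetch_xml(fetch, url, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url) until the body is not HTML, sleeping between attempts."""
    for attempt in range(retries):
        body = fetch(url)
        if not looks_like_html(body):
            return body
        # Rate limited: wait a little longer after each failed attempt.
        sleep(delay * (attempt + 1))
    raise RuntimeError('still rate limited after %d attempts' % retries)
```

Because the check never invokes the XML parser, a response with a troublesome attribute is no longer mistaken for the rate-limit page.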
I also couldn't get minidom to blow up on an attribute containing a tab entity. Perhaps it is an improperly terminated entity sequence, like &#x9 without the ending semicolon? Minidom seems okay with properly-escaped entities inside attributes:
from xml.dom import minidom
text = '<root><a href="&#x9;foo&lt;">link</a></root>'
tree = minidom.parseString(text)
print tree.toxml()
u'<?xml version="1.0" ?>\n<root><a href="\tfoo&lt;">link</a></root>'
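To compare a properly terminated reference with an unterminated one, something like this should do. The parses helper and the two sample strings are just for illustration, and I am assuming the failing document looks roughly like the bad case, which I can't confirm without seeing it.

```python
import xml.parsers.expat
from xml.dom import minidom


def parses(xml_text):
    """Return True if minidom accepts xml_text, False on an ExpatError."""
    try:
        minidom.parseString(xml_text)
        return True
    except xml.parsers.expat.ExpatError:
        return False


good = '<root><a href="&#x9;foo&lt;">link</a></root>'
bad = '<root><a href="&#x9 foo&lt;">link</a></root>'  # reference lacks its semicolon

# With the well-formed input, the reference is decoded to a literal tab.
good_href = minidom.parseString(good).getElementsByTagName('a')[0].getAttribute('href')
```

If your document behaves like bad here, the service is emitting malformed XML and the reference would need repairing before parsing.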