I have the following code to open and read URLs:
html_data = urllib2.urlopen(req).read()
and I believe this is the most standard way to read data over HTTP. However, when the response uses chunked transfer encoding, the response body starts with the following characters:
1eb0\r\n2625\r\n
<?xml version="1.0" encoding="UTF-8"?>
...
This happens because of the chunked encoding mentioned above, and as a result my XML data becomes corrupted.
So I wonder how I can get rid of all the metadata related to the chunked encoding?
I ended up with custom XML stripping, like this:
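end_tag = '</mytag>'
xml_start = html_data.find('<?xml')
if xml_start > 0:
    # Leading chunk metadata found: log it, then drop everything before the XML declaration.
    log_user_action(req.get_host(), 'chunked data', html_data, {})
    html_data = html_data[xml_start:]
# Look for the closing tag only after trimming the front, so the index is still valid.
xml_end = html_data.rfind(end_tag)
if xml_end != -1:
    # Keep the whole closing tag and drop anything that follows it.
    html_data = html_data[:xml_end + len(end_tag)]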
I can't find any simpler solution.
1eb0\r\n2625\r\n looks like chunk-size lines: in chunked transfer encoding, each chunk is preceded by its length in hex, followed by \r\n
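For reference, here is a minimal sketch of how standard chunked framing can be decoded by hand. It assumes the raw body is available as bytes and follows the usual layout: a hex size line, \r\n, the chunk data, \r\n, and finally a zero-size chunk. The name decode_chunked and the sample bytes are made up for illustration. The response in the question shows two size lines back to back, so it may not fit this layout exactly.

```python
def decode_chunked(raw: bytes) -> bytes:
    """Reassemble a chunked HTTP body: hex size line, CRLF, data, CRLF, ..."""
    body = bytearray()
    pos = 0
    while True:
        line_end = raw.find(b"\r\n", pos)
        if line_end == -1:
            raise ValueError("missing chunk-size line")
        # A size line may carry optional extensions after ';', which are ignored here.
        size_field = raw[pos:line_end].split(b";", 1)[0].strip()
        size = int(size_field, 16)
        pos = line_end + 2
        if size == 0:
            break  # a zero-size chunk ends the body; any trailers are ignored
        chunk = raw[pos:pos + size]
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        body += chunk
        pos += size
        if raw[pos:pos + 2] != b"\r\n":
            raise ValueError("chunk data not followed by CRLF")
        pos += 2
    return bytes(body)


# Hypothetical two-chunk body: 5 bytes, then 3 bytes, then the terminator.
sample = b"5\r\n<?xml\r\n3\r\n ?>\r\n0\r\n\r\n"
decoded = decode_chunked(sample)
```

Raising ValueError on malformed input is a design choice for this sketch, so that a body that is not really chunked fails loudly instead of being silently mangled.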
You can remove everything before <?xml:
html_data = html_data[html_data.find('<?xml'):]
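One caveat with the one-liner: find() returns -1 when the marker is missing, and slicing from -1 keeps only the last character. A slightly more defensive sketch, with the hypothetical helper name strip_before_xml, could look like this:

```python
def strip_before_xml(data: str, marker: str = "<?xml") -> str:
    """Return data starting at the marker, or data unchanged if the marker is absent."""
    start = data.find(marker)
    # find() returns -1 when the marker is missing; leave the data alone in that case.
    return data[start:] if start != -1 else data


# Sample input modelled on the response shown in the question.
cleaned = strip_before_xml('1eb0\r\n2625\r\n<?xml version="1.0" encoding="UTF-8"?>')
```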