Parsing an .ashx file in Python

devze.com (https://www.devze.com), 2023-03-19 06:34. Source: the web
I'm trying to parse the URL 'http://www.5min.com/handlers/SitemapHandler.ashx?type=videositemap&page=1' in Python 2.7. The problem is that when I open the URL with urlopen, it doesn't display the source; it displays weird characters. It might be encoded.


You are parsing the web server's response, not an .ashx file. Open that URL in your browser: that is what Python will see when you open it with urlopen.

These are the headers I got with the response when I opened it:

Cache-Control:private
Content-Encoding:gzip
Content-Length:1100193
Content-Type:application/xml
Date:Mon, 11 Jul 2011 20:21:40 GMT
Server:Microsoft-IIS/7.5
Set-Cookie:NSC_bobmztjt-5njo-opjq*80=ffffffff4304fd3345525d5f4f58455e445a4a423660;expires=Mon, 11-Jul-2011 20:23:42 GMT;path=/;httponly
X-AspNet-Version:4.0.30319
X-Powered-By:ASP.NET
X-Server:fmv-m09 - www

The Content-Type header shows that the response is XML, so you will need to parse it with ElementTree (or another XML parser of your choice). Also note that the server is sending the body gzip-compressed (Content-Encoding: gzip). Whether it does so may depend on the Accept-Encoding header your request sends. urlopen does not decompress the body for you, so if you're seeing gibberish, decompress the response with Python's gzip or zlib module before parsing it. ZipFile is for .zip archives and will not read a gzip stream.
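
As a rough sketch of the approach described above, the snippet below decompresses a gzip body with zlib and hands the result to ElementTree. The helper names decode_body and fetch_xml are made up for illustration, and the import fallback is only there so the same code runs on Python 2.7 and Python 3. It assumes the server behaves as the headers above suggest; it has not been checked against the live URL.

```python
import zlib
import xml.etree.ElementTree as ET

try:
    from urllib2 import urlopen          # Python 2.7
except ImportError:
    from urllib.request import urlopen   # Python 3

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(body, content_encoding=None):
    """Return decompressed bytes if the body is gzip-compressed, else the body unchanged.

    Hypothetical helper: checks the Content-Encoding value and, as a
    fallback, the gzip magic bytes at the start of the body.
    """
    is_gzip = (content_encoding or "").lower() == "gzip" or body[:2] == GZIP_MAGIC
    if is_gzip:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        return zlib.decompress(body, 16 + zlib.MAX_WBITS)
    return body


def fetch_xml(url):
    """Download url, undo gzip compression if present, and return the XML root element."""
    response = urlopen(url)
    try:
        body = response.read()
        encoding = response.info().get("Content-Encoding")
    finally:
        response.close()
    return ET.fromstring(decode_body(body, encoding))
```

Calling fetch_xml with the URL from the question should return the root element of the document. If the XML declares a namespace, as sitemap files usually do, element lookups such as root.findall need the namespace-qualified tag names.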
