I'm trying to scrape this XML file from the web with urllib and cElementTree. I am using Google App Engine, but I don't think the problem is specific to my platform.
This is my error:
<type 'exceptions.SyntaxError'>: not well-formed (invalid token): line 1, column 25
Traceback (most recent call last):
File "/base/data/home/apps/metautoit/daily-update.353244196034914877/Start_Update.py", line 25, in main
ShoppingCar.XMLRipper().getNew()
File "/base/data/home/apps/metautoit/daily-update.353244196034914877/updatecars/sitecrawlers/ShoppingCar.py", line 24, in getNew
for carDict in newCars:
File "/base/data/home/apps/metautoit/daily-update.353244196034914877/updatecars/sitecrawlers/ShoppingCar.py", line 67, in _iter_carDicts_in_xml
tree = self.get_xml()
File "/base/data/home/apps/metautoit/daily-update.353244196034914877/updatecars/sitecrawlers/ShoppingCar.py", line 63, in get_xml
return ET.parse(req, parser=parser)
File "<string>", line 45, in parse
File "<string>", line 28, in parse
The xml file is long but here is a sample:
<?xml version="1.0" encoding="windows-1252"?><veicoli>
<veicolo>
<id><![CDATA[16529]]></id>
<link><![CDATA[http://www.shoppingcar.it/auto_usate_/Chrysler_PT_Cruiser/16529.asp]]></link>
<marca><![CDATA[Chrysler]]></marca>
<modello><![CDATA[PT Cruiser]]></modello>
<versione><![CDATA[2.4 L]]></versione>
<provincia><![CDATA[Padova]]></provincia>
<anno><![CDATA[2006]]></anno>
<mese><![CDATA[4]]></mese>
<chilometri><![CDATA[26000]]></chilometri>
<cilindrata><![CDATA[]]></cilindrata>
<potenza><![CDATA[143]]></potenza>
<alimentazione><![CDATA[Benzina]]></alimentazione>
<cambio><![CDATA[Cambio Automatico]]></cambio>
<colore><![CDATA[nero]]></colore>
<prezzo><![CDATA[14900]]></prezzo>
<immagine><![CDATA[http://www.shoppingcar.it/public/Auto%20Usate/Berline/imagesadv/16529_2.jpg]]>
</immagine>
</veicolo>
</veicoli>
My (simplified) code looks like this:
xml_url = "http://www.shoppingcar.it/feed/export_vel.asp?parametro=1"
req = urllib.urlopen(xml_url)
parser = ET.XMLParser(encoding="windows-1252")
tree = ET.parse(req, parser=parser)
Here's the kicker: I downloaded the file and re-uploaded it as a public Dropbox file. Using that URL, the XML parses just fine. I've tried without declaring the encoding, and I've tried both the windows-1252 and utf-8 encodings. It's really strange, because column 25 is nothing but the middle of the word "encoding". Any help is appreciated.
I tried your code (added imports, so that others can try):
#!/usr/bin/env python
import xml.etree.cElementTree as ET
import urllib
xml_url = "http://www.shoppingcar.it/feed/export_vel.asp?parametro=1"
req = urllib.urlopen(xml_url)
parser = ET.XMLParser(encoding="windows-1252")
tree = ET.parse(req, parser=parser)
and it runs just fine. If the error only happens on the server, then you have probably hit a limit on the website and are trying to parse an error message instead of the feed. So make sure that you are actually parsing the document: e.g. read the response with data = req.read(), then dump data and parse the string as XML.
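A minimal sketch of that check, in case it helps. The names parse_feed and fetch_feed are made up for illustration, and it assumes an interpreter where ET.ParseError exists (Python 2.7 or later; older cElementTree versions raise a plain SyntaxError, as in the traceback above). The idea is to read the raw bytes first, so that when parsing fails you can see what the server actually sent:

```python
import xml.etree.ElementTree as ET

try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen  # Python 2, as in the question


def parse_feed(data):
    """Parse raw feed bytes; on failure, report how the response starts.

    Hypothetical helper, not part of the original code.
    """
    try:
        # The parser picks up the encoding from the XML declaration itself.
        return ET.fromstring(data)
    except ET.ParseError as exc:
        # Show the first bytes so an HTML error page or quota message is obvious.
        raise ValueError(
            "not well-formed XML (%s); response starts with %r" % (exc, data[:200])
        )


def fetch_feed(url):
    """Download the whole response, then hand the bytes to parse_feed."""
    response = urlopen(url)
    try:
        data = response.read()
    finally:
        response.close()
    return parse_feed(data)
```

Calling root = fetch_feed(xml_url) on the server should then either return the veicoli root element or raise a ValueError whose message shows the first 200 bytes of whatever came back.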