I've been using Jsoup to scrape HTML data from a website, but there is one section of XML inside a script tag that I need, because it contains a bunch of image URLs I want to extract and download. Here is what it looks like:
<script type="text/javascript">
var xmlTxt = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?><mediaObject><mediaList rail="1"><carMedia thumbnail="http://images.blah.com/scaler/80/60/images/2011/9/22/307/179/22343202654.307179719.IM1.MAIN.565x421_A.562x421.jpg" url="http://images.blah.com/scaler/544/408/images/2011/9/22/307/179/22343202654.307179719.IM1.MAIN.565x421_A.562x421.jpg" type="INV_PHOTO" mediaLabel="" category="UNCATEGORIZED" sequence="2"/></mediaList></mediaObject>';
That is followed by a whole bunch of JavaScript code inside the script tag. What is the best way to extract those URLs from the page if I have a Jsoup Document? If I can't do it with Jsoup, how can I do it? The problem is that the images are held in a carousel, so the HTML on the page only shows the source for the ones currently displayed in the carousel.
First, you can get xmlTxt into Java using a JavaScript binding; see http://developer.android.com/guide/webapps/webview.html#BindingJavaScript
Second, parse your XML. I'm not sure whether Jsoup can handle general XML (as opposed to HTML). If it can't, you can use Android's built-in XmlPullParser ( http://developer.android.com/reference/org/xmlpull/v1/XmlPullParser.html ) or another XML library.
Well, I did it the dirty way, but it should work. I was hoping there was a more elegant solution, but for now I just converted the doc to a string (doc.toString()) and then got the start and end indexes of the opening and closing XML tags that I want. From there I should be able to use the built-in Java XML parser to do the rest.
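For anyone landing here later, this is roughly what that approach could look like. It is only a sketch under a few assumptions: the class name CarMediaExtractor and the method extractUrls are made up for illustration, the script text is passed in as a plain String, the XML is assumed to sit between single quotes in a `var xmlTxt = '...';` assignment exactly as in the question, and a regex is used in place of manual index arithmetic. It uses only the JDK's built-in DOM parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Jsoup or of the original page's code.
class CarMediaExtractor {

    // Assumption: the XML is assigned as  var xmlTxt = '...';  and contains
    // no single quotes itself (true for the sample in the question).
    private static final Pattern XML_TXT =
            Pattern.compile("var\\s+xmlTxt\\s*=\\s*'(.*?)'\\s*;", Pattern.DOTALL);

    /** Returns the "url" attribute of every carMedia element, or an empty list. */
    static List<String> extractUrls(String scriptSource) {
        List<String> urls = new ArrayList<>();
        Matcher m = XML_TXT.matcher(scriptSource);
        if (!m.find()) {
            return urls; // no xmlTxt assignment in this text
        }
        String xml = m.group(1);
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Parse from a Reader so the encoding in the XML declaration is not needed.
            Document xmlDoc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = xmlDoc.getElementsByTagName("carMedia");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element media = (Element) nodes.item(i);
                urls.add(media.getAttribute("url"));
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse embedded XML", e);
        }
        return urls;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the script content from the question.
        String script = "var xmlTxt = '<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
                + "<mediaObject><mediaList rail=\"1\">"
                + "<carMedia thumbnail=\"http://images.blah.com/thumb.jpg\""
                + " url=\"http://images.blah.com/full.jpg\" type=\"INV_PHOTO\"/>"
                + "</mediaList></mediaObject>'; var somethingElse = 1;";
        extractUrls(script).forEach(System.out::println);
    }
}
```

With Jsoup you could feed this either doc.toString() or, probably cleaner, the text of each script element (doc.select("script") and then data() on each element) until one returns a non-empty list. Reading the thumbnail attribute instead of url would work the same way.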