I want to use XmlSlurper to parse an HTML document that I read with HTTPBuilder. Initially I tried it this way:
def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
def xml = slurper.parse(response)
But it produces an exception:
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
I found a workaround: provide cached DTD files. I also found a simple implementation of a class that should help here:
class CachedDTD {
    /**
     * Return DTD 'systemId' as InputSource.
     * @param publicId
     * @param systemId
     * @return InputSource for locally cached DTD.
     */
    def static entityResolver = [
        resolveEntity: { publicId, systemId ->
            try {
                String dtd = "dtd/" + systemId.split("/").last()
                Logger.getRootLogger().debug "DTD path: ${dtd}"
                new org.xml.sax.InputSource(CachedDTD.class.getResourceAsStream(dtd))
            } catch (e) {
                //e.printStackTrace()
                Logger.getRootLogger().fatal "Fatal error", e
                null
            }
        }
    ] as org.xml.sax.EntityResolver
}
My package tree looks as shown below:
I also slightly modified the code that parses the response, so it now looks like this:
def response = http.get(path: "index.php", contentType: TEXT)
def slurper = new XmlSlurper()
slurper.setEntityResolver(org.yuri.CachedDTD.entityResolver)
def xml = slurper.parse(response)
But now I'm getting a java.net.MalformedURLException. The DTD path logged by the CachedDTD entityResolver is org/yuri/dtd/xhtml1-transitional.dtd, and I can't get it working...
There is an HTML parser that you can use in conjunction with XmlSlurper to address these problems:
http://sourceforge.net/projects/nekohtml/
Sample usage here:
http://groovy.codehaus.org/Testing+Web+Applications
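For illustration, here is a minimal sketch of how NekoHTML's SAX parser can be handed to XmlSlurper. The @Grab coordinates and version, the sample HTML string, and the variable names are assumptions made for this example, not something taken from the linked page. NekoHTML repairs malformed HTML and does not need to download the XHTML DTD. Its element names may come back upper-cased depending on configuration, so the lookup below compares names case-insensitively.

```groovy
// Assumed dependency coordinates; adjust the version to whatever you actually use.
@Grab('net.sourceforge.nekohtml:nekohtml:1.9.22')
import org.cyberneko.html.parsers.SAXParser
// On Groovy 3+ you may also need: import groovy.xml.XmlSlurper

// Hypothetical, deliberately sloppy HTML (unclosed <p> tags, no DOCTYPE).
def html = '<html><body><p>First<p>Second</body></html>'

// XmlSlurper accepts any org.xml.sax.XMLReader, including NekoHTML's parser.
def page = new XmlSlurper(new SAXParser()).parseText(html)

// Walk the whole tree and collect the text of every paragraph element.
def paragraphs = page.'**'.findAll { it.name().equalsIgnoreCase('p') }*.text()
println paragraphs
```

With HTTPBuilder, the reader returned by http.get(..., contentType: TEXT) could be passed to parse(...) instead of using parseText on a string.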
I was able to solve my parsing issue by using another XmlSlurper constructor:
public XmlSlurper(boolean validating, boolean namespaceAware, boolean allowDocTypeDeclaration)
like this:
def parser = new XmlSlurper(false, false, true)
In my XML case, disabling validation (1st parameter false) and allowing the DOCTYPE declaration (3rd parameter true) did the trick.
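A minimal sketch of that constructor in use follows. The setFeature call is an addition that is not part of the answer above: it relies on the Xerces feature "load-external-dtd", which the JDK's bundled parser is assumed to support, so that a non-validating parser does not fetch the DTD from w3.org. The sample XHTML string and variable names are hypothetical.

```groovy
// On Groovy 3+ you may need: import groovy.xml.XmlSlurper
// (on Groovy 2.x, groovy.util.XmlSlurper is imported automatically).

// Hypothetical XHTML input carrying the same DOCTYPE as in the question.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><head><title>Demo</title></head><body><p>Hello</p></body></html>'''

// validating = false, namespaceAware = false, allowDocTypeDeclaration = true
def parser = new XmlSlurper(false, false, true)

// Assumption: Xerces-specific feature; tells the non-validating parser
// not to download the external DTD referenced by the DOCTYPE.
parser.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

def doc = parser.parseText(xhtml)
println doc.body.p.text()
```

Note that with the DTD skipped, named entities defined only in that DTD (such as &nbsp;) would not be resolved, so this only suits documents that do not depend on them.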
Note: