Stop Jsoup from encoding

Developer https://www.devze.com 2023-03-11 16:45 Source: Internet

I'm trying to parse a URL with Jsoup whose page contains the following text: Ætterni. After parsing the document, the same string looks like this: &AElig;tterni.

How do I prevent this from happening? I want the document 1:1, exactly as it was.

Code:

doc = Jsoup.connect(url).get();
String docEncoding = doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink), docEncoding);
writer.write(doc.html());
writer.close();

Use doc.outputSettings().escapeMode(EscapeMode.xhtml); to avoid the conversion to entities.

You don't seem to be using any of Jsoup's features here. I'd just stream the raw HTML using java.net.URL. That way you get a 1:1 copy of the response.

InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.
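The copy step left as a comment above could look roughly like the following. This is only a sketch: the class name RawCopy and the helper copy are made up for illustration, and an in-memory stream stands in for new URL(url).openStream() so the example runs without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class RawCopy {

    // Copies every byte from input to output without decoding anything,
    // so the original encoding is preserved. Returns the byte count.
    static long copy(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = input.read(buffer)) != -1) {
            output.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for the HTTP response body: the sample word
        // from the question ("\u00C6tterni") encoded as ISO-8859-1.
        byte[] response = "\u00C6tterni".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream saved = new ByteArrayOutputStream();
        try (InputStream input = new ByteArrayInputStream(response);
             OutputStream output = saved) {
            copy(input, output);
        }
        // The saved bytes should be identical to the response bytes.
        System.out.println(Arrays.equals(response, saved.toByteArray()));
    }
}
```

On Java 9 or later, input.transferTo(output) should do the same job as the loop in copy.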

You should not use a Reader/Writer for this, because it may corrupt characters when the source encoding is unknown: the platform default encoding would be used instead.
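To illustrate the risk, here is a small sketch of what happens when bytes are decoded with the wrong charset and then written back. The class name CharsetRoundTrip and the helper roundTrip are hypothetical, and the assumption is that the page was served as ISO-8859-1 while the reader assumed UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetRoundTrip {

    // Decodes the bytes with the assumed charset and encodes them again with
    // the same charset, which is roughly what a Reader/Writer pair does.
    static byte[] roundTrip(byte[] source, Charset assumed) {
        String decoded = new String(source, assumed);
        return decoded.getBytes(assumed);
    }

    public static void main(String[] args) {
        // The sample word from the question, encoded as ISO-8859-1 (assumed).
        byte[] original = "\u00C6tterni".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong guess: the byte 0xC6 is not valid UTF-8 on its own, so it is
        // replaced during decoding and the original byte is lost.
        byte[] viaUtf8 = roundTrip(original, StandardCharsets.UTF_8);

        // Right guess: the bytes survive the round trip unchanged.
        byte[] viaLatin1 = roundTrip(original, StandardCharsets.ISO_8859_1);

        System.out.println(Arrays.equals(original, viaUtf8));   // false
        System.out.println(Arrays.equals(original, viaLatin1)); // true
    }
}
```

Copying raw bytes avoids having to guess the charset at all.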