Writing Russian in XML

Posted on https://www.devze.com, 2022-12-25 19:01. Source: the web.
I am writing an XML tag renamer class in Java that reads in an XML file, renames the tags, and writes the result to another XML file using DocumentBuilderFactory and TransformerFactory (text nodes are preserved). It worked fine with German and English texts until today, when I tried it on an XML file containing Russian text. Instead of the source text I got ????? in the newly created XML file.

Any idea how to correct this?

P.S. The strings were correct before entering TransformerFactory, as I checked in the debugger. I've tried setting OutputKeys.ENCODING to UTF-8 and to ISO-8859-5. Neither helped.

The Transformer part:

// Output the XML

// Set up a transformer
TransformerFactory transFactory = TransformerFactory.newInstance();
Transformer transformer = transFactory.newTransformer();
transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
// Fix to a bug about indent in transformer
transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
transformer.setOutputProperty(OutputKeys.INDENT, "yes");

// TODO encoding parameter
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");

// Create string from xml tree
StringWriter sw = new StringWriter();
StreamResult result = new StreamResult(sw);
DOMSource source = new DOMSource(doc);
transformer.transform(source, result);

String xmlString = sw.toString();

// Strings are immutable, so the result of replaceAll must be assigned back
xmlString = xmlString.replaceAll("\n", System.getProperty("line.separator"));


// Write to file
BufferedWriter output = new BufferedWriter(new FileWriter(outputPath));
output.write(xmlString);
output.close();


I'd suggest directly outputting the result from the transformer to file:

transformer.transform(source, new StreamResult(
   new OutputStreamWriter(new FileOutputStream(outputPath), "UTF-8")));
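For context, here is a minimal, self-contained sketch of that suggestion. The class name DirectXmlWriter, the helper writeXml, and the sample greeting element are made up for illustration and are not from the question. The Cyrillic text is written with Unicode escapes so the example does not depend on the encoding of the source file. The FileWriter in the question encodes with the default charset, which is the likely source of the ????? output when that default cannot represent Cyrillic; the sketch names UTF-8 explicitly instead.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, for illustration only.
public class DirectXmlWriter {

    // Serializes the DOM tree straight to the file, encoding characters as UTF-8.
    static void writeXml(Document doc, File outputFile) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        // Keep the encoding named in the XML declaration in sync with the writer below.
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        // try-with-resources flushes and closes the writer even if transform() throws.
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream(outputFile), StandardCharsets.UTF_8)) {
            transformer.transform(new DOMSource(doc), new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny document containing Russian text (the word for "hello").
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element root = doc.createElement("greeting");
        root.setTextContent("\u041f\u0440\u0438\u0432\u0435\u0442");
        doc.appendChild(root);

        File out = File.createTempFile("russian", ".xml");
        writeXml(doc, out);
        System.out.println("Wrote " + out.getAbsolutePath());
    }
}
```

Skipping the intermediate StringWriter also removes the second place where an encoding could be chosen by accident.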


Your problem is (almost certainly) that you're mixing up characters and bytes. You can get away with that in English (and mostly in German too), but with scripts like Cyrillic, Japanese, or Chinese you have to get it right. The first thing to check is whether the xmlString variable contains any characters outside the range \u0000 to \u00ff. If it does, the string holds real characters, and you have to use an OutputStreamWriter instance to do the mapping from characters to bytes. If it does not, the encoding has already been applied, and you instead need to write the bytes trapped in that string to the file without mangling them further (again, an OutputStreamWriter is the easiest way to get that right, but with the ISO-8859-1 encoding at that final stage, since it doesn't remap bytes around).
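The check described above could be sketched as follows. The names CharsetCheck, hasCharsOutsideLatin1, and write are hypothetical, and the "trapped bytes" string is simulated by decoding UTF-8 bytes as ISO-8859-1. Treat it as a diagnostic aid under those assumptions, not as a general way to detect encodings.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
public class CharsetCheck {

    // True if the string contains any character above U+00FF,
    // meaning it holds real characters rather than bytes stored one per char.
    static boolean hasCharsOutsideLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return true;
            }
        }
        return false;
    }

    // Writes the string to the file using an explicitly named charset.
    static void write(String text, File file, Charset charset) throws IOException {
        try (Writer out = new OutputStreamWriter(new FileOutputStream(file), charset)) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Real characters: the Russian word for "hello", given as Unicode escapes.
        String russian = "\u041f\u0440\u0438\u0432\u0435\u0442";
        System.out.println(hasCharsOutsideLatin1(russian)); // true

        // Simulated "bytes trapped in a string": each char carries one UTF-8 byte.
        String trapped = new String(
                russian.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(hasCharsOutsideLatin1(trapped)); // false

        File fromChars = File.createTempFile("chars", ".txt");
        File fromBytes = File.createTempFile("bytes", ".txt");
        write(russian, fromChars, StandardCharsets.UTF_8);      // characters to bytes
        write(trapped, fromBytes, StandardCharsets.ISO_8859_1); // bytes pass through unchanged

        // Both routes should produce identical file contents.
        System.out.println(Arrays.equals(
                Files.readAllBytes(fromChars.toPath()),
                Files.readAllBytes(fromBytes.toPath()))); // true
    }
}
```

Note that a string of purely Western European text also stays within U+00FF, so the check is only informative when the text is known to contain Cyrillic or similar scripts.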

Outputting the transformed XML directly from the transformer is easier than capturing it first. After all, most XML is only human-readable in a technical sense…
