
Problem reading ASCII encoded XML and saving as UTF-8

Developer https://www.devze.com 2023-03-21 14:22 Source: Internet
I have a Java application that reads in some XML data that is declared as ASCII-encoded. I read the data in via a SAXReader so that I can parse the XML as a Document. Finally, I convert the XML to a String and save it to a MySQL database. The problem is that the save to the database fails with the following error: SQL state [HY000]; error code [1366]; Incorrect string value: '\xEF\xBC\x93con...' for column 'p_xml_data' at row 1

I can't work out why this fails, but I assume it has to do with encoding types. The database table/column is defined as UTF-8.
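One way to narrow this down is to decode the bytes quoted in the error message. The sketch below is only an illustration (the class name `OffendingBytes` is made up). It treats the three bytes `EF BC 93` as UTF-8, where they form a single non-ASCII character, U+FF13 (a fullwidth digit three). That suggests the feed contains text beyond plain ASCII, and that the bytes arriving at the column are UTF-8 while something on the way is probably expecting a different character set.

```java
import java.nio.charset.StandardCharsets;

class OffendingBytes {
    // Decode the three bytes quoted in the MySQL error message as UTF-8.
    static String decode() {
        byte[] bytes = {(byte) 0xEF, (byte) 0xBC, (byte) 0x93};
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decode();
        int cp = s.codePointAt(0);
        // Print the code point and its Unicode name for inspection.
        System.out.printf("U+%04X %s%n", cp, Character.getName(cp));
    }
}
```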

Here is the snippet of code I am using:

    final URL url = new URL(feedUrl);
    final SAXReader reader = new SAXReader();
    reader.setValidation(false);
    reader.setIgnoreComments(true);

    Document document = reader.read(url);
    Document savedDocument = document;

    processXml(document.getRootElement());

    String xml = document.asXML().replaceAll("\\s+\n", "");

    feed.setXmlData(xml);

    // now we have the basic XML, let's save it
    feed = getSonyPSNModule().save(feed);

Here is some of the incoming XML, as shown by the debugger for the document object.

    <?xml version="1.0" encoding="ASCII"?>
    <rss xmlns:dc="http://purl.org/dc/elements/1.1/" >
      <channel>
        <title>Name.com - Name&#xae;3 Games</title>
        <link>http://test.com</link>
        <description>Name.com - Name&#xae;3 Games</description>
        <title>Assassin's Creed&#x2122;</title>

What seems odd is that there is a literal apostrophe in the title, but the registered trademark character is encoded as &#xae; (and the trade mark character as &#x2122;).
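For what it's worth, this is the expected behaviour for an ASCII-encoded document: the apostrophe is an ASCII character and can appear literally, while characters outside ASCII have to be written as numeric character references, which the parser then resolves. Here is a minimal sketch using the JDK's built-in DOM parser rather than dom4j, purely to stay self-contained. The class name `CharRefDemo` is hypothetical, and the declaration uses `US-ASCII` rather than `ASCII` as an assumption about which encoding names the parser accepts.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class CharRefDemo {
    // Parse an ASCII-encoded XML string and return the root element's text.
    static String parseRootText(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.US_ASCII);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"US-ASCII\"?>"
                + "<title>Assassin's Creed&#x2122;</title>";
        String title = parseRootText(xml);
        // The reference &#x2122; has been resolved to the single char U+2122.
        System.out.println(title.endsWith("\u2122"));
    }
}
```

After parsing, the Java String holds the real characters, not the references, so anything written out later contains non-ASCII text.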

Does anyone have any ideas what is happening here? I have been trying all sorts of methods and attempting to change encoding types at various points but to no avail.

Here's hoping someone else has had this problem and resolved it!


So you want to change the encoding. The bytes themselves shouldn't change, as UTF-8 is a superset of ASCII.
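The superset claim is easy to check. A small sketch (the class name `AsciiSubset` is made up for illustration): pure ASCII text encodes to the same bytes in both charsets, whereas text containing a character such as the trade mark sign does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class AsciiSubset {
    // True when the text has identical byte representations in US-ASCII and UTF-8.
    static boolean sameBytes(String text) {
        return Arrays.equals(
                text.getBytes(StandardCharsets.US_ASCII),
                text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(sameBytes("Assassin's Creed"));       // ASCII only
        System.out.println(sameBytes("Assassin's Creed\u2122")); // contains U+2122
    }
}
```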

I would work on the raw text instead: change the encoding declared in it and remove the newlines there.


From the Java docs:

"A String represents a string in the UTF-16 format ..." (http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html)

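So, assuming you're trying to save a String in the database here, the chain is: ASCII (XML) -> UTF-16 (Java String) -> UTF-8 (database). The last step is where it goes wrong right now, so that String has to reach the database as UTF-8. The obvious thing to try is one of the String constructors: new String(oldString.getBytes("UTF-8")); Be aware, though, that this decodes the UTF-8 bytes again with the platform default charset, so it leaves the text unchanged when the default is UTF-8 and garbles non-ASCII characters otherwise. A String is always UTF-16 in memory, so the encoding step that matters is most likely the one done by the JDBC driver and connection.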
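To see what the getBytes / new String round trip actually does, here is a small sketch (the class and method names are hypothetical). It encodes a String to UTF-8 and decodes the bytes with an explicit charset, which mimics what the single-argument constructor does with whatever the platform default happens to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {
    // Encode to UTF-8, then decode the bytes using the given charset.
    static String reencode(String s, Charset decodeWith) {
        return new String(s.getBytes(StandardCharsets.UTF_8), decodeWith);
    }

    public static void main(String[] args) {
        String tm = "\u2122"; // trade mark sign
        // Decoding with UTF-8 gives back the original String: nothing was converted.
        System.out.println(reencode(tm, StandardCharsets.UTF_8).equals(tm));
        // Decoding with ISO-8859-1 turns the three UTF-8 bytes into three chars.
        System.out.println(reencode(tm, StandardCharsets.ISO_8859_1).length());
    }
}
```

Either way the String itself never "becomes UTF-8". If the database side is the culprit, the connection settings are the more likely place to look. With MySQL Connector/J that usually means the `useUnicode` and `characterEncoding` connection properties, though that is an assumption about the setup here.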
