Using chardet to detect bad encoding in a MySQL db with JDBC

Recently we moved our MySQL DB from Latin1 to UTF-8. After trying a few different approaches to convert it, we weren't able to find any that didn't also introduce some pretty nasty data loss (and many simply did nothing).

This left me wondering if we have lots of different encodings going on, since there doesn't seem to be a single approach that covers our test cases (various posts in our database). To test this theory I wrote a small Scala app (my first, feel free to make fun of just how cobbled and non-idiomatic it is! :D) that uses chardet to look at the posts and tell me the encoding.

Only one problem: everything always comes back as UTF-8.

Here's the code:

package main.scala

import org.mozilla.universalchardet.UniversalDetector
import java.sql.DriverManager

object DBConvert {
  def main(args: Array[String]) {
    val detector = new UniversalDetector(null)
    val db_conn_str = "jdbc:mysql://localhost:3306/mt_pre?user=root"
    val connection = DriverManager.getConnection(db_conn_str)

    try {
      val statement = connection.createStatement()
      val rs = statement.executeQuery("SELECT * FROM mt_entry where entry_id = 3886")
      while (rs.next) {
        // Feed the column's bytes to the detector
        val buffer = rs.getBytes("entry_text_more")
        detector.handleData(buffer, 0, buffer.length)
        detector.dataEnd()

        val encoding: String = detector.getDetectedCharset

        if (encoding != null) println("Detected encoding = " + encoding)
        else println("No encoding detected.")

        detector.reset()

        // Just so we can see the output
        println(rs.getString("entry_text_more"))
      }
    } catch {
      case e: Exception => println(e.getMessage)
    } finally {
      connection.close()
    }
  }
}

I tried passing useUnicode in the JDBC connection string, and also characterEncoding. Neither of them budged the UTF-8 that always comes out. I also tried using getBinaryStream and others; still UTF-8.
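(One approach that might get around this, sketched below on the assumption that Connector/J converts text columns to the connection character set before handing them back, so chardet only ever sees the converted bytes, is to cast the column to binary in the SQL itself so the bytes arrive exactly as stored:)

// Hypothetical variant of the query above: casting the column to binary
// should make the server/driver skip character conversion, so the detector
// sees the bytes as they are actually stored in the table.
val raw = statement.executeQuery(
  "SELECT CAST(entry_text_more AS BINARY) AS raw_text FROM mt_entry where entry_id = 3886")
while (raw.next) {
  val buffer = raw.getBytes("raw_text")
  detector.handleData(buffer, 0, buffer.length)
  detector.dataEnd()
  println("Detected encoding = " + detector.getDetectedCharset)
  detector.reset()
}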

Fully admit that character encoding makes my head bend a bit, and playing with a new language may not be the best way of fixing this problem. :) That said, I'm curious: is there a way to grab the data from the DB and detect what encoding it was put in there as, or is it one of those things where, since it's encoded as UTF-8 in the DB, no matter how you retrieve it that's just what it is (funny characters and all)?

Thanks!


I once had a similar problem; see this answer. Setting the encoding inside the connection string may help.
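For example, something along these lines (the characterEncoding value here is just an illustration; the right value depends on what the data actually is):

// Hypothetical connection string with an explicit character encoding;
// useUnicode and characterEncoding are the Connector/J properties mentioned above.
val db_conn_str =
  "jdbc:mysql://localhost:3306/mt_pre?user=root&useUnicode=true&characterEncoding=latin1"
val connection = DriverManager.getConnection(db_conn_str)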


Note that the table charset, the connection charset, and the default database encoding should all be the same UTF-8. I had one instance in which the database default was UTF-8, but the table columns were still Latin1, so I had some problems. Please check whether that's the case.
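One quick way to check the column-level charsets (a sketch against information_schema, reusing the database and table names from the question):

// List the declared character set of each text column in mt_entry;
// a column-level charset can differ from the table and database defaults.
val cols = connection.createStatement().executeQuery(
  "SELECT COLUMN_NAME, CHARACTER_SET_NAME FROM information_schema.COLUMNS " +
  "WHERE TABLE_SCHEMA = 'mt_pre' AND TABLE_NAME = 'mt_entry' " +
  "AND CHARACTER_SET_NAME IS NOT NULL")
while (cols.next) {
  println(cols.getString("COLUMN_NAME") + " -> " + cols.getString("CHARACTER_SET_NAME"))
}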
