Unicode characters in OLE CSV import

Source: https://www.devze.com, 2023-02-17 04:57 (reposted from the web)
I have the following code snippet. It is used to import CSV files that are supplied to us from multiple locations around the world. The file format is always the same and quite simple: First Name, Last Name, Email, some dates, and one or two other text columns. The problem is that some non-English characters (Russian, German, Spanish) are not imported correctly. When I look at the contents of the DataTable it produces, I see, for example, "Ðндрей" where it should be "Андрей", and so on. I have searched for a long time and can't seem to find a solution. If I save the file as an .xls and import that instead (changing my connection string, of course), it works fine, so it seems the Jet engine can handle Unicode characters. Any help would be appreciated. If it matters, I am using VS 2010 on Windows 7 64-bit. Thanks in advance!

  string filename = @"C:\Data\Test.csv";
  string connString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data;Extended Properties=""text;CharacterSet=UNICODE;HDR=Yes;FMT=Delimited"";";
  string commString = string.Format("Select * from {0}", filename);

  DataTable dt = new DataTable();
  using (OleDbConnection connection = new OleDbConnection(connString))
  {
    connection.Open();
    using (OleDbDataAdapter da = new OleDbDataAdapter(commString, connection))
    {
      da.Fill(dt);
    }
  }


Try

characterset=65001

within your connection string for UTF-8.

string connString = @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data;Extended Properties=""text;characterset=65001;HDR=Yes;FMT=Delimited"";";

Follow the link for other codes.


Microsoft products (my only experience is with Excel) require a byte order mark (BOM) as the first 2 bytes (for UTF-16LE/BE) or 3 bytes (for UTF-8) of the file. When you save a file from Excel as "Unicode Text", you can see that it writes FF FE as the first two bytes and encodes the rest of your data as UTF-16LE. The Notepad save options are similar:

Notepad Encoding     BOM        Character Encoding
-------------------  ---------  --------------------
Unicode              FF FE      UTF-16LE
Unicode Big Endian   FE FF      UTF-16BE
Utf8                 EF BB BF   UTF-8

So check the CSV files in a hex editor or similar to see whether there is a byte order mark. I suspect it will be missing and the file goes straight into the data, because the raw bytes of your UTF-8 string are being interpreted as Windows-1252:

UTF-8 String:  Андрей
Bytes:         D0 90 D0 BD D0 B4 D1 80 D0 B5 D0 B9
Windows-1252:  Ð<ERR>ндрей
Where <ERR> appears because 0x90 is not a valid Windows-1252 byte.
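
For anyone who wants to reproduce the byte sequence, a small sketch along these lines should do it. The class and method names are made up for illustration. It also uses ISO-8859-1 as a stand-in for Windows-1252 (an assumption, chosen because it is available on every .NET runtime); the two code pages agree everywhere except bytes 0x80 to 0x9F, so the 0x90 and 0x80 bytes will not come out exactly as they would under Windows-1252.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hex dump of the UTF-8 encoding of a string, e.g. "D0 90 D0 BD ...".
    public static string Utf8Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s)).Replace("-", " ");
    }

    // Decodes the UTF-8 bytes with a single-byte code page, so every byte
    // becomes its own character. ISO-8859-1 stands in for Windows-1252 here.
    public static string MisreadAsSingleByte(string s)
    {
        Encoding singleByte = Encoding.GetEncoding("ISO-8859-1");
        return singleByte.GetString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        // The name from the question, written with escapes so the source
        // file's own encoding cannot interfere.
        string name = "\u0410\u043D\u0434\u0440\u0435\u0439";

        Console.WriteLine(Utf8Hex(name));
        // Six Cyrillic letters turn into twelve single-byte characters.
        Console.WriteLine(MisreadAsSingleByte(name).Length);
    }
}
```

Each Cyrillic letter takes two bytes in UTF-8, which is why a single-byte decoder roughly doubles the length of the text.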

http://sodved.awardspace.info/unicode.pl
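
If a hex editor is not at hand, the check for a byte order mark could be scripted roughly like this. It is only a sketch: `BomCheck` and `DetectBom` are hypothetical names, and it recognises just the three marks from the table above.

```csharp
using System;
using System.IO;

static class BomCheck
{
    // Returns the encoding implied by the leading bytes, or "none".
    // Only the three marks from the Notepad table are recognised.
    public static string DetectBom(byte[] head)
    {
        if (head.Length >= 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            return "UTF-8";
        if (head.Length >= 2 && head[0] == 0xFF && head[1] == 0xFE)
            return "UTF-16LE";
        if (head.Length >= 2 && head[0] == 0xFE && head[1] == 0xFF)
            return "UTF-16BE";
        return "none";
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: BomCheck <file>");
            return;
        }

        // Read at most the first three bytes of the file.
        byte[] head = new byte[3];
        int read;
        using (FileStream fs = File.OpenRead(args[0]))
        {
            read = fs.Read(head, 0, head.Length);
        }
        Array.Resize(ref head, read);

        Console.WriteLine(DetectBom(head));
    }
}
```

A result of "none" for a file that displays correctly as UTF-8 in a text editor would match the suspicion above that the mark is missing.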

That leaves you with two options:

  • If you know the encoding of the files (it looks like UTF-8 from your symptoms), see whether you can specify it to whatever processes the file. Often there is some parameter or option for it.
  • Add the byte order mark to the data before processing.
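
The second option might look something like the following. This is a sketch under assumptions: the files are taken to be UTF-8 already, `BomWriter` and `WithUtf8Bom` are made-up names, and `Main` overwrites the file in place, so try it on a copy first. It does not verify that the Jet text driver actually honours the mark.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class BomWriter
{
    // Returns the input unchanged if it already starts with EF BB BF,
    // otherwise a new array with the UTF-8 byte order mark in front.
    public static byte[] WithUtf8Bom(byte[] data)
    {
        byte[] bom = Encoding.UTF8.GetPreamble(); // EF BB BF
        bool hasBom = data.Length >= bom.Length
                      && data.Take(bom.Length).SequenceEqual(bom);
        if (hasBom)
        {
            return data;
        }

        byte[] result = new byte[bom.Length + data.Length];
        Buffer.BlockCopy(bom, 0, result, 0, bom.Length);
        Buffer.BlockCopy(data, 0, result, bom.Length, data.Length);
        return result;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.WriteLine("usage: BomWriter <file>");
            return;
        }

        // Overwrites the file in place (assumes it is already UTF-8).
        File.WriteAllBytes(args[0], WithUtf8Bom(File.ReadAllBytes(args[0])));
    }
}
```

Working on raw bytes rather than decoded text means the data itself is never re-encoded; only the three-byte mark is added.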