Double-byte string comparison in C#

开发者 (Developer) https://www.devze.com 2023-01-24 13:26 Source: web
I have two strings one with a double-byte value and the other is a single byte-one. The string comparison result returns false, how do I get them to compare correctly after ignoring the single-byte/double-byte difference?

string s1 = "ｓｍａｔｓｕｍｏｔｏ１１";
string s2 = "smatsumoto11";

In the same scenario, if you have a nvarchar column in SQL server which contains the value ｓｍａｔｓｕｍｏｔｏ１１, a query to fetch the data with the where condition having the string smatsumoto11 will return the same row. I need similar semantics with C# string comparison.

I have tried a few options mentioned on MSDN but they don't seem to work.

Any ideas?


Your s1 contains so-called "fullwidth" characters, so you can use string.Compare and tell it to ignore character width:

string.Compare(s1, s2, CultureInfo.CurrentCulture, CompareOptions.IgnoreWidth);

(Of course, specify a different CultureInfo if necessary.)
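
A minimal sketch of how that call might be wrapped to get a yes/no answer. The names `WidthInsensitive`, `FullwidthSample`, and `AreEqual` are hypothetical, and falling back to the invariant culture is an assumption made so the result does not depend on machine settings. The sample constant spells the fullwidth form of smatsumoto11 with escapes.

```csharp
using System;
using System.Globalization;

public static class WidthInsensitive
{
    // Fullwidth form of "smatsumoto11", written with escapes so the
    // source file's encoding cannot silently alter it.
    public const string FullwidthSample =
        "\uFF53\uFF4D\uFF41\uFF54\uFF53\uFF55\uFF4D\uFF4F\uFF54\uFF4F\uFF11\uFF11";

    // string.Compare returns 0 when the two strings are considered equal
    // under the given culture and options.
    public static bool AreEqual(string a, string b, CultureInfo culture = null)
    {
        return string.Compare(a, b,
                              culture ?? CultureInfo.InvariantCulture,
                              CompareOptions.IgnoreWidth) == 0;
    }
}
```

Calling `WidthInsensitive.AreEqual(WidthInsensitive.FullwidthSample, "smatsumoto11")` should report a match, while a string that differs in case should not, because only width is ignored here.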


Before doing the comparison, you could try to "Normalize" your strings:

Returns a new string whose textual value is the same as this string, but whose binary representation is in the specified Unicode normalization form.

Some Unicode characters have multiple equivalent binary representations consisting of sets of combining and/or composite Unicode characters. The existence of multiple representations for a single character complicates searching, sorting, matching, and other operations.
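
The quoted documentation does not say which normalization form to use. A minimal sketch, assuming form KC (compatibility composition), which folds fullwidth letters and digits to their ordinary counterparts; `NormalizationDemo` and `NormalizedEquals` are hypothetical names. Note that form KC also folds other compatibility characters (ligatures, for example), so it equates more than just width.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Normalize both sides to form KC, then compare the results ordinally.
    public static bool NormalizedEquals(string a, string b)
    {
        string na = a.Normalize(NormalizationForm.FormKC);
        string nb = b.Normalize(NormalizationForm.FormKC);
        return string.Equals(na, nb, StringComparison.Ordinal);
    }
}
```

The comparison stays case sensitive, so this addresses only the width difference in the question.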


My machine says that s1 is in MS Mincho.

MS Mincho (MS 明朝) - distributed with Japanese version of Windows 3.1 or later, some versions of Internet Explorer 3 Japanese Font Pack, all regions in Windows XP, Microsoft Office v.X to 2004.

The following is totally obsoleted by the answer by Arnout.

I know of a trick that works like //TRANSLIT in iconv and that seems to work here.

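        string s1 = "ｓｍａｔｓｕｍｏｔｏ１１"; // fullwidth characters
        string s2 = "smatsumoto11";             // ordinary ASCII characters

        // Encode s1 with a legacy code page, then decode those bytes as ASCII.
        string conv = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(s1));

        if (conv == s2) Console.WriteLine("They are the same!");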

One day I really have to try to find out how this works...


While the accepted answer works, and is correct with regards to the main issue being "wide" characters, there are a few misconceptions and technicalities in the Question that should be addressed in order to have a better understanding of what is really going on here, both in .NET and in SQL Server.

First:

I have two strings one with a double-byte value and the other is a single byte-one.

No, you don't. You have two Unicode strings, encoded as UTF-16 Little Endian (which is how all of Windows and .NET work). And while in practical terms most characters you encounter are double-byte, that only covers 62,000 - 63,000 (or so) characters (i.e. the Code Points between U+0000 and U+FFFF -- or 0 - 65,535 -- that are "valid" characters). But Unicode allows for just over 1.1 million Code Points to be mapped, and currently has just over 260,000 of those Code Points already mapped. The Code Points above U+FFFF / 65,535, known as Supplementary Characters, are each encoded in UTF-16 as a set of two double-byte values known as a Surrogate Pair. So while they are less frequently used, the majority of Unicode Code Points actually take 4 bytes in UTF-16.
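
A small sketch of how this can be observed from C#; `Utf16Demo` and its two methods are hypothetical helper names. It reports how many bytes a string takes in UTF-16 and how many Code Points it holds, counting each Surrogate Pair once.

```csharp
using System;
using System.Text;

public static class Utf16Demo
{
    // Bytes the string occupies when encoded as UTF-16 (no byte order mark).
    public static int Utf16ByteCount(string s)
    {
        return Encoding.Unicode.GetByteCount(s);
    }

    // Number of Unicode Code Points, counting each Surrogate Pair once.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A pair spans two char values; skip its second half.
            if (char.IsSurrogatePair(s, i)) i++;
            count++;
        }
        return count;
    }
}
```

Both "a" and its fullwidth form "\uFF41" take 2 bytes and are one Code Point each, so the two strings in the question differ in which Code Points they contain, not in byte width. A Supplementary Character such as "\U0001F600" takes 4 bytes and has a `Length` of 2, yet is still a single Code Point.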

Second:

The string comparison result returns false, how do I get them to compare correctly

The letters in s1 = "ｓｍａｔｓｕｍｏｔｏ１１" are known as "Fullwidth" characters. You can see the full list of them here:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:East_Asian_Width=Fullwidth:]

Some explanation as to why there are different widths in the first place can be found here:

http://unicode-table.com/en/blocks/halfwidth-and-fullwidth-forms/

If you want to compare the two strings in the Question such that they are equal, you can either use the String.Compare(String, String, CultureInfo, CompareOptions) method as mentioned in @Arnout's answer, or you can use CompareInfo.Compare(String, String, CompareOptions) as follows:

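// CompareInfo.Compare is an instance method, so get a CompareInfo from a culture first.
CultureInfo.CurrentCulture.CompareInfo.Compare(s1, s2, CompareOptions.IgnoreWidth)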

Third:

In the same scenario, if you have a nvarchar column in SQL server which contains the value ｓｍａｔｓｕｍｏｔｏ１１, a query to fetch the data with the where condition having the string smatsumoto11 will return the same row.

This is a potentially dangerous way of thinking about string comparisons. There is no single way that strings compare in pretty much any database, unless the strings are limited to 7-bit ASCII (values 0 - 127), which does not even involve Code Pages, and I don't know if that is even an option. Comparisons are based on the particular LCID / Locale / Culture / Collation. The default Collation in SQL Server (in the US at least) is SQL_Latin1_General_CP1_CI_AS, which is Case Insensitive and Accent Sensitive. It also uses Code Page 1252 (which affects CHAR / VARCHAR data, not NCHAR / NVARCHAR) and the "en-US" culture. Collations for other cultures / LCIDs might not equate Fullwidth and "half-width" characters. And Collations that have _WS in their name definitely would not equate these two strings, since _WS stands for "Width Sensitive". Width-sensitive comparison is also the default in .NET if you don't specify the CompareOptions.IgnoreWidth option.

If you run the following query to find the Collations that have _WS in their name, you will find that 1776 out of 3885 total Collations are Width Sensitive and would not match these two strings (at least in SQL Server 2012). Of course, there are also 262 binary Collations (i.e. names ending in either the deprecated _BIN or the preferred _BIN2) that would also not equate these strings, but that isn't an issue of width sensitivity.

SELECT *
FROM sys.fn_helpcollations()
WHERE [name] LIKE N'%[_]WS%'
ORDER BY [name];
-- 1776 out of 3885 on SQL Server 2012

Also, as I just mentioned, the unfortunate (and deprecated) default Collation of SQL_Latin1_General_CP1_CI_AS, or even the better version of Latin1_General_100_CI_AS, is Case INsensitive. The strings that you are comparing are all lower-case, so they do equate when using just CompareOptions.IgnoreWidth. But if you want to emulate those particular Collations in SQL Server, the default Case Sensitive behavior of .NET would not match the SQL Server behavior. To better match the SQL Server behavior (at least for those Collations, or any marked as having _CI and not having _WS), you would need to also include the CompareOptions.IgnoreCase option as follows:

CultureInfo.CurrentCulture.CompareInfo.Compare(s1, s2, CompareOptions.IgnoreWidth | CompareOptions.IgnoreCase)

// or

String.Compare(s1, s2, CultureInfo.CurrentCulture, 
               CompareOptions.IgnoreWidth | CompareOptions.IgnoreCase)
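
The mapping between Collation name flags and CompareOptions described above could be sketched roughly as follows. `CollationOptions`, `FromCollationName`, and `AreEqual` are hypothetical names, not part of .NET or SQL Server. The mapping is an approximation based only on the _CI / _AI / _KS / _WS / _BIN markers in the name: it treats binary Collations as ordinal comparison and does not reproduce other SQL Server rules such as trailing-space handling.

```csharp
using System;
using System.Globalization;

public static class CollationOptions
{
    // Rough translation of the sensitivity flags in a SQL Server Collation
    // name into CompareOptions. An approximation, not an exact emulation.
    public static CompareOptions FromCollationName(string collation)
    {
        // Trailing "_" lets every flag be matched as "_XX_", even the last one.
        string name = collation.ToUpperInvariant() + "_";

        // Binary Collations are approximated by ordinal comparison.
        if (name.Contains("_BIN_") || name.Contains("_BIN2_"))
            return CompareOptions.Ordinal;

        CompareOptions options = CompareOptions.None;
        if (name.Contains("_CI_")) options |= CompareOptions.IgnoreCase;      // case insensitive
        if (name.Contains("_AI_")) options |= CompareOptions.IgnoreNonSpace;  // accent insensitive
        if (!name.Contains("_KS_")) options |= CompareOptions.IgnoreKanaType; // no _KS: kana insensitive
        if (!name.Contains("_WS_")) options |= CompareOptions.IgnoreWidth;    // no _WS: width insensitive
        return options;
    }

    // Compare two strings under the options derived from a Collation name.
    public static bool AreEqual(string a, string b, string collation, CultureInfo culture)
    {
        return culture.CompareInfo.Compare(a, b, FromCollationName(collation)) == 0;
    }
}
```

For SQL_Latin1_General_CP1_CI_AS this yields IgnoreCase, IgnoreKanaType, and IgnoreWidth, so the two strings in the question compare equal; for a name containing both _CS and _WS (with _KS) it yields CompareOptions.None, which matches the default .NET behavior.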

Additional Resources:

Comparing Strings in the .NET Framework

Best Practices for Using Strings in the .NET Framework
