Unicode and performance


I am in the process of migrating a large-scale web service to be compatible with international characters. It is a Tomcat/Spring MVC/SQL Server stack. The migration itself was relatively straightforward: we made a few settings changes in Tomcat to force default use of UTF-8 in the response, changed some Java code to use the encoding, and migrated a few VARCHAR columns to NVARCHAR, followed by a healthy dose of unit/functional tests.
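For illustration, the column part of the migration amounts to something like the following minimal sketch (the table and column names here are hypothetical, not the actual schema):

    -- Hedged sketch of the VARCHAR -> NVARCHAR migration step;
    -- dbo.Customer / DisplayName are illustrative names only.
    ALTER TABLE dbo.Customer
        ALTER COLUMN DisplayName NVARCHAR(200) NOT NULL;

    -- From now on, Unicode literals (and parameters) need the N prefix;
    -- without it the value is silently narrowed to the database code page.
    UPDATE dbo.Customer
    SET DisplayName = N'Müller, José'
    WHERE CustomerId = 42;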

Another person on my team wants a load test now to make sure none of the changes adversely affects system performance. The individual components of this transition described above don't really hint at any performance changes and, frankly, I don't think a load test is strictly necessary, based on my limited knowledge. I plan to do it anyway, but my question, then, is this: are there any performance gotchas one might see in such a migration? Is there anything specific to a different character encoding that might change the performance of a system?

The only thing I could think of would be heavy string comparison and sorting, etc. Any ideas?


You should consider upgrading to SQL Server 2008 R2 because it offers Unicode Compression:

Unicode compression in SQL Server 2008 R2 uses an implementation of the Standard Compression Scheme for Unicode (SCSU) algorithm to compress Unicode values that are stored in row or page compressed objects. For these compressed objects, Unicode compression is automatic for nchar(n) and nvarchar(n) columns. The SQL Server Database Engine stores Unicode data as 2 bytes, regardless of locale. This is known as UCS-2 encoding. For some locales, the implementation of SCSU compression in SQL Server 2008 R2 can save up to 50 percent in storage space.
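For a hypothetical dbo.Customer table, enabling the prerequisite row compression (which is what triggers the automatic SCSU compression for nchar(n)/nvarchar(n) columns) might look like this sketch:

    -- Sketch, assuming SQL Server 2008 R2+: Unicode (SCSU) compression
    -- applies automatically to nchar(n)/nvarchar(n) columns once the
    -- object itself is ROW or PAGE compressed. dbo.Customer is illustrative.
    ALTER TABLE dbo.Customer
        REBUILD WITH (DATA_COMPRESSION = ROW);

    -- Estimate the savings before committing to the rebuild:
    EXEC sp_estimate_data_compression_savings
        @schema_name      = 'dbo',
        @object_name      = 'Customer',
        @index_id         = NULL,
        @partition_number = NULL,
        @data_compression = 'ROW';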

The greatest gotcha you're going to encounter is the Data Type Precedence rules. Because NVARCHAR has a higher precedence than VARCHAR, any expression that mixes the two is coerced to NVARCHAR. In practical terms, that means a join condition between column A and column B that used to compare two VARCHAR columns and resulted in an index seek now effectively compares CAST(A AS NVARCHAR) with B (assuming only B was changed to NVARCHAR), and that is no longer SARGable, so it causes a table scan. This problem can appear in joins, in WHERE clauses, in parameter types, and in many other places. It needs to be considered carefully, because the resulting performance degradation is tremendous (full scan vs. seek).
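A minimal repro of the effect, with hypothetical table names (the scan behavior assumes a SQL collation such as SQL_Latin1_General_CP1_CI_AS; under Windows collations the optimizer can sometimes still salvage a range seek):

    -- dbo.Old kept its VARCHAR key; dbo.New was migrated to NVARCHAR.
    CREATE TABLE dbo.Old (Code VARCHAR(20)  NOT NULL PRIMARY KEY);
    CREATE TABLE dbo.New (Code NVARCHAR(20) NOT NULL PRIMARY KEY);

    -- NVARCHAR outranks VARCHAR, so the optimizer wraps Old.Code in
    -- CONVERT_IMPLICIT(nvarchar, ...); the predicate is no longer
    -- SARGable against Old's index and the seek degrades to a scan.
    SELECT o.Code
    FROM dbo.Old AS o
    JOIN dbo.New AS n ON o.Code = n.Code;

    -- The same trap with a literal or parameter: N'...' is NVARCHAR.
    SELECT Code FROM dbo.Old WHERE Code = N'ABC123';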


I only have this anecdote:

At my former company we ran into an issue where a text field in the database (ASCII/VARCHAR) was being matched against a Unicode string in a query. This caused SQL Server to switch to table scans rather than using the usual index, because it could not prove that the string would always be translatable to ASCII. This was a significant performance hit for us.
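One hedged workaround, assuming the incoming values are known to fit in the column's code page, is to narrow the parameter explicitly so the VARCHAR index stays usable (table and column names are illustrative):

    -- Hypothetical VARCHAR column probed with a Unicode parameter,
    -- e.g. as sent by a Java/JDBC layer.
    DECLARE @name NVARCHAR(100) = N'jsmith';

    -- Casting the parameter, not the column, keeps the predicate
    -- SARGable, so the index on Username can still be sought.
    SELECT UserId
    FROM dbo.Users
    WHERE Username = CAST(@name AS VARCHAR(100));

On a Java stack this often originates in the Microsoft JDBC driver, whose sendStringParametersAsUnicode connection property defaults to true and sends every string parameter as NVARCHAR.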


Character encodings, as long as the conversion is done right, shouldn't be an issue. Unicode is a lot more complex, but you don't have to think about that; somebody else already did. All you need to make sure of is that you don't convert arbitrary strings in a nonsensical manner.

What you'll see, however, is that all your string data takes up twice as much space. This does impact the heuristics SQL Server uses to create execution plans, and there are subtle issues with indexes that could change, but I wouldn't worry about that unless you have really, really large data sets.
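A quick way to see the doubling, for what it's worth:

    -- UCS-2 stores every character in 2 bytes, so the same text takes
    -- twice the space as NVARCHAR.
    SELECT DATALENGTH('hello')  AS varchar_bytes,   -- 5
           DATALENGTH(N'hello') AS nvarchar_bytes;  -- 10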
