I've got a requirement to take some XML and transform it into a fixed-width load file for loading into an SAP system. My algorithm works fine except for some weird European characters such as Ã: each instance of such a character adds an extra 1 to the reported string length. So, for example, the text Ãbcd would have a string-length($value) of 5 instead of 4.
This is a problem, because my code checks to see what the length of the property is, then subtracts that from the max-length of the fixed-length output format (i.e. for a 30-width field, if it read Ãbcd it would think it needed 25 spaces instead of 26).
Does anyone know of a better way to do this, or what I'm doing wrong in my algorithm?
Below are my xsl templates (for the most part... can't get them in here quite right...)
Template to Write out Property:
<xsl:param name="value"/>
<xsl:param name="width"/>
<!-- find the current length of the field-->
<xsl:variable name="valueWidth" select="string-length($value)" />
<xsl:variable name="difference" select="$width - $valueWidth" />
<xsl:if test="$difference > 0">
<xsl:value-of select="$value"/>
<!-- run this for loop x times outputting a space for each -->
<xsl:call-template name="for-loop-spaces">
<xsl:with-param name="count" select="$difference - 1" />
</xsl:call-template>
</xsl:if>
<xsl:if test="($difference &lt; 0)">
<!-- XPath string positions are 1-based, so start at 1 to keep exactly $width characters -->
<xsl:value-of select="substring($value,1,$width)"/>
</xsl:if>
<xsl:if test="$difference = 0">
<xsl:value-of select="$value"/>
</xsl:if>
</xsl:template>
For-loop-spaces template (it wouldn't copy-paste): outputs a space each time it's called. Accepts param "count". If count is greater than zero, it recursively calls itself with count-1 until 0.
Any input would be very useful :)
The problem is that combining diacritical marks can be used instead of single characters. This is what gives you the "wrong length".
See http://en.wikipedia.org/wiki/Combining_character for more info on those characters.
If you have XSLT 2, there is a built-in function to normalize them which should work: fn:normalize-unicode
For XSLT 1.0, you'd have to use some function to count the characters excluding the combining characters. One possibility may be the use of translate:
translate($input, '̴̵̶̷̸̡̢̧̨̛̖̗̘̙̜̝̞̟̠̣̤̥̦̩̪̫̬̭̮̯̰̱̲̳̹̺̻̼͇͈͉͍͎̀́̂̃̄̅̆̇̈̉̊̋̌̍̎̏̐̑̒̓̔̽̾̿̀́͂̓̈́͆͊͋͌̕̚ͅ͏͓͔͕͖͙͚͐͑͒͗͛ͣͤͥͦͧͨͩͪͫͬͭͮͯ͘͜͟͢͝͞͠͡', '')
Note that you'll have even more problems if you have Asian characters which are combined.
Quote from http://www.dpawson.co.uk/xsl/characters.html
However if the Unicode combining character is used and the input file has e' (where ' is really the combining acute character) then while any Unicode aware renderer is supposed to make this into an e acute for rendering, to an XML engine it is two characters, e and acute.
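The effect described above can be reproduced outside XSLT. The following is a minimal Python sketch, not part of the original stylesheet: it assumes the input contains A followed by U+0303 COMBINING TILDE, and the helper name visible_length is made up for illustration. It shows the decomposed form counting as 5, NFC normalization (what fn:normalize-unicode does by default) bringing it back to 4, and a count that skips combining marks, which is roughly what the translate() trick approximates.

```python
import unicodedata

composed = "\u00c3bcd"     # Ã as one precomposed code point
decomposed = "A\u0303bcd"  # A followed by U+0303 COMBINING TILDE


def visible_length(text: str) -> int:
    """Count code points, skipping combining marks (non-zero combining class)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# NFC composes base letter + combining mark into one code point where possible.
normalized = unicodedata.normalize("NFC", decomposed)

print(len(composed))               # 4
print(len(decomposed))             # 5
print(len(normalized))             # 4
print(visible_length(decomposed))  # 4
```

Normalization only helps where a precomposed character exists; skipping combining marks works for any base-plus-mark sequence but still counts code points, not display columns.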
string-length(), like all of XSLT/XPath, is character-based, not byte-based, so string-length("Ãbcd") should definitely give 4. If it gives 5 then either:
your Ã is actually two separate characters, one of them a combining tilde diacritical, and the result is actually correct even if it means the columns don't visually line up. But I'm guessing probably not, since the version you pasted here is a single composed character, U+00C3 LATIN CAPITAL LETTER A WITH TILDE; or,
your input XML has been read using the wrong encoding, actually being in UTF-8 (the default for XML) but having been read as something else, typically ISO-8859-1, making the U+00C3 character, represented by the byte sequence 0xC3,0x83, come out as two characters U+00C3,U+0083 (Ã followed by an invisible control character).
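The second case is easy to check by hand. This is a small Python sketch of the misreading, assuming the file really is UTF-8 and the reader wrongly assumes ISO-8859-1; the variable names are illustrative only.

```python
original = "\u00c3bcd"                 # Ãbcd, four characters

utf8_bytes = original.encode("utf-8")  # how the text sits in a UTF-8 file
misread = utf8_bytes.decode("iso-8859-1")  # decoded with the wrong charset

print(len(original))    # 4
print(len(utf8_bytes))  # 5
print(len(misread))     # 5
print([f"U+{ord(ch):04X}" for ch in misread[:2]])  # ['U+00C3', 'U+0083']
```

If the length your stylesheet sees matches the misread string, the fix belongs in how the XML is loaded rather than in the template.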
It's not just “weird European characters” you have to worry about; if you are getting Unicode wrong then all characters outside of the basic 7-bit ASCII set are going to get mangled, including many that even insular Americans like to use.
In any case there is the question of what encoding SAP wants for its FWV input format. It's all very well treating Ã as a single character and adding the right number of padding characters for one character, but if you then output to UTF-8 and SAP doesn't actually read UTF-8, it's still going to break the import.
You'll need to find out the encoding expected by the target SAP installation (if it's not UTF-8, cp1252 is another good guess to try), and whether the fixed columns of the format are based on Unicode characters or bytes. From this (related?) spec I believe they're actually based on bytes, in which case 5 would actually be the correct byte length, if your database is supposed to contain UTF-8.
Unfortunately XSLT is all about characters and doesn't give you the chance to work with bytes, so if the load file format is byte-based you'll have to either:
remove all non-ASCII characters, making the point moot, or
use another tool outside XSLT to do this processing, one that knows about bytes. To be honest this makes most sense to me: XSLT is ideal for XML-to-XML transforms and largely awful for other string processing tasks. Your template above could be made more readable and efficient if rewritten in a couple of lines of a modern scripting language like Python.
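As a rough idea of what that could look like, here is a Python sketch of the padding step. Everything in it is an assumption rather than a known SAP requirement: the function names pad_chars and pad_bytes are invented, fields are padded with spaces and over-long values are truncated (mirroring the template in the question), and the default encoding of UTF-8 is only a guess until the target system's expectation is confirmed.

```python
def pad_chars(value: str, width: int) -> str:
    """Fixed-width field measured in characters: truncate, then space-pad."""
    return value[:width].ljust(width)


def pad_bytes(value: str, width: int, encoding: str = "utf-8") -> bytes:
    """Fixed-width field measured in bytes of the given (assumed) encoding."""
    raw = value.encode(encoding)
    if len(raw) > width:
        # Cut at the byte limit, then drop any partial multi-byte character
        # left dangling at the end by decoding with errors ignored.
        raw = raw[:width].decode(encoding, errors="ignore").encode(encoding)
    return raw + b" " * (width - len(raw))


print(len(pad_chars("\u00c3bcd", 30)))  # 30 characters (26 spaces of padding)
print(len(pad_bytes("\u00c3bcd", 30)))  # 30 bytes (25 spaces of padding)
print(pad_bytes("ab\u00c3", 3))         # b'ab ' (Ã would not fit in 3 bytes)
```

Which of the two functions applies depends on whether the column widths in the target format are defined in characters or in bytes.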
Are you counting bytes or characters? The Ã you are mentioning is 1 character, but 2 bytes (when using UTF-8, which seems to be the case). Characters in UTF-8 can take 1-4 bytes.
If string-length counts bytes, the result is correct.
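The difference between the two counts is easy to see in a few lines of Python. This is only an illustration; the sample characters other than Ã are arbitrary picks to cover each UTF-8 length.

```python
# One character each, chosen to need 1, 2, 3 and 4 bytes in UTF-8.
samples = ["A", "\u00c3", "\u20ac", "\U0001F600"]  # A, Ã, €, 😀

char_counts = [len(ch) for ch in samples]
byte_counts = [len(ch.encode("utf-8")) for ch in samples]

print(char_counts)  # [1, 1, 1, 1]
print(byte_counts)  # [1, 2, 3, 4]
```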
This is not an XSLT issue, but probably an encoding issue of the output. How is your XSLT executed? Probably, you will have to change the settings for the output writer.
As Oded remarked, this might be an issue with the input reader encoding rather than the output encoding. According to the XPath specification, string-length counts characters, so the Ã may already have been decoded into more than one character before string-length ever sees it. Maybe the input is UTF-8 but your configuration reads it as a single-byte encoding?