Allow invalid HTML characters in XSLT transformation

Developer https://www.devze.com 2023-01-30 02:59 Source: Internet

I am using Saxon & XSLT to transform HTML documents, over which I have no control.

These documents may contain characters that should really be encoded, e.g. a raw trademark character (decimal 153) instead of an encoded ™ entity.

As it stands, Saxon is throwing the following exception during the transform, from HTMLEmitter:

else if (c >= 127 && c < 160) {
    // these control characters are illegal in HTML
    DynamicError err = new DynamicError(
            "Illegal HTML character: decimal " + (int) c);
    err.setErrorCode("SERE0014");
    throw err;
}

Is there any way to be more lenient and tell Saxon to let these characters through as they are? Alternatively, how do I configure Saxon to use the XMLEmitter instead of the HTMLEmitter?


That character IS invalid in HTML because it will not necessarily render as what you expect, depending on the user's code page. You want to use the correct code point, &#x2122; and make sure to use UTF-8 encoding.

EDIT: character-map

<xsl:character-map name="TM">
  <xsl:output-character character="&#153;" string="&#x2122;"/>
</xsl:character-map>
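If changing the stylesheet is not an option, a similar remapping could be applied in Java before the text reaches the serializer. The following is only a sketch under assumptions: `C1Remapper` and `remap` are hypothetical names (not part of Saxon), the stray characters are assumed to be Windows-1252 bytes that were read as code points 128 to 159, and the JDK in use is assumed to ship the `windows-1252` charset.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Saxon or any library.
class C1Remapper {
    // Assumption: the JDK provides this charset (mainstream JDKs do).
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Replaces each character in the range U+0080..U+009F with the character
     * that Windows-1252 assigns to the same byte value. All other characters
     * are copied unchanged. Byte values that Windows-1252 leaves undefined
     * come out as whatever the JDK decoder substitutes for them.
     */
    static String remap(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                // Reinterpret the code unit as a single Windows-1252 byte.
                out.append(new String(new byte[] {(byte) c}, CP1252));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = remap("Acme\u0099 widgets");
        // The character at index 4 is now the trademark sign.
        System.out.println((int) fixed.charAt(4)); // prints 8482
    }
}
```

Unlike the character map, which acts at serialization time, this runs on the text before the transform, so it only helps where the input strings are accessible from Java.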


Saxon is an XSLT processor, not an XML parser. If you get errors parsing input documents then it is the XML parser (and not Saxon) complaining and that means your input is not well-formed XML. On the Java platform if the input is HTML and not XML you might get away with using something like TagSoup http://home.ccil.org/~cowan/XML/tagsoup/ instead of an XML parser.

On the other hand, I agree with the comment already made: XML builds on and supports Unicode, so your input documents can use Unicode characters as long as they are properly encoded and declare that encoding in the XML declaration. In Unicode the code point of '™' is 8482, not 153. I suppose your input documents use a Windows code page like 1252; in that case they need to start with <?xml version="1.0" encoding="Windows-1252"?> to let the XML parser know.
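The difference between 153 and 8482 can be seen by decoding the same byte under two charsets. This is a minimal sketch: `DecodeDemo` and `decodeSingleByte` are made-up names for illustration, and it assumes the JDK provides the `windows-1252` charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative only; the class and method names are hypothetical.
class DecodeDemo {
    /** Decodes a single byte with the given charset and returns the resulting code point. */
    static int decodeSingleByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        // Read as ISO-8859-1, byte 0x99 becomes the control character U+0099.
        System.out.println(decodeSingleByte(0x99, StandardCharsets.ISO_8859_1)); // prints 153
        // Read as Windows-1252, the same byte becomes the trademark sign U+2122.
        System.out.println(decodeSingleByte(0x99, Charset.forName("windows-1252"))); // prints 8482
    }
}
```

This is why declaring the real encoding matters: a parser told the correct code page produces U+2122, which the HTML output method accepts.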


In addition to @Martin Honnen's answer pointing out that the Unicode code point for the character ™ is 8482, not 153, and @Jim Garrison's recommendation of xsl:character-map (if you can't correctly state the character set of your input source), here is the reason for the error report, from http://www.w3.org/TR/xslt-xquery-serialization/#HTML_CHARDATA :

Certain characters, specifically the control characters #x7F-#x9F, are legal in XML but not in HTML. It is a serialization error [err:SERE0014] to use the HTML output method when such characters appear in the instance of the data model. The serializer MUST signal the error.
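To find out which characters in a document would trigger this error, the range from the quoted rule can be checked directly before transforming. The sketch below is not Saxon code: `HtmlCharCheck`, `isIllegalInHtml`, and `firstIllegalIndex` are hypothetical names, and the check covers only the #x7F-#x9F range named in the specification text above.

```java
// Hypothetical pre-check, not part of Saxon.
class HtmlCharCheck {
    /** True for the control characters #x7F-#x9F that the HTML output method rejects. */
    static boolean isIllegalInHtml(int c) {
        return c >= 0x7F && c <= 0x9F;
    }

    /** Returns the index of the first offending character, or -1 if there is none. */
    static int firstIllegalIndex(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (isIllegalInHtml(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllegalIndex("Acme\u0099 widgets")); // prints 4
        System.out.println(firstIllegalIndex("Acme\u2122 widgets")); // prints -1
    }
}
```

A result other than -1 suggests the input was decoded with the wrong charset or genuinely contains control characters, which helps decide between fixing the encoding declaration and using a character map.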
