I receive an XML that I run through an XSLT process each day; however, the occasional special character causes this to break. I am looking for some utility that will clean the XML & replace special characters with correct html numeric encoding. Just need a utility or an idea.
Update from comments
The XML will sometimes include a special ch开发者_开发百科aracter such as ¢ rather than
¢
so I need a way to change the special character to the tag
If your XSLT code can't handle this input XML, then either the input isn't actually XML, or you're presenting it incorrectly to the XSLT processor. The most likely explanation is that the encoding of the file is not what the XML declaration at the start of the file says it is; or perhaps there isn't an XML declaration, so the processor is assuming UTF-8, but it's actually iso-8859-1. The solution may be as simple as adding an XML declaration to the start of the file to declare the encoding as iso-8859-1.
"Special" characters(Unicode characters not in ASCII) are valid XML, so you should really fix your parser. If that doesn't work, pipe your code through the following filter:
#!/usr/bin/env python
import sys
input = sys.stdin.read().decode('UTF-8')
for c in input:
sys.stdout.write('&#%04d;' % ord(c) if c >= 128 else c)
Replace UTF-8
with the document's encoding. Save the above code to xmlentities
, and call like
python xmlentities <broken.xml >fixed.xml
I can't reproduce this problem
This stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
With this input:
<t>¢</t>
Output:
<?xml version="1.0" encoding="UTF-16"?>
<t>¢</t>
精彩评论