UPDATE: I added an answer to this question which incorporates almost all of the suggestions that have been given. The original template in the code below needed 45605ms to process a real-world input document (English text about script programming). The revised template in the community wiki answer brought the runtime down to 605ms!
I'm using the following XSLT template for replacing a few special characters in a string with their escaped variants; it calls itself recursively using a divide-and-conquer strategy, eventually looking at every single character in a given string. It then decides whether the character should be printed as it is, or whether any form of escaping is necessary:
<xsl:template name="escape-text">
<xsl:param name="s" select="."/>
<xsl:param name="len" select="string-length($s)"/>
<xsl:choose>
<xsl:when test="$len >= 2">
<xsl:variable name="halflen" select="round($len div 2)"/>
<xsl:variable name="left">
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
<xsl:with-param name="len" select="$halflen"/>
</xsl:call-template>
</xsl:variable>
<xsl:variable name="right">
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
<xsl:with-param name="len" select="$halflen"/>
</xsl:call-template>
</xsl:variable>
<xsl:value-of select="concat($left, $right)"/>
</xsl:when>
<xsl:otherwise>
<xsl:choose>
<xsl:when test="$s = '&quot;'">
<xsl:text>"\""</xsl:text>
</xsl:when>
<xsl:when test="$s = '@'">
<xsl:text>"@"</xsl:text>
</xsl:when>
<xsl:when test="$s = '|'">
<xsl:text>"|"</xsl:text>
</xsl:when>
<xsl:when test="$s = '#'">
<xsl:text>"#"</xsl:text>
</xsl:when>
<xsl:when test="$s = '\'">
<xsl:text>"\\"</xsl:text>
</xsl:when>
<xsl:when test="$s = '}'">
<xsl:text>"}"</xsl:text>
</xsl:when>
<xsl:when test="$s = '&amp;'">
<xsl:text>"&amp;"</xsl:text>
</xsl:when>
<xsl:when test="$s = '^'">
<xsl:text>"^"</xsl:text>
</xsl:when>
<xsl:when test="$s = '~'">
<xsl:text>"~"</xsl:text>
</xsl:when>
<xsl:when test="$s = '/'">
<xsl:text>"/"</xsl:text>
</xsl:when>
<xsl:when test="$s = '{'">
<xsl:text>"{"</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:value-of select="$s"/>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
This template accounts for the majority of the runtime that my XSLT script needs. Replacing the above escape-text template with just
<xsl:template name="escape-text">
<xsl:param name="s" select="."/>
<xsl:value-of select="$s"/>
</xsl:template>
makes the runtime of my XSLT script go from 45 seconds to less than one second on one of my documents.
Hence my question: how can I speed up my escape-text template? I'm using xsltproc and I'd prefer a pure XSLT 1.0 solution. XSLT 2.0 solutions would be welcome too. External libraries might not be usable for this project, but I'd still be interested in any solutions that use them.
Another (complementary) strategy would be to terminate the recursion early, before the string length is down to 1, if the condition translate($s, $vChars, '') = $s is true. This should give much faster processing of strings that contain no special characters at all, which is probably the majority of them. Of course, the results will depend on how efficient xsltproc's implementation of translate() is.
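The early-exit test described above can be sketched as an extra first branch in the template's outer xsl:choose. This is only a minimal sketch and has not been benchmarked: $vChars is assumed to hold the special characters from the question, and the text output method is an assumption about how the result is written.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Assumption: the escaped result is written out as plain text. -->
  <xsl:output method="text"/>

  <!-- Characters that need escaping (same set as in the question). -->
  <xsl:variable name="vChars">"@|#\}&amp;^~/{</xsl:variable>

  <xsl:template match="text()" name="escape-text">
    <xsl:param name="s" select="."/>
    <xsl:param name="len" select="string-length($s)"/>
    <xsl:choose>
      <!-- Early exit: translate() deletes every special character, so an
           unchanged result means this substring can be copied as it is. -->
      <xsl:when test="translate($s, $vChars, '') = $s">
        <xsl:value-of select="$s"/>
      </xsl:when>
      <!-- At least one special character is present: split and recurse. -->
      <xsl:when test="$len > 1">
        <xsl:variable name="halflen" select="round($len div 2)"/>
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
          <xsl:with-param name="len" select="$halflen"/>
        </xsl:call-template>
        <xsl:call-template name="escape-text">
          <xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
          <xsl:with-param name="len" select="$len - $halflen"/>
        </xsl:call-template>
      </xsl:when>
      <!-- A single special character: wrap it in quotes, with an extra
           backslash in front of a double quote or a backslash. -->
      <xsl:otherwise>
        <xsl:text>"</xsl:text>
        <xsl:if test="$s = '&quot;' or $s = '\'">
          <xsl:text>\</xsl:text>
        </xsl:if>
        <xsl:value-of select="$s"/>
        <xsl:text>"</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

Because the test runs at every level of the recursion, any half that contains no special characters is emitted with a single translate() call instead of being split down to single characters.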
A very small correction made the template about 17 times faster in my tests.
There are additional improvements, but I guess this will suffice for now ... :)
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:my="my:my">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:variable name="vChars">"@|#\}&amp;^~/{</xsl:variable>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()" name="escape-text">
<xsl:param name="s" select="."/>
<xsl:param name="len" select="string-length($s)"/>
<xsl:choose>
<xsl:when test="$len >= 2">
<xsl:variable name="halflen" select="round($len div 2)"/>
<xsl:variable name="left">
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
<xsl:with-param name="len" select="$halflen"/>
</xsl:call-template>
</xsl:variable>
<xsl:variable name="right">
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
<xsl:with-param name="len" select="$halflen"/>
</xsl:call-template>
</xsl:variable>
<xsl:value-of select="concat($left, $right)"/>
</xsl:when>
<xsl:otherwise>
<xsl:choose>
<xsl:when test="not(contains($vChars, $s))">
<xsl:value-of select="$s"/>
</xsl:when>
<xsl:when test="$s = '&quot;'">
<xsl:text>"\""</xsl:text>
</xsl:when>
<xsl:when test="$s = '@'">
<xsl:text>"@"</xsl:text>
</xsl:when>
<xsl:when test="$s = '|'">
<xsl:text>"|"</xsl:text>
</xsl:when>
<xsl:when test="$s = '#'">
<xsl:text>"#"</xsl:text>
</xsl:when>
<xsl:when test="$s = '\'">
<xsl:text>"\\"</xsl:text>
</xsl:when>
<xsl:when test="$s = '}'">
<xsl:text>"}"</xsl:text>
</xsl:when>
<xsl:when test="$s = '&amp;'">
<xsl:text>"&amp;"</xsl:text>
</xsl:when>
<xsl:when test="$s = '^'">
<xsl:text>"^"</xsl:text>
</xsl:when>
<xsl:when test="$s = '~'">
<xsl:text>"~"</xsl:text>
</xsl:when>
<xsl:when test="$s = '/'">
<xsl:text>"/"</xsl:text>
</xsl:when>
<xsl:when test="$s = '{'">
<xsl:text>"{"</xsl:text>
</xsl:when>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
Here is a further improved version, based on @Dimitre's answer:
<xsl:template match="text()" name="escape-text">
<xsl:param name="s" select="."/>
<xsl:param name="len" select="string-length($s)"/>
<xsl:choose>
<xsl:when test="$len > 1">
<xsl:variable name="halflen" select="round($len div 2)"/>
<!-- no "left" and "right" variables necessary! -->
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
</xsl:call-template>
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:choose>
<xsl:when test="not(contains($vChars, $s))">
<xsl:value-of select="$s"/>
</xsl:when>
<xsl:when test="contains('\&quot;', $s)">
<xsl:value-of select="concat('&quot;\', $s, '&quot;')" />
</xsl:when>
<!-- all other cases can be collapsed, this saves some time -->
<xsl:otherwise>
<xsl:value-of select="concat('&quot;', $s, '&quot;')" />
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
Should be another tiny bit faster, but I have not benchmarked it. In any case, it's shorter. ;-)
For what it's worth, here's my current version of the escape-text template, which incorporates most of the (excellent!) suggestions people have given in response to my question. For the record, my original version took about 45605ms on average on my sample DocBook document. After that, the runtime decreased in multiple steps:
- Removing the left and right variables together with the concat() call brought the runtime down to 13052ms; this optimization was taken from Tomalak's answer.
- Moving the common case (the given character doesn't need any special escaping) first in the inner <xsl:choose> element brought the runtime further down to 5812ms. This optimization was first suggested by Dimitre.
- Aborting the recursion early by first testing whether the given string contains any of the special characters at all brought the runtime down to 612ms. This optimization was suggested by Michael.
- Finally, I couldn't resist doing a micro-optimization after reading a comment by Dimitre on Tomalak's answer: I replaced the <xsl:value-of select="concat('x', $s, 'y')"/> calls with <xsl:text>x</xsl:text><xsl:value-of select="$s"/><xsl:text>y</xsl:text>. This brought the runtime to about 606ms (an improvement of about 1%).
In the end, the function took 606ms instead of 45605ms. Impressive!
<xsl:variable name="specialLoutChars">"@|#\}&amp;^~/{</xsl:variable>
<xsl:template name="escape-text">
<xsl:param name="s" select="."/>
<xsl:param name="len" select="string-length($s)"/>
<xsl:choose>
<!-- Common case optimization:
no need to recurse if there are no special characters -->
<xsl:when test="translate($s, $specialLoutChars, '') = $s">
<xsl:value-of select="$s"/>
</xsl:when>
<!-- String length greater than 1, use DVC pattern -->
<xsl:when test="$len > 1">
<xsl:variable name="halflen" select="round($len div 2)"/>
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
<xsl:with-param name="len" select="$halflen"/>
</xsl:call-template>
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
<xsl:with-param name="len" select="$len - $halflen"/>
</xsl:call-template>
</xsl:when>
<!-- Special character -->
<xsl:otherwise>
<xsl:text>"</xsl:text>
<!-- Backslash and quot need backslash escape -->
<xsl:if test="$s = '&quot;' or $s = '\'">
<xsl:text>\</xsl:text>
</xsl:if>
<xsl:value-of select="$s"/>
<xsl:text>"</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
How about using EXSLT? The String functions in EXSLT have a function called replace. I think it is something that is supported by quite a few XSLT implementations.
Update: I fixed this so that it actually works; with the fix, it is no longer a speedup!
Building off @Wilfred's answer...
After fiddling with the EXSLT replace() function, I decided it was interesting enough to post another answer, even if it's not useful to the OP. It may well be useful to others.
It's interesting because of the algorithm: instead of the main algorithm worked on here (doing a binary recursive search, dividing in half at each recursion, pruned whenever a 2^nth substring has no special characters in it, and iterating over a choice of special characters when a length=1 string does contain a special character), Jeni Tennison's EXSLT algorithm puts the iteration over a set of search strings on the outside loop. Therefore on the inside of the loop, it is only searching for one string at a time, and can use substring-before()/substring-after() to divide the string, instead of blindly dividing in half.
[Deprecated: I guess that's enough to speed it up significantly. My tests show a speedup of 2.94x over @Dimitre's most recent one (avg. 230ms vs. 676ms).] I was testing using Saxon 6.5.5 in the Oxygen XML profiler. As input I used a 7MB XML document that was mostly a single text node, created from web pages about JavaScript, repeated. It sounds to me like that is representative of the task that the OP was trying to optimize. I'd be interested to hear what results others get, with their test data and environments.
Dependencies
This uses an XSLT implementation of replace which relies on exsl:node-set(). It looks like xsltproc supports this extension function (possibly an early version of it). So this may work out-of-the-box for you, @Frerich; and for other processors, as it did with Saxon.
However if we want 100% pure XSLT 1.0, I think it would not be too hard to modify this replace template to work without exsl:node-set(), as long as the 2nd and 3rd params are passed in as nodesets, not RTFs.
Here is the code I used, which calls the replace template. Most of the length is taken up with the verbose way I created search/replace nodesets... that could probably be shortened. (But you can't make the search or replace nodes attributes, as the replace template is currently written. You'll get an error about trying to put attributes under the document element.)
<xsl:stylesheet version="1.0" xmlns:str="http://exslt.org/strings"
xmlns:foo="http://www.foo.net/something" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:import href="lars.replace.template.xsl"/>
<foo:replacements>
<replacement>
<search>"</search>
<replace>"\""</replace>
</replacement>
<replacement>
<search>\</search>
<replace>"\\"</replace>
</replacement>
<replacement>
<search>@</search>
<replace>"@"</replace>
</replacement>
<replacement>
<search>|</search>
<replace>"|"</replace>
</replacement>
<replacement>
<search>#</search>
<replace>"#"</replace>
</replacement>
<replacement>
<search>}</search>
<replace>"}"</replace>
</replacement>
<replacement>
<search>&amp;</search>
<replace>"&amp;"</replace>
</replacement>
<replacement>
<search>^</search>
<replace>"^"</replace>
</replacement>
<replacement>
<search>~</search>
<replace>"~"</replace>
</replacement>
<replacement>
<search>/</search>
<replace>"/"</replace>
</replacement>
<replacement>
<search>{</search>
<replace>"{"</replace>
</replacement>
</foo:replacements>
<xsl:template name="escape-text" match="text()" priority="2">
<xsl:call-template name="str:replace">
<xsl:with-param name="string" select="."/>
<xsl:with-param name="search"
select="document('')/*/foo:replacements/replacement/search/text()"/>
<xsl:with-param name="replace"
select="document('')/*/foo:replacements/replacement/replace/text()"/>
</xsl:call-template>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
The imported stylesheet was originally this one.
However, as @Frerich pointed out, that never gave the correct output! That ought to teach me not to post performance figures without checking for correctness!
I can see in a debugger where it's going wrong, but I don't know whether the EXSLT template never worked, or if it just doesn't work in Saxon 6.5.5... either option would be surprising.
In any case, EXSLT's str:replace() is specified to do more than we need, so I modified it so as to
- require that the input parameters are already nodesets
- as a consequence, not require exsl:node-set()
- not sort the search strings by length (they're all one character, in this application)
- not insert a replacement string between every pair of characters when the corresponding search string is empty
Here is the modified replace template:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:str="http://exslt.org/strings">
<!-- By Lars Huttar
based on implementation of EXSLT str:replace() by Jeni Tennison.
http://www.exslt.org/str/functions/replace/str.replace.template.xsl
Modified by Lars not to need exsl:node-set(), not to bother sorting
search strings by length (in our application, all the search strings are of
length 1), and not to put replacements between every other character
when a search string is length zero.
Search and replace parameters must both be nodesets.
-->
<xsl:template name="str:replace">
<xsl:param name="string" select="''" />
<xsl:param name="search" select="/.." />
<xsl:param name="replace" select="/.." />
<xsl:choose>
<xsl:when test="not($string)" />
<xsl:when test="not($search)">
<xsl:value-of select="$string" />
</xsl:when>
<xsl:otherwise>
<xsl:variable name="search1" select="$search[1]" />
<xsl:variable name="replace1" select="$replace[1]" />
<xsl:choose>
<xsl:when test="contains($string, $search1)">
<xsl:call-template name="str:replace">
<xsl:with-param name="string"
select="substring-before($string, $search1)" />
<xsl:with-param name="search"
select="$search[position() > 1]" />
<xsl:with-param name="replace"
select="$replace[position() > 1]" />
</xsl:call-template>
<xsl:value-of select="$replace1" />
<xsl:call-template name="str:replace">
<xsl:with-param name="string"
select="substring-after($string, $search1)" />
<xsl:with-param name="search" select="$search" />
<xsl:with-param name="replace" select="$replace" />
</xsl:call-template>
</xsl:when>
<xsl:otherwise>
<xsl:call-template name="str:replace">
<xsl:with-param name="string" select="$string" />
<xsl:with-param name="search"
select="$search[position() > 1]" />
<xsl:with-param name="replace"
select="$replace[position() > 1]" />
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
One of the side benefits of this simpler template is that you could now use attributes for the nodes of your search and replace parameters. This would make the <foo:replacements> data more compact and easier to read IMO.
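A minimal, untested sketch of what such an attribute-based table might look like, assuming the modified template above is saved as lars.replace.template.xsl (the file name used in the earlier import). The attribute names search and replace are chosen here for illustration only.

```xml
<xsl:stylesheet version="1.0" xmlns:str="http://exslt.org/strings"
    xmlns:foo="http://www.foo.net/something"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- The modified replace template shown above. -->
  <xsl:import href="lars.replace.template.xsl"/>

  <!-- One element per character. The attribute names are illustrative.
       Single-quoted attribute values keep the literal double quotes readable. -->
  <foo:replacements>
    <replacement search='"' replace='"\""'/>
    <replacement search='\' replace='"\\"'/>
    <replacement search='@' replace='"@"'/>
    <replacement search='|' replace='"|"'/>
    <replacement search='#' replace='"#"'/>
    <replacement search='}' replace='"}"'/>
    <replacement search='&amp;' replace='"&amp;"'/>
    <replacement search='^' replace='"^"'/>
    <replacement search='~' replace='"~"'/>
    <replacement search='/' replace='"/"'/>
    <replacement search='{' replace='"{"'/>
  </foo:replacements>

  <xsl:template name="escape-text" match="text()" priority="2">
    <xsl:call-template name="str:replace">
      <xsl:with-param name="string" select="."/>
      <!-- Both node-sets are in document order, so the n-th @search
           still lines up with the n-th @replace. -->
      <xsl:with-param name="search"
          select="document('')/*/foo:replacements/replacement/@search"/>
      <xsl:with-param name="replace"
          select="document('')/*/foo:replacements/replacement/@replace"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Identity template for everything that is not a text node. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node()|@*"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The table shrinks from four lines per character to one, and the calling template changes only in its two select expressions.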
Performance: With this revised template, the job gets done in about 2.5s, vs. my 0.68s for my recent tests of the leading competitor, @Dimitre's XSLT 1.0 stylesheet. So it's not a speedup. But again, others have had very different test results than I have, so I'd like to hear what others get with this stylesheet.
@Frerich-Raabe has published a community wiki answer which combines the suggestions so far and achieves (on his data) a 76-fold speedup. Big congratulations to everybody!
I couldn't resist going further:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:variable name="specialLoutChars">"@|#\}&amp;^~/{</xsl:variable>
<xsl:key name="kTextBySpecChars" match="text()"
use="string-length(translate(., '&quot;@|#\}&amp;^~/', '')) = string-length(.)"/>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
<xsl:template match="text()[key('kTextBySpecChars', 'true')]" name="escape-text">
<xsl:param name="s" select="."/>
<xsl:param name="len" select="string-length($s)"/>
<xsl:choose>
<xsl:when test="$len >= 2">
<xsl:variable name="halflen" select="round($len div 2)"/>
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, 1, $halflen)"/>
<xsl:with-param name="len" select="$halflen"/>
</xsl:call-template>
<xsl:call-template name="escape-text">
<xsl:with-param name="s" select="substring($s, $halflen + 1)"/>
<xsl:with-param name="len" select="$len - $halflen"/>
</xsl:call-template>
</xsl:when>
<xsl:when test="$len = 1">
<xsl:choose>
<!-- Common case: the character at hand needs no escaping at all -->
<xsl:when test="not(contains($specialLoutChars, $s))">
<xsl:value-of select="$s"/>
</xsl:when>
<xsl:when test="$s = '&quot;' or $s = '\'">
<xsl:text>"\</xsl:text>
<xsl:value-of select="$s"/>
<xsl:text>"</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:text>"</xsl:text>
<xsl:value-of select="$s"/>
<xsl:text>"</xsl:text>
</xsl:otherwise>
</xsl:choose>
</xsl:when>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
This transformation achieves (on my data) a further speedup of 1.5 times. So the total speedup should be more than 100 times.
OK, I'll chip in. Though not as interesting as optimizing the XSLT 1.0 version, you did say that XSLT 2.0 solutions are welcome, so here's mine.
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template name="escape-text" match="text()" priority="2">
<xsl:variable name="regex1">[@|#}&amp;^~/{]</xsl:variable>
<xsl:variable name="replace1">"$0"</xsl:variable>
<xsl:variable name="regex2">["\\]</xsl:variable>
<xsl:variable name="replace2">"\\$0"</xsl:variable>
<xsl:value-of select='replace(replace(., $regex2, $replace2),
$regex1, $replace1)'/>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="node()|@*"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
This just uses a regexp replace() to replace \ or " with "\\" or "\"" respectively, composed with another regexp replace() to surround any of the other escapable characters with quotes.
In my tests, this performs worse than Dimitre's most recent XSLT 1.0 offering, by a factor of more than 2. (But I made up my own test data, and other conditions may be idiosyncratic, so I'd like to know what results others get.)
Why the slower performance? I can only guess it's because searching for regular expressions is slower than searching for fixed strings.
Update: using analyze-string
As per @Alejandro's suggestion, here it is using analyze-string:
<xsl:template name="escape-text" match="text()" priority="2">
<xsl:analyze-string select="." regex='([@|#}}&amp;^~/{{])|(["\\])'>
<xsl:matching-substring>
<xsl:choose>
<xsl:when test="regex-group(1)">"<xsl:value-of select="."/>"</xsl:when>
<xsl:otherwise>"\<xsl:value-of select="."/>"</xsl:otherwise>
</xsl:choose>
</xsl:matching-substring>
<xsl:non-matching-substring><xsl:value-of select="."/></xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
While this seems like a good idea, unfortunately it does not give us a performance win: In my setup, it consistently takes about 14 seconds to complete, versus 1 - 1.4 sec for the replace() template above. Call that a 10-14x slowdown. :-( This suggests to me that breaking and concatenating lots of big strings at the XSLT level is a lot more expensive than traversing a big string twice in a built-in function.