<![CDATA[ and ]]> are not allowed inside a <![CDATA[ … ]]> block. That is understandable.
Now, I have to transmit user-entered data inside a <![CDATA[ … ]]> block, and a malicious user might enter either <![CDATA[ or ]]> or both.
The question is: what is the preferred way to handle this situation?
- Strip <![CDATA[ and ]]>?
- Replace them with spaces?
- Smack the user with an error message?
- Or is there an official way of actually transmitting them?
A CDATA section can technically contain another starting tag -- <![CDATA[ -- it's just interpreted as character data. What it can't contain is ]]>. The usual approach is just to split the CDATA at ]]> in the user-supplied data when encoding. From Wikipedia:
A CDATA section cannot contain the string "]]>" and therefore it is not possible for a CDATA section to contain nested CDATA sections. The preferred approach to using CDATA sections for encoding text that contains the triad "]]>" is to use multiple CDATA sections by splitting each occurrence of the triad just before the ">". For example, to encode "]]>" one would write:
<![CDATA[]]]]><![CDATA[>]]>
This means that to encode "]]>" in the middle of a CDATA section, replace all occurrences of "]]>" with the following:
]]]]><![CDATA[>
This effectively stops and restarts the CDATA section.
[End Wikipedia quote]
See what that's doing? Effectively, what you end up with is:
<![CDATA[ ]] ]]>
<![CDATA[ > ]]>
(Spaces added for emphasis.) So, you get the ]]> encoded as a ]] next to a > -- when put back together during decoding by your XML processor, you'll end up with the ]]> as character data, but a ]]> never actually occurs in your CDATA section.
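If you do end up building the CDATA section by hand, the split-and-restart trick above can be sketched in a few lines of Python. This is only an illustration: the helper name to_cdata and the wrapper element r are made up for the example, and the round trip relies on the standard library's xml.etree.ElementTree parser joining adjacent CDATA sections back into one text node.

```python
import xml.etree.ElementTree as ET


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting every ']]>' across two sections.

    Hypothetical helper for illustration; not part of any library.
    """
    # End the current section after ']]' and reopen it before '>'.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


# User input containing both the opening and the closing marker.
hostile = "a ]]> b <![CDATA[ c"
document = "<r>" + to_cdata(hostile) + "</r>"

# The parser concatenates the adjacent CDATA sections into one string.
decoded = ET.fromstring(document).text
```

After parsing, decoded should equal the original hostile string, which is the point of the technique: the receiver sees the user's text unchanged, and the opening marker needed no special treatment at all.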
However, there shouldn't be any need, in this day and age, for you to be worrying about this. Whatever tool/library you're using to create XML should simply manage this for you, and if you throw character data into an element of your XML, the conversion to character data should be done automatically in the way the XML library sees fit, with all the necessary escaping, without you having to think about it.
It's good to be concerned about malicious user data, but the best way to deal with it in this case is to properly use a mature library where someone's already been concerned about it for you.
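As a rough sketch of what "let the library handle it" looks like, here is the same hostile input pushed through Python's standard xml.etree.ElementTree. The element name message and the function build_message are assumptions made for the example. Note that this serializer does not emit CDATA at all; it writes the text as escaped character data, which carries the same information.

```python
import xml.etree.ElementTree as ET


def build_message(user_text: str) -> str:
    """Serialize user text as the content of a single element.

    'message' is a hypothetical element name chosen for this sketch.
    """
    root = ET.Element("message")
    root.text = user_text  # assigned as-is; escaping happens on output
    return ET.tostring(root, encoding="unicode")


hostile = "a ]]> b <![CDATA[ c & d"
serialized = build_message(hostile)

# Parsing the serialized form gives back exactly what the user typed.
round_tripped = ET.fromstring(serialized).text
```

Neither marker survives literally in the serialized output, yet the parsed text matches the input, so there is nothing to strip and no reason to reject what the user typed.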
I think you are thinking about CDATA sections in the wrong way. CDATA stands for "character data", and the CDATA syntax is simply syntax for a block of data that shouldn't be interpreted as markup. CDATA sections are useful for embedding XML documents inside another XML document. However, when including character data (i.e. text) in a document, enclosing it in a CDATA section shouldn't change its meaning compared with simply encoding it as text data (possibly with certain characters escaped).
The short version of this is that your application shouldn't care whether the data is encoded as CDATA or not. If the text you are encoding isn't overly heavy with XML-like syntax then you may be better off simply escaping the & and < characters, something that your XML API will probably do for you anyway. For example, the InnerText property of XmlNode will escape characters as required.
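For anyone not on .NET, the same escaping can be sketched with Python's standard xml.sax.saxutils module. This is an illustration rather than an equivalent of InnerText, and the wrapper element r is just a placeholder. The escape function also rewrites > as &gt;, so a literal ]]> never appears in the output.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "a < b & c ]]> d"

# escape() replaces '&', '<' and '>' with their entity references.
escaped = escape(raw)

# Embedded as ordinary element content, no CDATA section is required.
document = "<r>" + escaped + "</r>"
parsed = ET.fromstring(document).text
```

Here escaped comes out as a &lt; b &amp; c ]]&gt; d, and parsing the document returns the original string.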
If you still want to use CDATA sections (escaping a large XML fragment may overly inflate the size of the resulting document) then you only need to escape the CDATA end marker (]]>), for example by simply replacing ]]> with ]]]]><![CDATA[>.
Within a CDATA section, replace all ]]> with ]]]]><![CDATA[>.
Use character references instead of CDATA when you have to include that string.