开发者

OOXML SDK illegal character replacements

开发者 https://www.devze.com 2023-03-03 03:31 出处:网络
I am having an issue with creating an XLSX document with the Open XML SDK 2.0 from MS. My issue is that I need to display some of these illegal characters in the excel sheet, but if I just add them t

I am having an issue with creating an XLSX document with the Open XML SDK 2.0 from MS.

My issue is that I need to display some of these illegal characters in the excel sheet, but if I just add them to the sheet, the document will not load.

I am using this function

    private static string ProcessString(string str)
    {
        return System.Security.SecurityElement.Escape(str);
    }

Which will give me Tom's ball instead of Tom's ball. (Well I haven't figured out how to get the latter as the excel 开发者_开发技巧generated won't open.)

Anybody know how to make the illegal XML characters show using OOXML in an Excel sheet?

EDIT:

In function I am using to create a text cell is:

private static Cell CreateTextCell(string header, UInt32 index, string text)
{
    var c = new Cell { DataType = CellValues.String, CellReference = header + index };
    var cellValue = new CellValue(text);
    c.Append(cellValue);
    return c;
}

I know it has to do with illegal characters because when I didn't include a particular field in my text it worked, then when I included it, Excel would give me a parser error and a blank document.

The text that I deal with also happens to have HTML tags in it as well.

ps. lol, I just noticed that the markdown used parsed my HTML escape making my example look ridiculous.

edit 2:

Some example of input:

  • Cancer's Complexity: Are we Looking at the Wrong Levels to Develop Effective Interventions?

  • Prospective study of breast cancer risk in mutation-negative women from <i>BRCA1</i> or <i>BRCA2</i> mutation-positive families in the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer (kConFab).

  • Germline <em>BRCA2</em> mutations correlate with aggressive prostate cancer and adverse outcome.

The html formatting is basically so it displays on the web page. I should just strip off the basic formatting tags. But more importantly, I want the excel file to load and escaping the values is a sure way of doing just that.


Think about that everything contained in e.g. an Excel-file is ultimately persisted as XML. So if you have any invalid XML-chars that you'd like to persist in the file, you need to excape them. It is not really an OOXML issue - it is an XML issue.

So if you have input text like

 Germline <em>BRCA2</em> mutations correlate with aggressive prostate cancer and adverse outcome.

You'd have to convert the angle brackets to

 Germline & lt;em& gt;BRCA2& lt;/em& gt; mutations correlate with aggressive prostate cancer and adverse outcome.

(space added to illustrate the brackets)

See String escape into XML for a few ways to do this.

Also, to see how Microsoft Office does this, add the trouble-some text to an Excel-document and save it. Then use the OpenXml SDK tool (comes with the SDK) to reflect the file and see how it is done.


Are you sure this is what is causing the problem? Can you add "normal" strings to the cells and open it?

AFAIK the apostrophe character is not an illegal XML character.

If you look in the OOXML specification in section 22.9.2.19 ST_Xstring (Escaped String) (the data type for strings in cells) you will see the following explanation:

*22.9.2.19 ST_Xstring (Escaped String) String of characters with support for escaped invalid-XML characters. For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format xHHHH, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it must be escaped as x0008. end example]*


Another thing to be careful of. XML does not have built-in support for all of the character named entities that we are accustomed to using in HTML. There are a small number of always-understood ones in XML (i.e., &lt; &gt; &amp;). Since there are no DTDs in OOXML, you can't define more named entities. Instead you have to introduce anything else into a string using a character entity (e.g., &#x20; for a space, &#160; for &nbsp; etc.) or else directly enter the Unicode character into the string.

You can use &#034; for &quote; and &#039; for apostrophe if you need to get one of those into an attribute value that is surrounded by the same kind of quote symbol.

There are some Unicode code points that are flat-out forbidden in XML data streams. To insert those codes into the value of a string, OOXML has its own escape mechanism for those codes and those codes only. I don't think that provision is involved in this question.

0

精彩评论

暂无评论...
验证码 换一张
取 消