So right now I am using the SAX parser in Java to parse the "document.xml" file located within a .docx file's archive. Below is a sample of what I am trying to parse...
Sample XML Document
<w:pStyle w:val="Heading2" />
</w:pPr>
<w:bookmarkStart w:id="0" w:name="_Toc258435889" />
<w:bookmarkStart w:id="1" w:name="_Toc259085121" />
<w:bookmarkStart w:id="2" w:name="_Toc259261685" />
<w:r w:rsidRPr="00415FD6">
<w:t>Text To Extract</w:t>
</w:r>
<w:bookmarkEnd w:id="0" />
<w:bookmarkEnd w:id="1" />
<w:bookmarkEnd w:id="2" />
Right now, I know how to pull out attribute values; that's not hard. However, I do not know how to get at the actual text inside the nodes. Does anyone have any ideas or prior experience with this? Thank you in advance.
Read this article on SAX parsing (it is old but still valid), and pay particular attention to how the characters method is implemented. It is very unintuitive and trips everybody up: you will get multiple calls to characters for what seems like no good reason.
Also the Java tutorial on SAX has a short explanation of the characters method:
Parsers are not required to return any particular number of characters at one time. A parser can return anything from a single character at a time up to several thousand and still be a standard-conforming implementation. So if your application needs to process the characters it sees, it is wise to have the characters() method accumulate the characters in a java.lang.StringBuffer and operate on them only when you are sure that all of them have been found.
In your case (XML with no mixed content) that means storing the results of multiple characters() calls until the next call to endElement.
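A minimal sketch of that accumulate-then-flush pattern, assuming the default (non-namespace-aware) SAXParserFactory so that the element arrives with the qName "w:t". The class name RunTextHandler and the helper extract are made up for illustration, and the sample string in main is a trimmed-down stand-in for the real document.xml:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text content of every <w:t> element.
class RunTextHandler extends DefaultHandler {
    private final List<String> texts = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean insideText = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        // Assumes a non-namespace-aware parser, so qName carries the "w:" prefix.
        if ("w:t".equals(qName)) {
            insideText = true;
            buffer.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so only append here.
        if (insideText) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The text is complete only once the closing tag is reported.
        if ("w:t".equals(qName)) {
            texts.add(buffer.toString());
            insideText = false;
        }
    }

    List<String> getTexts() {
        return texts;
    }

    // Hypothetical convenience method: parse an XML string and return the collected text.
    static List<String> extract(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        RunTextHandler handler = new RunTextHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getTexts();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
                + "<w:r><w:t>Text To Extract</w:t></w:r></w:p>";
        System.out.println(extract(xml));
    }
}
```

For the real file you would hand the parser an InputStream for the document.xml entry of the .docx archive instead of a string. A StringBuilder is used here rather than the StringBuffer mentioned in the tutorial, since the handler is only touched from one thread.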
See the characters() ContentHandler method. Read the javadoc carefully - you can get multiple calls when you might expect only one.