I have a bunch of XML files, along with the DTD, that each have a <TEXT> section. The DTD for the TEXT element looks like this:
<!ELEMENT TEXT - - (AGENCY* | ACTION* | SUMMARY* | DATE* | FOOTNAME* | FURTHER* | SIGNER* | SIGNJOB* | FRFILING* | BILLING* | FOOTNOTE* | FOOTCITE* | TABLE* | ADDRESS* | IMPORT* | #PCDATA)+ >
Here is what an example XML file would look like:
<ROOT>
...
<TEXT>
Some text that I want to extract
<SUMMARY> Some more text </SUMMARY>
<AGENCY>
An agency
<SIGNER> Bob Smith </SIGNER>
</AGENCY>
</TEXT>
...
</ROOT>
In the end, I want to extract
Some text that I want to extract Some more text An agency Bob Smith
However, the <TEXT> blocks differ from file to file in which elements they contain, the order of those elements, and how deeply they nest. Is there a way to do this in Java using DOM? I'd prefer DOM over SAX, but if SAX is much easier, then so be it.
Thanks in advance
An XSLT stylesheet would work:
UPDATE #2: I doubt this would work for you, since you're actually using SGML and not XML. The giveaway is that the element declaration in your question includes tag minimization indicators (the "- -" after the element name), which are not allowed in XML.
UPDATE: Modified the XML input and XSLT to only display the text in the <TEXT> structure.
XML INPUT
<ROOT>
<IGNORE>ignore this data</IGNORE>
<TEXT>
Some text that I want to extract
<SUMMARY> Some more text </SUMMARY>
<AGENCY>
An agency
<SIGNER> Bob Smith </SIGNER>
</AGENCY>
</TEXT>
<IGNORE>ignore this data</IGNORE>
</ROOT>
XSLT
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="normalize-space(/ROOT/TEXT)"/>
</xsl:template>
</xsl:stylesheet>
OUTPUT
Some text that I want to extract Some more text An agency Bob Smith
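Since the question asks about Java, here is a minimal sketch of running this stylesheet through the standard javax.xml.transform API. It assumes the input is well-formed XML (not SGML) and inlines the stylesheet as a string; the class name XsltTextExtractor and the method extract are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: applies the stylesheet from this answer to an XML string.
class XsltTextExtractor {
    // The stylesheet shown above, inlined so the sketch is self-contained.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:value-of select=\"normalize-space(/ROOT/TEXT)\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    static String extract(String xml) throws Exception {
        // Compile the stylesheet into a Transformer.
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        // Run the transformation and capture the text output.
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Calling XsltTextExtractor.extract(xml) with the XML input above as a string should return the single normalized line of text. For real files, a StreamSource built from a java.io.File can replace the StringReader.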
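Note: This XSLT only works if TEXT is a child of ROOT. If TEXT might be nested more deeply, you can change the "select" to select="normalize-space(//TEXT)". Because normalize-space() takes a string, XSLT 1.0 uses only the first TEXT element in document order when several match.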
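For the DOM route the question asks about, the same idea can be sketched without XSLT: getElementsByTagName finds TEXT at any depth, and getTextContent returns the concatenated text of all descendants. This assumes the files parse as well-formed XML; the class name DomTextExtractor is made up, and collapsing whitespace with a regular expression is my approximation of normalize-space.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts the text of every TEXT element using DOM.
class DomTextExtractor {
    static String extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        // Matches TEXT elements anywhere in the document, in document order.
        NodeList texts = doc.getElementsByTagName("TEXT");

        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < texts.getLength(); i++) {
            // getTextContent() concatenates the text of all descendant nodes.
            sb.append(texts.item(i).getTextContent()).append(' ');
        }

        // Trim the ends and collapse runs of whitespace into single spaces.
        return sb.toString().trim().replaceAll("\\s+", " ");
    }
}
```

Unlike the XSLT version, this sketch includes every TEXT element in the document, separated by a space, rather than only the first one.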
I'm not a big fan of SAX, but for this, I think it would work nicely.
Just define a SAX handler, but only override the characters method. Then append the received characters to a StringBuilder and you're done.
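import org.xml.sax.helpers.DefaultHandler;

public class TextExtractor extends DefaultHandler {
    private final StringBuilder sb = new StringBuilder();

    // The parser passes a shared buffer; only ch[start .. start+length-1] is valid data.
    @Override
    public void characters(char[] ch, int start, int length) {
        sb.append(ch, start, length);
    }

    // Returns all character data received so far.
    public String getText() {
        return sb.toString();
    }
}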
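The handler above collects character data from the whole document, so the IGNORE elements in the sample input would be included as well. A possible variant that restricts collection to TEXT is sketched below, assuming well-formed XML input and a parser that is not namespace-aware (the SAXParserFactory default), so that qName is the plain tag name. The class name TextSectionHandler and the static extract helper are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data only while inside a TEXT element.
class TextSectionHandler extends DefaultHandler {
    private final StringBuilder sb = new StringBuilder();
    private int depth = 0; // nesting depth below (and including) TEXT; 0 means outside

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at TEXT, then count every element nested inside it.
        if (depth > 0 || "TEXT".equals(qName)) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // When the outermost TEXT closes, add a separator before any later TEXT block.
        if (depth > 0 && --depth == 0) {
            sb.append(' ');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            sb.append(ch, start, length);
        }
    }

    // Trims the ends and collapses runs of whitespace into single spaces.
    public String getText() {
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Convenience driver: parses an XML string and returns the collected text.
    static String extract(String xml) throws Exception {
        TextSectionHandler handler = new TextSectionHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }
}
```

The depth counter is what makes this independent of which child elements appear, in what order, and how deeply they nest: everything between the opening and closing TEXT tags is collected.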