How can I extract all PCDATA (text) from an XML file in Java?

Developer https://www.devze.com 2023-03-06 03:26 Source: Internet
I have a bunch of XML files, along with the DTD, that each have a <TEXT> section. The DTD for the TEXT element looks like this:

<!ELEMENT TEXT - - (AGENCY* | ACTION* | SUMMARY* | DATE* | FOOTNAME* | FURTHER* | SIGNER* | SIGNJOB* | FRFILING* | BILLING* | FOOTNOTE* | FOOTCITE* | TABLE* | ADDRESS* | IMPORT* | #PCDATA)+ >

Here is what an example XML file would look like:

<ROOT>
  ...
  <TEXT>
  Some text that I want to extract
  <SUMMARY> Some more text </SUMMARY>
  <AGENCY> 
     An agency
     <SIGNER> Bob Smith </SIGNER>
  </AGENCY>
  </TEXT>
  ...
</ROOT>

In the end, I want to extract

Some text that I want to extract Some more text An agency Bob Smith

However, the <TEXT> blocks differ in which elements they contain, in what order, and how deeply they nest. Is there a way to do this in Java using DOM? I'd prefer DOM over SAX, but if SAX is much easier, then so be it.

Thanks in advance
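
For reference, a minimal DOM sketch of what is being asked, assuming the input is well-formed XML (not SGML) and that the text of every TEXT element is wanted. The class and method names (DomTextExtractor, extractText) are made up for illustration. It relies on Node.getTextContent(), which concatenates all descendant text nodes in document order, so the nesting depth and element order do not matter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of all TEXT elements using DOM.
final class DomTextExtractor {

    // Returns the text of every TEXT element with runs of whitespace collapsed to one space.
    static String extractText(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        StringBuilder sb = new StringBuilder();
        NodeList textElements = doc.getElementsByTagName("TEXT");
        for (int i = 0; i < textElements.getLength(); i++) {
            // getTextContent() joins all descendant text nodes in document order.
            sb.append(textElements.item(i).getTextContent()).append(' ');
        }
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ROOT><IGNORE>ignore this data</IGNORE><TEXT>\n"
            + "  Some text that I want to extract\n"
            + "  <SUMMARY> Some more text </SUMMARY>\n"
            + "  <AGENCY>\n     An agency\n     <SIGNER> Bob Smith </SIGNER>\n  </AGENCY>\n"
            + "</TEXT></ROOT>";
        // prints: Some text that I want to extract Some more text An agency Bob Smith
        System.out.println(extractText(xml));
    }
}
```

The whitespace collapsing at the end is a choice made here to match the desired output in the question; drop it if the original line breaks matter.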


An XSLT stylesheet would work:

UPDATE #2: I doubt this would work for you, since you're actually using SGML and not XML. The giveaway is that the element declaration in your question uses tag minimization (the "- -" after the element name), which is not allowed in XML.

UPDATE: Modified the XML input and XSLT to only display the text in the <TEXT> structure.

XML INPUT

<ROOT>
  <IGNORE>ignore this data</IGNORE>
  <TEXT>
    Some text that I want to extract
    <SUMMARY> Some more text </SUMMARY>
    <AGENCY> 
      An agency
      <SIGNER> Bob Smith </SIGNER>
    </AGENCY>
  </TEXT>
  <IGNORE>ignore this data</IGNORE>
</ROOT>

XSLT

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:value-of select="normalize-space(/ROOT/TEXT)"/>
  </xsl:template>

</xsl:stylesheet>

OUTPUT

Some text that I want to extract Some more text An agency Bob Smith

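Note: This XSLT only works if TEXT is a child of ROOT. If TEXT might be nested more deeply, you can change the "select" to select="normalize-space(//TEXT)". Be aware that in XSLT 1.0 this expression uses only the first TEXT element in document order, so a document with several TEXT elements needs an xsl:for-each over //TEXT instead.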
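
Since the question is about Java, here is a rough sketch of applying the stylesheet above with the JDK's built-in javax.xml.transform API. The class name XsltTextExtractor and the inlined STYLESHEET constant are assumptions made to keep the example self-contained; in practice the stylesheet and the input would more likely be read from files via StreamSource(File).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical wrapper that runs the answer's stylesheet over an XML string.
final class XsltTextExtractor {

    // The stylesheet from the answer, inlined only for the sake of the example.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"text\"/>"
      + "<xsl:template match=\"/\">"
      + "<xsl:value-of select=\"normalize-space(/ROOT/TEXT)\"/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies STYLESHEET to the given XML and returns the plain-text result.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<ROOT><IGNORE>ignore this data</IGNORE><TEXT>\n"
            + "    Some text that I want to extract\n"
            + "    <SUMMARY> Some more text </SUMMARY>\n"
            + "    <AGENCY>\n      An agency\n      <SIGNER> Bob Smith </SIGNER>\n    </AGENCY>\n"
            + "  </TEXT><IGNORE>ignore this data</IGNORE></ROOT>";
        System.out.println(transform(xml));
    }
}
```

Like the stylesheet itself, this only helps if the files parse as XML; the default transformer will reject SGML input.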


I'm not a big fan of SAX, but for this, I think it would work nicely.

Define a SAX handler that overrides only the characters method, append the received characters to a StringBuilder, and you're done.

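import org.xml.sax.helpers.DefaultHandler;

public class TextExtractor extends DefaultHandler {

  private final StringBuilder sb = new StringBuilder();

  // Called for each chunk of character data; one text node may arrive in several chunks.
  @Override
  public void characters(char[] ch, int start, int length) {
    // Append only the valid slice [start, start + length) of the parser's buffer.
    sb.append(ch, start, length);
  }

  // Returns all character data received so far, in document order.
  public String getText() {
    return sb.toString();
  }

}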
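
The handler above collects every character in the document, including text outside <TEXT>. A possible variation, sketched here under the assumption that only text inside TEXT elements is wanted, tracks nesting depth with a counter and also shows how a handler is wired to a parser. ScopedTextHandler and extract are hypothetical names, not part of the original answer.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: keeps character data only while inside a TEXT element.
final class ScopedTextHandler extends DefaultHandler {

    private final StringBuilder sb = new StringBuilder();
    private int depth = 0; // greater than 0 while inside a TEXT element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at TEXT, then count every nested element below it.
        if (depth > 0 || "TEXT".equals(qName)) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (depth > 0 && --depth == 0) {
            sb.append(' '); // keep consecutive TEXT blocks from running together
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            sb.append(ch, start, length);
        }
    }

    // Collected text with runs of whitespace collapsed to a single space.
    String getText() {
        return sb.toString().trim().replaceAll("\\s+", " ");
    }

    // Parses the XML string with the JDK's default (non-namespace-aware) SAX parser.
    static String extract(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        ScopedTextHandler handler = new ScopedTextHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getText();
    }
}
```

Calling ScopedTextHandler.extract(xml) on the XML INPUT from the earlier answer should skip both IGNORE elements. The same SAXParser.parse call works with the unscoped handler if all text in the file is wanted.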