Extract part of an XML file as plain text using XSLT

开发者 https://www.devze.com 2023-03-13 15:59 Source: Internet
Seems like this should be easy, but ...

I'm trying to use XSLT to extract part of an XML file as plain text, throwing away the rest.

So from sample input like this ...

<?xml version="1.0" encoding="UTF-8"?>
<?oxygen RNGSchema="http://segonku.unl.edu/teianalytics/TEIAnalytics.rng"
                        type="xml"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
   <teiHeader type="text">
      <fileDesc>
         <titleStmt>
            <title>Header Title</title>
         </titleStmt>
         <publicationStmt>
            <p>Published</p>
         </publicationStmt>
         <sourceDesc>
            <p>Sourced</p>
         </sourceDesc>
      </fileDesc>
   </teiHeader>
   <text>
      <front>
      </front>
      <body>
         <head>THE TITLE</head>
         <div type="chapter" part="N" org="uniform" sample="complete">
            <head>CHAPTER I</head>
            <p>Some text.</p>
         </div>
      </body>
   </text>
</TEI>

... I'm trying to get just the text contained within the <body> tags and all their children. The desired output in this case is:

THE TITLE
CHAPTER I
Some text.

Potential complication: <body> can also exist in the <front> matter and/or in the <teiHeader>, so what I really need is the children of <body> if and only if that tag is a child of <text> and of <TEI>.

I've tried really simple XSL like this ...

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="/TEI/text/body">
        <xsl:apply-templates select="."/>
    </xsl:template>
</xsl:stylesheet>

... but it gives me plain text of everything in the file, not just the <body> elements.

Thanks!


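The stylesheet in the question outputs all the text because of a well-known property of XPath (and the source of many thousands of similar questions): any unprefixed name is treated as belonging to "no namespace". Every element in the provided XML document, however, belongs to the namespace "http://www.tei-c.org/ns/1.0" and must be addressed as a node in that namespace. The template matching /TEI/text/body therefore never matches anything, and XSLT's built-in templates copy every text node in the document to the output.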
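
For context, this is a sketch of what the processor effectively runs when no user template matches. The two rules below are written to be equivalent to the built-in template rules described in the XSLT 1.0 specification; applying this stylesheet on its own to the sample document should reproduce the "everything in the file" output from the question.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="text"/>

 <!-- Equivalent of the built-in rule for the root and for elements:
      produce nothing, just recurse into the children -->
 <xsl:template match="*|/">
  <xsl:apply-templates/>
 </xsl:template>

 <!-- Equivalent of the built-in rule for text and attribute nodes:
      copy the string value to the output -->
 <xsl:template match="text()|@*">
  <xsl:value-of select="."/>
 </xsl:template>
</xsl:stylesheet>
```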

Solution: Define the document's default namespace in the XSLT code (this time with a prefix bound to it) and use that prefix on every element name.

This is one of the simplest and shortest possible transformations that produces the wanted result:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="x:text/x:body//text()">
  <xsl:value-of select="concat(.,'&#xA;')"/>
 </xsl:template>
 <xsl:template match="text()"/>
</xsl:stylesheet>

When applied to the provided XML document:

<TEI xmlns="http://www.tei-c.org/ns/1.0" n="Wright2-0034.sgml.xml">
    <teiHeader type="text">
        <fileDesc>
            <titleStmt>
                <title>Header Title</title>
            </titleStmt>
            <publicationStmt>
                <p>Published</p>
            </publicationStmt>
            <sourceDesc>
                <p>Sourced</p>
            </sourceDesc>
        </fileDesc>
    </teiHeader>
    <text>
        <front>      </front>
        <body>
            <head>THE TITLE</head>
            <div type="chapter" part="N" org="uniform" sample="complete">
                <head>CHAPTER I</head>
                <p>Some text.</p>
            </div>
        </body>
    </text>
</TEI>

the wanted, correct result is produced:

THE TITLE
CHAPTER I
Some text.
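
The question also asks that <body> count only when it is a child of <text>, which is itself a child of <TEI>. The pattern x:text/x:body already requires <body> to be a direct child of <text>, so a <body> inside <front> or <teiHeader> is not matched. If you want that constraint spelled out in full, a stricter variant anchored at the document root might look like this. It is a sketch that reuses the same prefix x and should give the same three lines for the sample input:

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:x="http://www.tei-c.org/ns/1.0">
 <xsl:output method="text"/>
 <xsl:strip-space elements="*"/>

 <!-- Text nodes below a <body> that is a child of <text>,
      which is a child of the top-level <TEI> element -->
 <xsl:template match="/x:TEI/x:text/x:body//text()">
  <xsl:value-of select="concat(., '&#xA;')"/>
 </xsl:template>

 <!-- Suppress every other text node -->
 <xsl:template match="text()"/>
</xsl:stylesheet>
```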


You can use:

<xsl:strip-space elements="*"/>

and

<xsl:template match="/" xmlns:n="http://www.tei-c.org/ns/1.0">
    <xsl:for-each select="/n:TEI/n:text/n:body/descendant::*/text()">
        <xsl:value-of select="."/>
        <xsl:if test="position() != last()">
            <xsl:text>&#xa;</xsl:text>
        </xsl:if>
    </xsl:for-each>
</xsl:template>

It returns:

THE TITLE
CHAPTER I
Some text.


Try matching /TEI/text/body//text()
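
As written, that path uses unprefixed names, so in XSLT 1.0 it runs into the same namespace problem as the stylesheet in the question. One way to keep the path exactly as suggested is the xpath-default-namespace attribute, which assumes an XSLT 2.0 processor is available (that is an assumption; the question uses version="1.0"). A minimal sketch:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0">
    <xsl:output method="text"/>
    <xsl:strip-space elements="*"/>

    <!-- Unprefixed element names in this pattern are resolved
         against xpath-default-namespace (XSLT 2.0 only) -->
    <xsl:template match="/TEI/text/body//text()">
        <xsl:value-of select="concat(., '&#xA;')"/>
    </xsl:template>

    <!-- Suppress every other text node -->
    <xsl:template match="text()"/>
</xsl:stylesheet>
```

With an XSLT 1.0 processor, bind the namespace to a prefix and prefix each step of the path instead.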
