I have been trying, without success, to take an XML document produced by a proprietary database and transform it into a well-formed XML document that will eventually be indexed by Apache Solr.
I would like to take the XML file below and transform it into the Apache Solr format shown after it.
<?xml version="1.0" encoding="UTF-8" ?>
<ecatalogue>
<tuple>
<table name="CatObjectName_tab">
<tuple>
<atom name="CatObjectName">Clog</atom>
</tuple>
</table>
<atom name="CatObjectNumber">2003-39-27A</atom>
<atom name="CatObjectTitle"></atom>
<table name="CatOtherNumbers_tab">
<tuple>
<atom name="CatOtherNumbers">1895.1.117a</atom>
</tuple>
</table>
<table name="ProPlaceName_tab">
<tuple>
<atom name="ProPlaceName">China</atom>
</tuple>
</table>
<table name="CatOtherNumberType_tab">
<tuple>
<atom name="CatOtherNumberType">Other Number</atom>
</tuple>
</table>
<atom name="DatDateMade"></atom>
<atom name="DatEarliestDateMadeOrig"></atom>
<atom name="DatLatestDateMadeOrig"></atom>
</tuple>
<tuple>
<table name="CatObjectName_tab">
<tuple>
<atom name="CatObjectName">Boot</atom>
</tuple>
</table>
<atom name="CatObjectNumber">2003-39-20B</atom>
<atom name="CatObjectTitle"></atom>
<table name="CatOtherNumbers_tab">
<tuple>
<atom name="CatOtherNumbers">1895.1.91b</atom>
</tuple>
</table>
<table name="ProPlaceName_tab">
<tuple>
<atom name="ProPlaceName">China</atom>
</tuple>
</table>
<table name="CatOtherNumberType_tab">
<tuple>
<atom name="CatOtherNumberType">Other Number</atom>
</tuple>
</table>
<atom name="DatDateMade"></atom>
<atom name="DatEarliestDateMadeOrig"></atom>
<atom name="DatLatestDateMadeOrig"></atom>
</tuple>
</ecatalogue>
I would like to transform the above into this:
<add>
<doc>
<field name="ProPlaceName">China</field>
<field name="CatObjectTitle"></field>
<field name="CatObjectNumber">2003-39-27A</field>
<field name="CatOtherNumberType">Other Number</field>
<field name="CatOtherNumbers">1895.1.117a</field>
<field name="CatObjectName_tab">Clog</field>
<field name="DatDateMade"></field>
<field name="DatEarliestDateMadeOrig"></field>
<field name="DatLatestDateMadeOrig"></field>
</doc>
<!-- Row 2 -->
<doc>
<field name="ProPlaceName">China</field>
<field name="CatObjectTitle"></field>
<field name="CatObjectNumber">2003-39-20B</field>
<field name="CatOtherNumberType">Other Number</field>
<field name="CatOtherNumbers">1895.1.91b</field>
<field name="CatObjectName_tab">Boot</field>
<field name="DatDateMade"></field>
<field name="DatEarliestDateMadeOrig"></field>
<field name="DatLatestDateMadeOrig"></field>
</doc>
</add>
Is it best to use XSL/XSLT, or something like Java or another programming language, to make the transformation? How would you approach this problem, and can you point me in the right direction?
I believe it can be done using XSL. Any help is appreciated.
Here's something that should help. It's fairly simple: it ignores the nested table structure and only grabs the atoms inside it. Each field takes its name from the atom, so the object name comes out as CatObjectName rather than CatObjectName_tab. It does not sort the fields in any specific order.
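<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <add>
      <!-- one doc element per top-level record -->
      <xsl:for-each select="ecatalogue/tuple">
        <doc>
          <!-- every atom under the record, including those inside tables -->
          <xsl:for-each select=".//atom">
            <field name="{@name}"><xsl:value-of select="."/></field>
          </xsl:for-each>
        </doc>
      </xsl:for-each>
    </add>
  </xsl:template>
</xsl:stylesheet>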
Unless you can guarantee that the XML will always be valid, I would go with a programming-language approach. It gives you more flexibility in how you parse your data. You stated that the data comes from a proprietary database, and that makes me want the flexibility.
Case in point: what if the database exports invalid XML due to a defect? Which component would you be able to change sooner?
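To make the concern about invalid exports concrete, here is a minimal Python sketch that checks whether an export is well-formed before anything else touches it. It uses only the standard library; the function name `check_well_formed` is made up for illustration, and the sketch only detects syntax errors, it does not validate against any schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if xml_text parses, else (False, description).

    Hypothetical helper: it reports where the parser gave up so the
    export defect can be located.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        return False, f"line {line}, column {column}: {err}"
    return True, ""
```

A failed check gives you a line and column to report back to whoever maintains the database export, which is hard to get out of a bare XSLT run.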
Why not choose a solution that parses the XML and then creates an object model that can be output to the desired format? You could use your own XML/XSLT or a templating toolset (POJO/Velocity) to handle the final transformation.
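As a rough illustration of the parse-then-rebuild idea, here is a sketch in Python rather than Java, using only the standard library. The function name `ecatalogue_to_solr` is hypothetical, and the sketch assumes the structure shown in the question: each top-level tuple is one record, and every atom beneath it, including those nested in tables, becomes one field.

```python
import xml.etree.ElementTree as ET


def ecatalogue_to_solr(source_xml):
    """Build a Solr <add> document from an <ecatalogue> export.

    Hypothetical helper; assumes each direct <tuple> child of the root
    is one record.
    """
    root = ET.fromstring(source_xml)
    add = ET.Element("add")
    # findall("tuple") matches direct children only, so the tuples
    # nested inside <table> elements do not start new documents.
    for record in root.findall("tuple"):
        doc = ET.SubElement(add, "doc")
        # iter("atom") walks all descendants in document order.
        for atom in record.iter("atom"):
            field = ET.SubElement(doc, "field", name=atom.get("name", ""))
            field.text = atom.text
    return ET.tostring(add, encoding="unicode")
```

Because the records pass through ordinary Python objects on the way, this is also the place where you could skip empty fields, rename them, or reorder them before writing the Solr document.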