How to index different types of XML using DIH in Solr?

Developer https://www.devze.com 2023-01-17 02:03 Source: Internet
I need to index five different kinds of XML files. They share a similar structure, with slight differences between them.

Example 1:

<?xml version="1.0"?>
<manifest>
        <metadata>
                <isbn>9780815341291</isbn>
                <title>Essential Cell Biology,Third Edition</title>
                <authors>
                        <author>Alberts;Bruce</author>
                        <author>Bray;Dennis</author>
                </authors>
                <categories>
                        <category>SCABC</category>
                        <category>SCDEF</category>
                </categories>
        </metadata>

        <resources>
                <audioresource>
                        <uuid>123456789</uuid>
                        <source>03_Mutations_Origin_Cancer.mp3</source>
                        <mimetype>audio/mpeg</mimetype>
                        <title>Part Three - Mutations and the Origin of Cancer</title>
                        <description>123</description>
                        <chapters>
                                <chapter>1</chapter>
                        </chapters>
                </audioresource>
        </resources>
</manifest>

Example 2:

<?xml version="1.0"?> 
<manifest> 
        <metadata> 
                <isbn>9780815341291</isbn> 
                <title>Essential Cell Biology,Third Edition</title> 
                <authors> 
                        <author>FN:Alberts;Bruce</author> 
                        <author>FN:Bray;Dennis</author> 
                </authors> 
                <categories> 
                        <category>SCABC</category> 
                        <category>SCGHI</category> 
                </categories> 
        </metadata> 

        <resources> 
                <glossaryresource> 
                        <uuid>123456789</uuid> 
                        <term>A subunit </term> 
                        <definition>The portion of a bacterial exotoxin that interferes with normal host cell function. </definition> 
                        <chapters> 
                                <chapter>10</chapter> 
                        </chapters> 
                </glossaryresource> 
        </resources> 
</manifest> 

My dih-config.xml is as below:

<dataConfig> 
        <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/> 
        <document> 
                <entity name="dir" rootEntity="false" dataSource="null" processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="true" baseDir="X:/tmp/npr">
                        <entity name="audioresource" 
                                        rootEntity="true" 
                                        dataSource="fileReader" 
                                        url="${dir.fileAbsolutePath}" 
                                        stream="false" 
                                        logTemplate=" processing ${dir.fileAbsolutePath}" 
                                        logLevel="debug" 
                                        processor="XPathEntityProcessor" 
                                        forEach="/manifest/metadata | /manifest/metadata/authors | /manifest/metadata/categories | /manifest/metadata/resources | /manifest/resources/audioresource | /manifest/resources/audioresource/chapters" 
                                        transformer="DateFormatTransformer"> 

                                        <field column="category" xpath="/manifest/metadata/categories/category" /> 
                                        <field column="author" xpath="/manifest/metadata/authors/author" /> 
                                        <field column="book_title" xpath="/manifest/metadata/title" /> 
                                        <field column="isbn" xpath="/manifest/metadata/isbn"/> 
                                        <field column="id" xpath="/manifest/resources/audioresource/uuid"/> 
                                        <field column="mimetype" xpath="/manifest/resources/audioresource/mimetype" /> 
                                        <field column="title" xpath="/manifest/resources/audioresource/title"/> 
                                        <field column="description" xpath="/manifest/resources/audioresource/description"/> 
                                        <field column="chapter" xpath="/manifest/resources/audioresource/chapters/chapter"/>
                                        <field column="source" xpath="/manifest/resources/audioresource/source"/> 
                        </entity> 
                </entity> 
        </document> 
</dataConfig> 

I'm not very familiar with XPath. I can't use a wildcard in an element name, can I? I tried it and it didn't work.

Many thanks in advance.


I'm currently investigating a similar issue. Have you tried creating an XSLT? The entity element has an optional "xsl" attribute.
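
To illustrate the XSLT idea, here is a minimal sketch rather than a tested setup. It assumes that every resource type sits directly under /manifest/resources and carries a uuid and a chapters block, as in the two examples in the question. Full XPath is available inside XSLT, so the stylesheet can use a wildcard step to rewrite each resource into a generic record that also carries the book metadata. The output element names (resources, resource, type, id, book_title) are made up for this example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/manifest">
    <resources>
      <!-- The wildcard matches audioresource, glossaryresource, and any other type -->
      <xsl:for-each select="resources/*">
        <resource>
          <!-- Keep the original element name so the resource type stays searchable -->
          <type><xsl:value-of select="local-name()"/></type>
          <id><xsl:value-of select="uuid"/></id>

          <!-- Book-level metadata, repeated on every record -->
          <xsl:copy-of select="/manifest/metadata/isbn"/>
          <book_title><xsl:value-of select="/manifest/metadata/title"/></book_title>
          <xsl:copy-of select="/manifest/metadata/authors/author"/>
          <xsl:copy-of select="/manifest/metadata/categories/category"/>

          <!-- Flatten chapters, then copy the remaining children unchanged
               (source, mimetype, title, description, term, definition, ...) -->
          <xsl:copy-of select="chapters/chapter"/>
          <xsl:copy-of select="*[not(self::uuid) and not(self::chapters)]"/>
        </resource>
      </xsl:for-each>
    </resources>
  </xsl:template>
</xsl:stylesheet>
```

The inner entity could then point its xsl attribute at that stylesheet and read the normalized records with fixed paths. The stylesheet location X:/tmp/xsl/normalize-resources.xsl is hypothetical, and type, term, and definition are assumed to exist as fields in the Solr schema:

```xml
<entity name="resource"
        rootEntity="true"
        dataSource="fileReader"
        url="${dir.fileAbsolutePath}"
        processor="XPathEntityProcessor"
        xsl="X:/tmp/xsl/normalize-resources.xsl"
        forEach="/resources/resource">
        <!-- The xpath values refer to the XSLT output, not to the original manifest -->
        <field column="id" xpath="/resources/resource/id"/>
        <field column="type" xpath="/resources/resource/type"/>
        <field column="isbn" xpath="/resources/resource/isbn"/>
        <field column="book_title" xpath="/resources/resource/book_title"/>
        <field column="author" xpath="/resources/resource/author"/>
        <field column="category" xpath="/resources/resource/category"/>
        <field column="chapter" xpath="/resources/resource/chapter"/>
        <field column="title" xpath="/resources/resource/title"/>
        <field column="description" xpath="/resources/resource/description"/>
        <field column="mimetype" xpath="/resources/resource/mimetype"/>
        <field column="source" xpath="/resources/resource/source"/>
        <field column="term" xpath="/resources/resource/term"/>
        <field column="definition" xpath="/resources/resource/definition"/>
</entity>
```

With this layout a single entity should cover all five file types, since fields that a given resource type lacks are simply absent from its record.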
