Indexing PDF documents in Solr with no UniqueKey

Developer https://www.devze.com 2023-03-21 02:50 Source: web
I want to index PDF (and other rich) documents. I am using the DataImportHandler.

Here is how my schema.xml looks:

.........
.........
   <field name="title" type="text" indexed="true" stored="true" multiValued="false"/>
   <field name="description" type="text" indexed="true" stored="true" multiValued="false"/>
   <field name="date_published" type="string" indexed="false" stored="true" multiValued="false"/>
   <field name="link" type="string" indexed="true" stored="true" multiValued="false" required="false"/>
   <dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="false"/>
........
........
<uniqueKey>link</uniqueKey>

As you can see, I have set link as the unique key so that documents are not duplicated when they are indexed again. The file paths are stored in a database, and I have configured the DataImportHandler to fetch the list of file paths and index each document. To test it, I used the tutorial.pdf file that ships with the example docs in Solr. The problem, of course, is that this PDF document has no 'link' field. I am looking for a way to manually set the file path as link when indexing these documents. I tried the following data-config settings:

<entity name="fileItems"  rootEntity="false" dataSource="dbSource" query="select path from file_paths">
  <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <entity name="filePath" dataSource="dbSource" query="SELECT path FROM file_paths as link where path = '${fileItems.path}'">
      <field column="link" name="link"/>
    </entity>
  </entity>
</entity>

Here I create a sub-entity that queries for the path and is meant to return it in a column named 'link'. But I still see this error:

WARNING: Error creating document : SolrInputDocument[{date_published=date_published(1.0)={2011-06-23T12:47:45Z}, title=title(1.0)={Solr tutorial}}]
org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: link

Is there any way for me to create a field called link for the PDF documents?

This was asked here before, but the solution provided there uses the ExtractingRequestHandler, whereas I want to do this through the DataImportHandler.


Try this:

<entity name="fileItems"  rootEntity="false" dataSource="dbSource" query="select path from file_paths">
  <field column="path" name="link"/>
  <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
  </entity>
</entity>
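
As for why the original attempt produced no link field: a likely cause is that `file_paths as link` aliases the table rather than the column, so the result set still exposes a column named `path` and `<field column="link" .../>` has nothing to map. If you did want to keep the sub-entity, a minimal, untested sketch would alias the column instead. It reuses the entity and data source names from the question, and it assumes your JDBC driver returns the alias with the same letter case, which is worth checking:

```xml
<entity name="fileItems" rootEntity="false" dataSource="dbSource" query="select path from file_paths">
  <entity name="tika-test" processor="TikaEntityProcessor" url="${fileItems.path}" dataSource="fileSource">
    <field column="title" name="title" meta="true"/>
    <field column="Creation-Date" name="date_published" meta="true"/>
    <!-- Alias the column (not the table) so each row carries a "link" column -->
    <entity name="filePath" dataSource="dbSource" query="SELECT path AS link FROM file_paths WHERE path = '${fileItems.path}'">
      <field column="link" name="link"/>
    </entity>
  </entity>
</entity>
```

This variant costs one extra database query per file, so mapping `path` to `link` directly on the outer entity, as in the config above, is the simpler route.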