
Duplicates in Solr index - items added twice or more times

Consider you have a Solr index with approx. 20 Million items. When you index these items they are added to the index in batches.

Approximately 5% of these items are indexed twice or more, causing a duplicates problem.

If you check the log, you can see that these items are indeed added twice (or more), often with an interval of 2-3 minutes between the additions, and with other items indexed in between.

The web server that triggers the indexing is in a load-balanced environment (2 web servers). However, the indexing itself is done by a single web server.

Here are some of the config elements in solrconfig.xml:

<indexDefaults>
.....
<mergeFactor>10</mergeFactor>
<ramBufferSizeMB>128</ramBufferSizeMB>
<maxFieldLength>10000</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<commitLockTimeout>10000</commitLockTimeout>

<mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy">
<double name="maxMergeMB">1024.0</double>
</mergePolicy>

<mainIndex>
<useCompoundFile>false</useCompoundFile>
<ramBufferSizeMB>128</ramBufferSizeMB>
<mergeFactor>10</mergeFactor>

I'm using Solr 1.4.1 and Tomcat 7.0.16, along with the latest SolrNet library.

What might cause this duplicates problem? Thanks for all input!


To answer your question completely I would need to see the schema. There is a uniqueKey field in the schema that works much like a primary key in a database. Make sure the unique identifier of your documents is declared as the uniqueKey; then duplicates will simply overwrite each other, keeping just one copy.
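For illustration, a minimal schema.xml fragment along these lines (the field name "itemid" is a placeholder, not taken from the question) makes Solr treat a re-add of the same item as an update rather than a new document:

<!-- schema.xml sketch: a stable, application-provided key -->
<field name="itemid" type="string" indexed="true" stored="true" required="true"/>

<!-- Declaring it as the uniqueKey means a later add with the same itemid
     replaces the earlier document instead of duplicating it. -->
<uniqueKey>itemid</uniqueKey>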


It is not possible to have two documents with an identical value in the field marked as the uniqueKey in the schema. Adding two documents with the same value will simply result in the latter overwriting (replacing) the former.

So it sounds like it is your mistake and the documents are not really identical.

Make sure your schema and id fields are correct.


To complete what was said above, one solution in this case can be to generate a unique ID (or to designate one of the existing fields as the unique ID) for the document in code, before sending it to Solr.

That way you make sure that the document you want to update will be overwritten rather than added as a new one.
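As a sketch of that idea (field names and values below are hypothetical), the update message posted to Solr carries the application-generated key, so a second add with the same key overwrites the first:

<!-- Both adds use the same itemid, so only one document remains in the index. -->
<add>
  <doc>
    <field name="itemid">item-12345</field>
    <field name="title">First version</field>
  </doc>
</add>

<add>
  <doc>
    <field name="itemid">item-12345</field>
    <field name="title">Re-indexed version (replaces the first)</field>
  </doc>
</add>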


Actually, all added documents get an auto-generated unique key, through Solr's own uuid type:

<field name="uid" type="uuid" indexed="true" stored="true" default="NEW"/>

So any document added to the index will be considered a new one, since it gets a GUID. However, I think we've got a problem in some other code here: code that adds items to the index when they are updated, instead of just updating them.
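The uuid type itself isn't shown above, but it is presumably declared with Solr's UUIDField, something like the sketch below, which is why no add ever matches an existing uniqueKey:

<!-- Presumed fieldType behind the uid field. With default="NEW" on the field,
     every added document receives a freshly generated UUID unless a value
     is supplied explicitly. -->
<fieldType name="uuid" class="solr.UUIDField" indexed="true"/>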

I'll be back! Thanks so far!


OK, it turned out there were a couple of bugs in the code updating the index. Instead of updating, a new document was always added to the index, even though it already existed.

The existing document wasn't overwritten, because every document in our Solr index gets its own auto-generated GUID as its unique key.
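For anyone hitting the same issue: with an auto-generated uid like the one above as the uniqueKey, an update has to re-send the document with the GUID it was originally assigned (the value below is a made-up placeholder); otherwise Solr generates a new uid and you get a duplicate. A sketch:

<!-- Re-adding with the stored GUID replaces the existing document
     instead of creating a new one. -->
<add>
  <doc>
    <field name="uid">EXISTING-GUID-FROM-INDEX</field>
    <field name="title">Updated title</field>
  </doc>
</add>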

Thank you for your answers and time!
