开发者

splitting lucene index into two halves

开发者 https://www.devze.com 2022-12-30 16:43 出处:网络
what is the best way to split an existin开发者_JS百科g Lucene index into two halves i.e. each split should contain half of the total number of documents in the original indexThe easiest way to split a

what is the best way to split an existin开发者_JS百科g Lucene index into two halves i.e. each split should contain half of the total number of documents in the original index


The easiest way to split an existing index (without reindexing all the documents) is to:

  1. Make another copy of the existing index (i.e. cp -r myindex mycopy)
  2. Open the first index, and delete half the documents (range 0 to maxDoc / 2)
  3. Open the second index, and delete the other half (range maxDoc / 2 to maxDoc)
  4. Optimize both indices

This is probably not the most efficient way, but it requires very little coding to do.


Recent versions of Lucene have a dedicated tool to do this (IndexSplitter and MultiPassIndexSplitter under contrib/misc).


A fairly robust mechanism is to use a checksum of the document, modulo the number of indexes, to decide which index it will go into.


This question was one of the first I found when I was researching answers to this problem, so I'm leaving my solution here for future generations. In my case, I needed to split my index along specific lines, not arbitrarily down the middle or into thirds or what have you. This is a C# solution using Lucene 3.0.3.

My app's index is over 300GB in size, which was becoming a little unmanageable. Each document in the index is associated to one of the manufacturing plants that uses the app. There is no business reason that one plant would ever search for another plant's data, so I needed to cleanly divide the index along those lines. Here's the code I wrote to do so:

var distinctPlantIDs = databaseRepo.GetDistinctPlantIDs();
var sourceDir = GetOldIndexDir();
foreach (var plantID in distinctPlantIDs)
{
    var query = new TermQuery(new Term("PlantID", plantID.ToString()));
    var targetDir = GetNewIndexDirForPlant(plantID); //returns a unique directory where this plant's index will go

    //read each plant's documents and write them to the new index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceSearcher = new IndexSearcher(sourceDir, true))
    using (var destWriter = new IndexWriter(targetDir, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        var numHits = sourceSearcher.DocFreq(query.Term);
        if (numHits <= 0) continue;
        var hits = sourceSearcher.Search(query, numHits).ScoreDocs;
        foreach (var hit in hits)
        {
            var doc = sourceSearcher.Doc(hit.Doc);
            destWriter.AddDocument(doc);
        }
        destWriter.Optimize();
        destWriter.Commit();
    }

    //delete the documents out of the old index
    using (var analyzer = new StandardAnalyzer(Version.LUCENE_30, CharArraySet.EMPTY_SET))
    using (var sourceWriter = new IndexWriter(sourceIndexDir, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        sourceWriter.DeleteDocuments(query);
        sourceWriter.Commit();
    }
}

That part that deletes the records out of the old index is there because in my case, one plant's records took up the majority of the index (over 2/3rds). So in my real version there is some extra code to do that plant last, and instead of splitting it out like the others it will optimize the remaining index (which is just that plant) and then move it to its new directory.

Anyway, hope this helps someone out there.

0

精彩评论

暂无评论...
验证码 换一张
取 消