Lucene.Net 2.9.2: OOM exception when adding many documents

I am trying to index about 10.000.000 documents with Lucene.NET 2.9.2. These documents (forum posts of varying length) are fetched in batches of 10.000 from an MSSQL database and then passed to my Lucene.NET wrapper class called LuceneCorpus:

public static void IndexPosts(LuceneCorpus luceneCorpus, IPostsRepository postsRepository, int chunkSize)
{
    // omitted: this whole method is executed in a background worker to enable GUI feedback
    // chunkSize is 10.000
    int count = 0;
    // totalSteps is ~10.000.000
    int totalSteps = postsRepository.All.Count();
    while (true)
    {
        var posts = postsRepository.All.Skip(count).Take(chunkSize).ToList();
        if (posts.Count == 0)
            break;
        luceneCorpus.AddPosts(posts);
        count += posts.Count;                   
    }
    luceneCorpus.OptimizeIndex();
}

I read that it is recommended to use a single IndexWriter instead of opening and closing a new one for each batch of documents. Therefore, my LuceneCorpus class looks like this:

public class LuceneCorpus
{
    private Analyzer _analyzer;
    private Directory _indexDir;
    private IndexWriter _writer;

    public LuceneCorpus(DirectoryInfo indexDirectory)
    {
        _indexDir = FSDirectory.Open(indexDirectory);
        _analyzer = new StandardAnalyzer(Version.LUCENE_29);
        _writer = new IndexWriter(_indexDir, _analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        _writer.SetRAMBufferSizeMB(128);
    }

    public void AddPosts(IEnumerable<Post> posts)
    {
        List<Document> docs = new List<Document>();
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            _writer.AddDocument(doc);
        }
        _writer.Commit();
    }

    public void OptimizeIndex()
    {
        _writer.Optimize();
    }
}

Now, my problem is that memory consumption keeps growing until I finally hit an out-of-memory exception after indexing about 700.000 documents, somewhere in the IndexPosts method.

As far as I know, the index writer should flush either when it reaches the RAM buffer size (128 MB) or when Commit() is called. In fact, the writer definitely DOES flush, and it even keeps track of the flushes, but memory keeps filling up nevertheless. Is the writer somehow keeping a reference to the documents so that they aren't garbage collected, or what am I missing here?
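
For reference, here is roughly how the flush triggers can be configured on a 2.9 writer; the document-count trigger and its value below are purely illustrative, I am not actually setting it in my code:

// Both triggers can be active at once; whichever is reached first flushes the
// buffered documents to the directory. Commit() additionally makes the flushed
// segments visible to index readers.
_writer.SetRAMBufferSizeMB(128);     // flush once buffered index data reaches ~128 MB
_writer.SetMaxBufferedDocs(100000);  // ...or once this many documents are buffered (illustrative value)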

Thanks in advance!

Edit: I also tried creating the writer, analyzer, and indexDir inside the AddPosts method instead of keeping them as class members, but that doesn't prevent the OOM exception either.


Try the latest and greatest. It has some memory-leak fixes:

https://svn.apache.org/repos/asf/incubator/lucene.net/branches/Lucene.Net_2_9_4g/src/


I read that it is recommended to use a single IndexWriter instead of opening and closing a new one for each batch of documents.

That may be true in general, but your particular case seems to demand another approach. You should try one writer per batch. Your large memory requirement is forcing you to use a less-than-optimal solution: trading memory for speed, and vice versa, is common.
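
A rough sketch of what a writer-per-batch AddPosts could look like, reusing the fields from the LuceneCorpus above (the create: false flag assumes the index was already created once, e.g. in the constructor; this is an illustration, not tested against your setup):

public void AddPosts(IEnumerable<Post> posts)
{
    // Open a short-lived writer in append mode, add one batch, then close it so
    // the writer's internal buffers are released between batches.
    var writer = new IndexWriter(_indexDir, _analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED);
    try
    {
        foreach (var post in posts)
        {
            var doc = new Document();
            doc.Add(new Field("SimplifiedBody", post.SimplifiedBody, Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }
    finally
    {
        writer.Close(); // commits pending changes and frees the writer's resources
    }
}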


It turns out Lucene wasn't causing the memory leak at all; the DataContext of my PostsRepository was. I solved it by using a temporary, non-tracking DataContext for each "Take" iteration, roughly as sketched below.
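
A minimal sketch of that fix, assuming a LINQ to SQL DataContext; the PostsDataContext type, its Posts table, and the connection string are placeholders for whatever the repository actually wraps:

while (true)
{
    List<Post> posts;
    // New, short-lived DataContext per chunk; with object tracking disabled it
    // does not keep every materialized Post alive for the whole import run.
    using (var db = new PostsDataContext(connectionString))
    {
        db.ObjectTrackingEnabled = false;
        posts = db.Posts.Skip(count).Take(chunkSize).ToList();
    }
    if (posts.Count == 0)
        break;
    luceneCorpus.AddPosts(posts);
    count += posts.Count;
}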

Sorry and thanks anyways!
