开发者

Is it possible to iterate through documents stored in Lucene Index?

开发者 https://www.devze.com 2022-12-21 01:54 出处:网络
I have some documents stored in a Lucene index with a docId field. I want to get all docIds stored in the index. There is also a problem. Number of documents is about 300 000 so I would prefer to get

I have some documents stored in a Lucene index with a docId field. I want to get all docIds stored in the index. There is also a problem. Number of documents is about 300 000 so I would prefer to get this docIds in chunks of size开发者_开发知识库 500. Is it possible to do so?


IndexReader reader = // create IndexReader
for (int i=0; i<reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");

    // do something with docId here...
}


Lucene 4

Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i=0; i<reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;

    Document doc = reader.document(i);
}

See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html


There is a query class named MatchAllDocsQuery, I think it can be used in this case:

Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher.search(query, RESULT_LIMIT);


Document numbers (or ids) will be subsequent numbers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are valid only for opened IndexReader. You could check if the document is deleted with IndexReader.isDeleted(int documentNumber) method


If you use .document(i) as in above examples and skip over deleted documents be careful if you use this method for paginating results. i.e.: You have a 10 docs/per page list and you need to get the docs. for page 6. Your input might be something like this: offset=60,count = 10 (documents from 60 to 70).

    IndexReader reader = // create IndexReader
for (int i=offset; i<offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");
}

You will have some problems with the deleted ones because you should not start from offset=60, but from offset=60 + the number of deleted documents that appear before 60.

An alternative I found is something like this:

    is = getIndexSearcher(); //new IndexSearcher(indexReader)
    //get all results without any conditions attached. 
    Term term = new Term([[any mandatory field name]], "*");
    Query query = new WildcardQuery(term);

    topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
    is.search(query, topCollector);

   TopDocs topDocs = topCollector.topDocs(offset, count);

note: replace text between [[ ]] with own values. Ran this on large index with 1.5million entries and got random 10 results in less than a second. Agree is slower but at least you can ignore deleted documents if you need pagination.

0

精彩评论

暂无评论...
验证码 换一张
取 消