I am trying to refresh a Lucene index in incremental mode, that is, updating documents that have changed while keeping the other, unchanged documents as they are. To update a changed document, I delete it using IndexWriter.deleteDocuments(Query) and then add the updated version using IndexWriter.addDocument(). The Query object used in IndexWriter.deleteDocuments contains approximately 12-15 terms. While refreshing the index I also sometimes need to do a FULL refresh by deleting all the documents using IndexWriter.deleteDocuments and then adding the new documents.
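Simplified, the pattern looks roughly like the following sketch (Lucene 3.x-style API; the directory path and field names are only illustrative):

// Sketch of the delete-then-add pattern; classes are from org.apache.lucene.* and java.io.File.
Directory dir = FSDirectory.open(new File("/path/to/index"));
IndexWriter writer = new IndexWriter(dir,
        new StandardAnalyzer(Version.LUCENE_30), false,
        IndexWriter.MaxFieldLength.UNLIMITED);

// Delete the stale version of a changed document...
writer.deleteDocuments(new TermQuery(new Term("id", "doc-42")));

// ...and add the updated version.
Document updated = new Document();
updated.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.NOT_ANALYZED));
updated.add(new Field("body", "updated content", Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(updated);

writer.commit();
writer.close();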
The problem is that when I call IndexWriter.flush() after roughly 100,000 document deletions, it takes a long time to execute and throws an OutOfMemoryError. If I disable flushing, indexing goes fast up to about 2,000,000 document deletions and then throws an OutOfMemoryError. I have tried setting IndexWriter.setRAMBufferSizeMB to 500 to avoid the out-of-memory error, but with no luck. The index size is 1.8 GB.
First. Increasing the RAM buffer is not your solution. As far as I understand, it is a cache, and I would rather argue that it increases your problem. An OutOfMemoryError is a JVM problem, not a problem of Lucene. You can set the RAM buffer to 1 TB - if your VM does not have enough memory, you have a problem anyway. So you can do two things: increase JVM memory or decrease consumption.
Second. Have you already considered increasing your heap memory settings? The reason flushing takes forever is that the system does a lot of garbage collection shortly before it runs out of memory; this is a typical symptom. You can check that with a tool like jvisualvm (you need to install the GC details plugin first), which lets you select and monitor your crazy OutOfMemory app. Once you have learned about your memory issue, you can increase the maximum heap space like this:
java -Xmx512M MyLuceneApp (or however you start your Lucene application)
But, again, I would use tools to check your memory consumption profile and garbage collection behavior first. Your goal should be to avoid running low on memory, because that causes garbage collection to slow your application down to the point of having no performance at all.
Third. If you increase your heap, you have to be sure that you have enough native memory as well. If you do not (check with tools like top on Linux), your system will start swapping to disk, which also hits Lucene performance like crazy. Lucene is optimized for sequential disk reads, and if your system starts to swap, your hard disk will do a lot of seeking, which is two orders of magnitude slower than sequential reading. So it will be even worse.
Fourth. If you do not have enough memory, consider deleting in batches: after 1,000 or 10,000 documents do a flush, then again and again, as sketched below. The reason for this OutOfMemoryError is that Lucene has to keep everything in memory until you flush. So it is a good idea anyway not to let batches grow too big before flushing, to avoid problems in the future.
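A minimal sketch of that batching idea, assuming you have some collection of per-document delete queries (the batch size and the deleteQueries variable are illustrative; commit() is used as the flush point):

int batchSize = 10000;                      // tune this to what your heap can hold comfortably
int pending = 0;
for (Query deleteQuery : deleteQueries) {   // deleteQueries: however you build your 12-15 term queries
    writer.deleteDocuments(deleteQuery);
    if (++pending >= batchSize) {
        writer.commit();                    // push buffered deletes out of the heap and onto disk
        pending = 0;
    }
}
writer.commit();                            // flush the final, partial batch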
On the (rare) occasion that I want to wipe all docs from my Lucene index, I find it much more efficient to close the IndexWriter, delete the index files directly, and then start a fresh index. The operation takes very little time and is guaranteed to leave your index in a pristine (if somewhat empty) state.
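A rough sketch of that approach, assuming the index lives in a plain FSDirectory on disk (the path and the Lucene 3.x-style constructor are illustrative; create=true on the new IndexWriter then starts the fresh, empty index):

writer.close();                                        // release the write lock first
File indexDir = new File("/path/to/index");
for (File f : indexDir.listFiles()) {                  // remove all existing segment files
    f.delete();
}
IndexWriter fresh = new IndexWriter(FSDirectory.open(indexDir),
        new StandardAnalyzer(Version.LUCENE_30), true, // create=true starts a brand-new index
        IndexWriter.MaxFieldLength.UNLIMITED);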
Try using a smaller RAM buffer size (setRAMBufferSizeMB) for your IndexWriter. IndexWriter flushes when the buffer is full (or when the number of documents reaches a certain level). By setting the buffer size to a large number, you implicitly postpone calling flush, which can result in keeping too many documents in memory.
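For example (illustrative values; on Lucene 2.x/3.0 these setters live directly on IndexWriter, while on newer versions the equivalent settings are on IndexWriterConfig):

writer.setRAMBufferSizeMB(32);      // the default is 16 MB; far smaller than 500 MB, so flushes happen sooner
writer.setMaxBufferedDocs(10000);   // optionally also trigger a flush by buffered-document count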