I have implemented Lucene on my website. About once every 4 days my search index breaks. I get an error saying that the index is unreadable and the site shows a 500 error to users.
I SSH in, rebuild my index and everything goes back to normal.
The only part of this project that is slightly different from normal is the high number of writes I am doing to the DB. I am incrementing a ViewCount field on every page view. I presume Lucene updates the document every time.
Presuming that this is the issue: Is there a way to tell Lucene to NOT update the index when we are simply incrementing the count field?
NB: My project uses sfLucenePlugin within Symfony
NB2: The error message is similar to:
Sep 03 18:52:21
symfony [err] {sfException}
File '/home/username/symfony_project/data/index/MyIndex/en/_1nws_s.del' is not readable.
in /home/username/symfony_project/plugins/sfLucenePlugin/lib/vendor/Zend/Search/Lucene/Storage/File/Filesystem.php
line 59
Are you seeing messages like this in your log files?
Sep 03 18:52:21 symfony [err] {sfException} File '/home/username/symfony_project/data/index/MyIndex/en/_1nws_s.del' is not readable. in /home/username/symfony_project/plugins/sfLucenePlugin/lib/vendor/Zend/Search/Lucene/Storage/File/Filesystem.php line 59
If you are, the key point is probably that your index is being corrupted by the high number of files being open concurrently on your server. This is a limitation often encountered on shared hosting, since other users, even on different virtual servers, add up to a lot of file reads and writes, especially for web serving.
Lucene creates new fragments of the index for each update, so over time the index is spread across many files rather than being a well-optimised index of just one file. This means the likelihood of a concurrency error increases over time for a badly-optimised index. Optimising frequently can help, but this can be time-consuming for large indexes, and you are still at risk of a concurrency error, even if the probability is lower.
The trick to solving this is to balance the optimisation schedule using a cron job and, as you note, to not update the index for trivial data changes (e.g. modified dates, view counts).
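To automate the optimisation from cron, something along these lines could work. This is only a sketch, assuming the index lives at the path shown in your error message and that you reuse the Zend classes bundled with sfLucenePlugin; adjust both paths to match your installation.

<?php
// optimise_index.php - run from cron, e.g. nightly, to merge the index's
// many segment files into fewer files and so reduce the number of files
// held open during searches.
set_include_path('/home/username/symfony_project/plugins/sfLucenePlugin/lib/vendor' . PATH_SEPARATOR . get_include_path());
require_once 'Zend/Search/Lucene.php';

$indexPath = '/home/username/symfony_project/data/index/MyIndex/en';

// Open the existing index and merge its segments.
$index = Zend_Search_Lucene::open($indexPath);
$index->optimize();

echo 'Optimised index, ' . $index->numDocs() . " documents\n";

A crontab entry such as 0 4 * * * php /path/to/optimise_index.php would then run it nightly, outside your busy hours.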
For the latter point, you could create a softUpdate() method in each of your model classes that form part of the index. Add some logic there that ignores the trivial column updates and does not trigger sfLucenePlugin's search-index update. Now, this is not as easy as it sounds, as sfLucenePlugin uses Propel behaviours which are run 'globally' for your objects...
The solution is to edit the behaviour directly, or drop the behaviour and write your own methods to update the index. Luckily, there is a good example of the functions required to do this in the symfony Jobeet tutorial, day 17: http://www.symfony-project.org/jobeet/1_4/Propel/en/17#chapter_17_sub_the_save_method
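To make that concrete, here is a rough sketch of what a Jobeet-style save() override could look like on a Propel 1.x model, skipping the index update when only trivial columns have changed. The Article class, the VIEW_COUNT and UPDATED_AT columns and the updateLuceneIndex() helper are hypothetical names, not part of sfLucenePlugin's API; the helper would hold your Jobeet-style indexing code.

<?php
class Article extends BaseArticle
{
  // Columns whose changes should not trigger a search-index update.
  static protected $trivialColumns = array(
    ArticlePeer::VIEW_COUNT,   // hypothetical column
    ArticlePeer::UPDATED_AT,
  );

  public function save(PropelPDO $con = null)
  {
    // Work out what changed before parent::save() resets the
    // modified-columns list.
    $significantChanges = array_diff(
      $this->getModifiedColumns(),
      self::$trivialColumns
    );

    $ret = parent::save($con);

    // Only touch the Lucene index when a non-trivial column changed.
    if (count($significantChanges) > 0)
    {
      $this->updateLuceneIndex();
    }

    return $ret;
  }
}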
The downside here is that you may end up needing to rebuild, in PHP, the indexing strategy that you had neatly defined in sfLucenePlugin's YAML syntax... The syntax is not hard, but the complexity may be.
I hope this makes sense, and helps in some way.
Are you using NRT (near-real-time search)? If so, you should never need to flush to disk explicitly, and that configuration is very well suited to high-volume writes.
In any case, heavy writing alone should not break the index. Are you sure your code is entirely thread-safe? Every time I have thought I'd found a problem with Lucene's integrity, it turned out that my code did not handle locking properly. (As ajreal suggested, your operating system might be hitting a "too many open files" error or something similar; a rare error like this might not always be handled correctly.)
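For example, if several PHP processes can write to the index at once, one option is to serialise your own writes with an application-level lock on top of whatever locking Zend_Search_Lucene does internally. This is only a sketch under that assumption; the lock-file path and the indexing code in the middle are placeholders.

<?php
// Acquire an exclusive lock so only one process updates the index at a time.
$lockFile = '/home/username/symfony_project/data/index/my_index.lock'; // placeholder path
$fp = fopen($lockFile, 'c');
if ($fp === false || !flock($fp, LOCK_EX))
{
  throw new Exception('Could not acquire the search-index lock');
}

try
{
  // Only one PHP process at a time reaches this point.
  $index = Zend_Search_Lucene::open('/home/username/symfony_project/data/index/MyIndex/en');
  // ... add or update documents here ...
  $index->commit();
  flock($fp, LOCK_UN);
  fclose($fp);
}
catch (Exception $e)
{
  flock($fp, LOCK_UN);
  fclose($fp);
  throw $e;
}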