Indexing texts with many numbers in Lucene

Developer https://www.devze.com 2023-02-06 00:42 Source: Internet

Is it OK to create a term for each number in a text? Example text:

I got 2295910 unique terms.

The numbers can be timestamps, port numbers, anything. The unique numbers lead to a very large number of unique terms. It does not feel right to have the same number of unique terms as documents. Lucene memory usage grows with the number of unique terms.

Is there a special analyzer or a trick for texts with numbers? The StandardAnalyzer creates a term for each unique number.

The needs:

The numbers should remain searchable. There could be multiple numbers in a document. The memory usage is the issue. I have 800M documents in multiple index directories. The memory usage forces me to close the least recently used IndexSearchers.

Untested ideas:

  • Use a special analyzer. It would split the numbers into chunks. 123456 would become "123 456". The query parser would use a phrase search to find a number.
  • Change Lucene code to use a bigger termInfosIndexDivisor when seeing numeric terms.
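
The chunking idea in the first bullet can be sketched in plain Java without any Lucene dependency. This is only an illustration of the logic: `NumberChunker`, `chunk`, `toPhraseQuery`, and `maxDistinctChunks` are hypothetical names, and in a real index the splitting would have to happen inside a custom analyzer, with the same splitting applied at query time.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Lucene): splits digit strings into fixed-size chunks. */
final class NumberChunker {
    private NumberChunker() {}

    /** Splits a digit string left to right, e.g. "1234567" with size 3 gives [123, 456, 7]. */
    static List<String> chunk(String digits, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < digits.length(); i += chunkSize) {
            chunks.add(digits.substring(i, Math.min(digits.length(), i + chunkSize)));
        }
        return chunks;
    }

    /** Builds the quoted phrase a query parser would search for, e.g. "123 456". */
    static String toPhraseQuery(String digits, int chunkSize) {
        return "\"" + String.join(" ", chunk(digits, chunkSize)) + "\"";
    }

    /** Upper bound on distinct chunk terms: 10 + 100 + ... + 10^chunkSize. */
    static long maxDistinctChunks(int chunkSize) {
        long total = 0;
        long power = 1;
        for (int k = 1; k <= chunkSize; k++) {
            power *= 10;
            total += power;
        }
        return total;
    }
}
```

With a chunk size of 3 the number of distinct numeric terms is bounded by 1110, regardless of how many different numbers appear. The trade-off is precision: a phrase search for "123 456" would also match inside a longer number such as 123456789, and could match two adjacent short numbers unless the analyzer leaves a position gap between numbers.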

Maybe I'm reinventing the wheel. Was it solved by somebody already?


Are you currently having a memory problem? It is true that Lucene memory usage grows with the number of unique terms, but it's still a relatively minuscule amount of memory even for indices that have a lot of terms.
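
A rough back-of-the-envelope sketch of why the footprint tends to be small, assuming the older (pre-4.0) terms index design in which only every Nth term is held in memory. The default interval of 128 is an assumption here, and `TermIndexEstimate` is a hypothetical helper, not a Lucene class.

```java
/** Hypothetical estimator: how many terms the in-memory term index would hold. */
final class TermIndexEstimate {
    private TermIndexEstimate() {}

    /**
     * Assumes one term out of every (indexInterval * divisor) is loaded into RAM.
     * Rounds up, since a partial block still needs an index entry.
     */
    static long indexedTermsInMemory(long uniqueTerms, int indexInterval, int divisor) {
        long step = (long) indexInterval * divisor;
        return (uniqueTerms + step - 1) / step;
    }
}
```

Under these assumptions, 2295910 unique terms with an interval of 128 and a divisor of 1 give about 17937 terms in memory, and a divisor of 4 reduces that to about 4485. Actual memory per entry depends on term length and the Lucene version.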

If memory is an issue and you've profiled your code to ensure that it is indeed Lucene that is the problem, you can create another Analyzer that throws away numeric terms. If you do that, obviously, you won't be able to search for documents using numbers.
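
The filtering logic such an Analyzer would need is small. Below is a minimal sketch in plain Java operating on an already tokenized list; in Lucene the same check would presumably sit inside a custom TokenFilter. `NumericTokenDropper` and its methods are hypothetical names, and "numeric" is assumed to mean ASCII digits only.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper: removes purely numeric tokens from a token list. */
final class NumericTokenDropper {
    private NumericTokenDropper() {}

    /** True only if the token is non-empty and consists solely of ASCII digits. */
    static boolean isNumeric(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    /** Returns the tokens in their original order with numeric tokens removed. */
    static List<String> dropNumeric(List<String> tokens) {
        return tokens.stream()
                .filter(token -> !isNumeric(token))
                .collect(Collectors.toList());
    }
}
```

Mixed tokens such as "port8080" are kept by this sketch, since only tokens made entirely of digits are discarded.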


As Bajafresh says: premature optimization is the root of all evil. But supposing this really is a problem:

One option is to duplicate the field: analyze it once discarding the numbers, and a second time discarding everything except the numbers, then index the latter as a numeric field. Numeric fields have a special storage mechanism, which means that only a very few unique terms will be stored (usually fewer than 256, at the cost of some precision).

Of course, this will mean that phrase queries will not work, but other kinds should still be fine (assuming you mess with the query parser enough to get this to work).
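
The field duplication described above amounts to routing each token to one of two destinations. Here is a minimal sketch of that routing in plain Java, assuming tokens are already produced by an analyzer. `SplitTokens` is a hypothetical class, and the choice to keep digit strings that overflow a `long` on the text side is an assumption of this sketch, not something the answer specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical container: tokens for the text field and values for the numeric field. */
final class SplitTokens {
    final List<String> textTokens = new ArrayList<>();
    final List<Long> numbers = new ArrayList<>();

    /** Routes all-digit tokens to the numeric side and everything else to the text side. */
    static SplitTokens split(List<String> tokens) {
        SplitTokens result = new SplitTokens();
        for (String token : tokens) {
            if (isDigits(token)) {
                try {
                    result.numbers.add(Long.parseLong(token));
                    continue;
                } catch (NumberFormatException overflow) {
                    // Too large for a long: fall through and keep it as text.
                }
            }
            result.textTokens.add(token);
        }
        return result;
    }

    private static boolean isDigits(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }
}
```

Because the numbers leave the text field entirely, their positions relative to the surrounding words are lost, which is why phrase queries spanning words and numbers stop working.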


The answer depends on your needs.

Do you need to search on these terms? If you need to search on these terms, then this is just the nature of your search index. There are some tricks you can do if you don't need to search exact values (like range searches), but if you need exact matches, then you are stuck with this.

If you don't need to search these terms, why index them?
