How do I generate a unique id using Lucene?

Developer https://www.devze.com 2023-02-12 19:14 Source: web

I am using Lucene to store (as well as index) various documents.

Each document needs a persistent unique identifier (to be used as part of a URL).

If I were using a SQL database, I could use an integer primary key with auto_increment (or similar) to automatically generate a unique id for every record that was added.

Is there any way of doing this with Lucene?

I am aware that documents in Lucene are numbered, but have noted that these numbers are reallocated over time.

(I'm using the Java version of Lucene 3.0.3.)

As larsmans said, you need to store this in a separate field. I suggest that you make the field indexed as well as stored, and index it using a KeywordAnalyzer. You can keep a counter in memory and update it for each new document.

What remains is the problem of persistence - how to store the maximal id when the Lucene process stops. One possibility is to use a text file which saves the maximal id.
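As a rough illustration of the text-file idea, here is a minimal sketch in plain Java. The class name FileBackedIdCounter and the file layout (a single number stored as text) are assumptions made up for this example; nothing here is part of Lucene.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Lucene): a counter backed by a small text file. */
final class FileBackedIdCounter {
    private final Path file;
    private long lastId; // highest id handed out so far; 0 if none

    FileBackedIdCounter(Path file) {
        this.file = file;
        try {
            // Resume from the saved value, or start at 0 if nothing was saved yet.
            if (Files.exists(file)) {
                String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
                lastId = text.isEmpty() ? 0L : Long.parseLong(text);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the next id, saving it to the file before handing it out. */
    synchronized long nextId() {
        long next = lastId + 1;
        try {
            // Save before returning, so a clean restart resumes after this id.
            Files.write(file, Long.toString(next).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        lastId = next;
        return next;
    }
}
```

The value returned by nextId() would then be stored, as a string, in the document's id field. Writing the file on every call is simple but slow; reserving ids in blocks would cut down on writes, at the cost of gaps after a restart.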

I believe Flexible Indexing will allow you to add the maximal id to the index as a "global" field. If you are willing to work with Lucene's trunk, you can try flexible indexing to see whether it fits the bill.

For similar situations, I use the following algorithm (it has nothing to do with Lucene, but you can use it anyway).

Create a new AtomicLong, initialized with a value obtained from System.currentTimeMillis() or System.nanoTime().
Generate each subsequent ID by calling .incrementAndGet() or .getAndIncrement() on that AtomicLong.
If the system is restarted, the AtomicLong is initialized to the current timestamp again during startup.
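The steps above can be sketched roughly as follows. The class name TimestampSeededIds is made up for this example, and the constructor that takes an explicit seed exists only to make the behaviour easy to check.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: ids counted up from a timestamp taken at startup. */
final class TimestampSeededIds {
    private final AtomicLong counter;

    /** Seeds the counter with the current wall-clock time in milliseconds. */
    TimestampSeededIds() {
        this(System.currentTimeMillis());
    }

    /** Seeds the counter with an explicit value (assumption: used for testing only). */
    TimestampSeededIds(long seed) {
        this.counter = new AtomicLong(seed);
    }

    /** Thread-safe and non-blocking: each call returns the previous value plus one. */
    long nextId() {
        return counter.incrementAndGet();
    }
}
```

One instance would be created at startup and shared by all indexing threads, with each nextId() result stored in the document's id field.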

Pros: simple, effective, thread-safe, non-blocking. If you need clustered id support, just add space for a hi/lo algorithm on top of the existing long, or sacrifice some of its high bytes.

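Cons: can produce duplicates after a restart if new entities are added at an average rate of more than 1/ms (for System.currentTimeMillis()) or 1/ns (for System.nanoTime()), because the counter then runs ahead of the clock. Does not tolerate clock abnormalities, such as the clock being set backwards. Also note that System.nanoTime() counts from an arbitrary origin that can differ between JVM runs, so it is a risky seed if ids must stay unique across restarts.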

You can also consider using a UUID as yet another alternative. The probability of a duplicate UUID is virtually nonexistent.
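For the UUID route, a minimal sketch using the standard java.util.UUID class could look like this. The wrapper class UuidIds is just an illustrative name.

```java
import java.util.UUID;

/** Hypothetical helper: random UUIDs as document ids. */
final class UuidIds {
    /** Returns a random (version 4) UUID in its canonical 36-character text form. */
    static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

The resulting string contains only hexadecimal digits and hyphens, so it can be used in a URL without escaping.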

EDIT: Several commenters have raised possible issues with this approach and I don't have time to test it thoroughly. I'm leaving it here because Yuval F. refers to it. Please don't downvote unnecessarily.

Given an IndexWriter w, you can use w.maxDoc() + 1 as an id and store that (as a string) in a separate Field. Make sure the Field is stored.

Try to find a unique value in the data source you are indexing, and store it in the Lucene document. A data source could be a MySQL database, files from a file system, etc.

For example, if you are indexing content from a MySQL database, you can assemble a unique id from the table name and primary key, in the form "tablename_rowID".

Let's say you are indexing from two tables, 'pages' and 'comments'. For every row in the pages table, you can generate a unique id such as "page_28" for the row with id 28. Similarly, if you index row 36 in the comments table, its unique id would be "comment_36".
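A small sketch of how such ids might be assembled is shown below. The class and method names are made up for this example, and the prefix is whatever stable label you choose for each source.

```java
/** Hypothetical helper: builds ids of the form "<source>_<rowId>". */
final class SourceIds {
    /**
     * @param sourceName a stable label for the data source, e.g. "page" or "comment"
     * @param rowId      the primary key of the row within that source
     */
    static String of(String sourceName, long rowId) {
        return sourceName + "_" + rowId;
    }
}
```

With this, SourceIds.of("page", 28) gives "page_28" and SourceIds.of("comment", 36) gives "comment_36". The ids stay unique as long as each source uses a different label and its primary keys are never reused.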

If all options fail, then I would stick to a UUID. With some additional paranoia, this could be a UUID appended to a timestamp of now().
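One possible reading of the "timestamp plus UUID" idea is sketched below. The class name TimestampedUuid and the "millis-uuid" layout are assumptions chosen for this example.

```java
import java.util.UUID;

/** Hypothetical helper: a millisecond timestamp followed by a random UUID. */
final class TimestampedUuid {
    /** Uses the current wall-clock time as the timestamp part. */
    static String newId() {
        return newId(System.currentTimeMillis());
    }

    /** Builds "<epochMillis>-<uuid>"; the explicit timestamp makes the format easy to check. */
    static String newId(long epochMillis) {
        return epochMillis + "-" + UUID.randomUUID();
    }
}
```

Putting the timestamp first also means the ids sort roughly by creation time when compared as strings, at least while the timestamps have the same number of digits.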