
Hints on implementing XQuery full-text search using Lucene

Developer https://www.devze.com 2022-12-16 22:29 Source: web

I've used Lucene on a previous project, so I am somewhat familiar with the API. However, I've never had to do anything "fancy" (where "fancy" means things like using filters, different analyzers, boosting, payloads, etc).

I'm about to embark on implementing the full-text search feature of XQuery:

http://www.w3.org/TR/xpath-full-text-10/

Its query abilities are the most complicated I've seen. From my experience with Lucene, I know it can be used to implement some of the features; however, I'd like to walk through them all. For each feature, I only need a simple answer like, "Feature X is best implemented using a query filter," just so I start off in the right direction for each feature.

Note: I will be implementing my own query parser and constructing queries "by hand" using various instantiations of Lucene classes.

3.3 Cardinality Selection

This allows you to say things like:

title ftcontains "usability" occurs at least 2 times

which means that the title field must contain the word "usability" at least twice. How can this be done?
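One way to picture cardinality selection is as a threshold on the number of matches within the field. The sketch below is plain Java with no Lucene dependency: the class `CardinalitySketch` and its methods are hypothetical names, and the regex-based word splitting is an assumption standing in for a real analyzer. In Lucene the count would more likely come from per-document term frequencies or from counting span matches than from rescanning the text.

```java
import java.util.Locale;

final class CardinalitySketch {
    // Counts whole-word occurrences of `term` in `fieldText`, ignoring case.
    // Splitting on non-word characters is a stand-in for a real analyzer.
    static int countOccurrences(String fieldText, String term) {
        String wanted = term.toLowerCase(Locale.ROOT);
        int count = 0;
        for (String token : fieldText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.equals(wanted)) {
                count++;
            }
        }
        return count;
    }

    // Models: field ftcontains "term" occurs at least n times
    static boolean occursAtLeast(String fieldText, String term, int n) {
        return countOccurrences(fieldText, term) >= n;
    }
}
```

The other cardinality forms in the spec ("at most", "exactly", "from ... to ...") would only change the comparison in `occursAtLeast`.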

3.4.4 Stemming Option

This allows you to match words that have been indexed against words in the query that have been stemmed like:

title ftcontains "improve" with stemming

which would match even if title contained "improving". Note that PorterStemFilter cannot be used because the decision whether to use stemming or not is specified at query-time and not index-time.

In this case, would I have to add each word to the index twice? Once for the original word and once for the stemmed word (assuming the stemmed word is different from the original word)? Or is there a better way?

3.4.5 Case Option

This allows you to specify -- at query-time -- one of "case insensitive", "case sensitive", "lowercase", "uppercase".

The last two I think can be implemented using a query filter since, for "lowercase", it matches only if the document text is all in lower-case (and same for "uppercase").

But how would you handle the case insensitive/sensitive specifications? One thought is to add every word twice: once in its original case and once in a normalized case (arbitrarily chosen to be, say, lowercase). Any better ideas?

3.4.6 Diacritics Option

This is similar to the Case Option except it's "diacritics insensitive" or "diacritics sensitive". How about implementing this?

3.4.7 Stop Word Option

This allows you to specify -- at query time -- "with stop words", e.g.:

abstract ftcontains "propagating of errors"
with stop words ("a", "the", "of")

would match a document with an abstract that contains "propagating few errors". It seems odd, I know. It's as if the stop words become wildcards, i.e.:

"propagating of errors" -> "propagating * errors"

where * will match any word in the document. How can this be implemented in Lucene?
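The wildcard reading above amounts to a positional phrase match in which stop-word slots accept any token. The following is a minimal sketch of that matching rule in plain Java, not Lucene code; `StopWordPhraseSketch` is a hypothetical name and the document is assumed to be already tokenized. In Lucene, a phrase query built with explicit term positions (leaving a gap where the stop word was) may give a similar effect, but that is worth verifying against the version in use.

```java
import java.util.List;
import java.util.Set;

final class StopWordPhraseSketch {
    // Returns true if `phrase` occurs in `doc` as consecutive tokens, where any
    // phrase word listed in `stopWords` matches ANY single document token.
    static boolean matches(List<String> doc, List<String> phrase, Set<String> stopWords) {
        for (int start = 0; start + phrase.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < phrase.size() && ok; i++) {
                String wanted = phrase.get(i);
                // A stop word acts as a one-token wildcard at its position.
                ok = stopWords.contains(wanted) || wanted.equals(doc.get(start + i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```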

3.5.3 Mild-Not Selection

XQuery has two flavors of "not": (regular) not and mild-not. This allows you to have a query like:

body ftcontains "Mexico" not in "New Mexico"

which would only match documents that contain "Mexico" when it's not part of the phrase "New Mexico". I would guess that you could use a query filter for this, yes?
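Mild-not can be read as "keep an occurrence of the term only if it does not fall inside an occurrence of the excluded phrase". The sketch below models that rule over token positions in plain Java; `MildNotSketch` is a hypothetical name and pre-tokenized input is assumed. Lucene's `SpanNotQuery`, which drops spans that overlap an excluded span, looks like the closest built-in analogue, though whether its semantics line up exactly with the spec is not checked here.

```java
import java.util.List;

final class MildNotSketch {
    // True if `term` occurs at least once at a position that is not covered
    // by any occurrence of `excludedPhrase`.
    static boolean matches(List<String> doc, String term, List<String> excludedPhrase) {
        boolean[] covered = new boolean[doc.size()];
        int len = excludedPhrase.size();
        // Mark every position that lies inside an occurrence of the excluded phrase.
        for (int start = 0; start + len <= doc.size(); start++) {
            if (doc.subList(start, start + len).equals(excludedPhrase)) {
                for (int i = 0; i < len; i++) {
                    covered[start + i] = true;
                }
            }
        }
        // Accept any occurrence of the term outside the marked positions.
        for (int pos = 0; pos < doc.size(); pos++) {
            if (!covered[pos] && doc.get(pos).equals(term)) {
                return true;
            }
        }
        return false;
    }
}
```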

3.6.1 Ordered Selection

This allows you to require that the order of the words in a query match the order of the words in a document, e.g.:

title ftcontains ("web site" ftand "usability") ordered

which would match only if the phrase "web site" and the word "usability" both occurred in the document and "usability" comes after "web site" in word order. The Lucene SpanQuery class must have access to word positions, yes? How do you access those?
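Ordered selection reduces to comparing token positions: each operand must be found after the end of the previous one. The sketch below shows that check in plain Java over a pre-tokenized document; `OrderedSketch` and its methods are hypothetical names. In Lucene, `SpanNearQuery` takes an `inOrder` flag and a slop, and works from the positions stored in the index, so it seems the natural starting point.

```java
import java.util.List;

final class OrderedSketch {
    // Index of the first occurrence of `phrase` starting at or after `from`, or -1.
    static int indexOfPhrase(List<String> doc, List<String> phrase, int from) {
        for (int start = from; start + phrase.size() <= doc.size(); start++) {
            if (doc.subList(start, start + phrase.size()).equals(phrase)) {
                return start;
            }
        }
        return -1;
    }

    // True if every operand occurs in `doc`, each one after the end of the
    // previous operand's match. Taking the earliest match each time is enough
    // to decide whether any in-order arrangement exists.
    static boolean orderedMatch(List<String> doc, List<List<String>> operands) {
        int from = 0;
        for (List<String> operand : operands) {
            int at = indexOfPhrase(doc, operand, from);
            if (at < 0) {
                return false;
            }
            from = at + operand.size();
        }
        return true;
    }
}
```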

3.6.4 Scope Selection

This allows you to require that words appear in the same "scope", e.g.:

abstract ftcontains "usability" ftand "web site" same sentence

You can also do any combination of {same|different} {sentence|paragraph}. My guess for this would also be to keep track of sentence/paragraph data in a payload. Yes?
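To make the scope idea concrete, the sketch below assigns each token a sentence number and tests whether two terms share one, which is the information a per-token payload would have to carry. It is plain Java, not Lucene code; `ScopeSketch` is a hypothetical name, sentence splitting on `.`, `!` and `?` is a deliberate simplification, and only single-word terms are handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class ScopeSketch {
    // Returns the indexes of the sentences in `text` that contain `term`
    // as a whole word, ignoring case.
    static Set<Integer> sentencesContaining(String text, String term) {
        Set<Integer> result = new HashSet<>();
        String wanted = term.toLowerCase(Locale.ROOT);
        String[] sentences = text.split("[.!?]+");
        for (int i = 0; i < sentences.length; i++) {
            String[] tokens = sentences[i].toLowerCase(Locale.ROOT).split("\\W+");
            if (Arrays.asList(tokens).contains(wanted)) {
                result.add(i);
            }
        }
        return result;
    }

    // Models: ftcontains "a" ftand "b" same sentence
    static boolean sameSentence(String text, String a, String b) {
        Set<Integer> shared = sentencesContaining(text, a);
        shared.retainAll(sentencesContaining(text, b));
        return !shared.isEmpty();
    }
}
```

"Different sentence" would instead ask whether the two terms occur in sentences with different indexes, and paragraph scope would use a paragraph number in place of the sentence number.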

3.7 Ignore Option

Given the partial XQuery:

let $x := <book>
  <title>Web Usability and Practice</title>
  <author>Montana <annotation> this author is
      an expert in Web Usability</annotation> Marigold
  </author>
  <editor>Vera Tudor-Medina on Web <annotation> best
      editor on Web Usability</annotation> Usability
  </editor>
</book>

if I were to have a query:

book ftcontains "Web Usability" without content $x//annotation

then it would not consider any text inside of annotation elements at all. "Web Usability" would be found twice: once in the title element and once in the editor element. Note that in the latter, an annotation element comes smack in the middle of the phrase "Web Usability". My guess for this would also be to use payload data to store the element each word is inside of, then use a filter based on that. Yes?
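The effect of the ignore option can be modelled by dropping tokens from the ignored element before phrase matching, so that the surrounding words become adjacent. The sketch below does this in plain Java; `IgnoreSketch` is a hypothetical name, and each token is assumed to be tagged with only its innermost enclosing element, which sidesteps nesting.

```java
import java.util.ArrayList;
import java.util.List;

final class IgnoreSketch {
    // Counts occurrences of `phrase` after removing every token whose
    // enclosing element (elements.get(i) for words.get(i)) is `ignoredElement`.
    static int countPhrase(List<String> words, List<String> elements,
                           String ignoredElement, List<String> phrase) {
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < words.size(); i++) {
            if (!elements.get(i).equals(ignoredElement)) {
                kept.add(words.get(i));
            }
        }
        int count = 0;
        for (int start = 0; start + phrase.size() <= kept.size(); start++) {
            if (kept.subList(start, start + phrase.size()).equals(phrase)) {
                count++;
            }
        }
        return count;
    }
}
```

With the editor element from the example, "Web" and "Usability" end up next to each other once the annotation tokens are dropped, which is why the phrase is found there.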


I realize this is a lot, but any pointers appreciated. Thanks!


You might be interested in checking out the Lux project I just released on GitHub: https://github.com/msokolov/lux. It integrates the Saxon XQuery processor and Lucene/Solr, providing full text search capabilities via XQuery. The approach I took was to provide a search function that exposes Lucene query functionality directly, rather than to implement XQuery fulltext as such. However, I believe xqft could be implemented using a similar approach. Lux includes two kinds of indexes: path indexes (which include element and attribute names), and text indexes, which include node names as part of the token text (not in a payload). This makes it easy to use existing Lucene Queries.

But to answer your question better: I am pretty sure 3.3 can be implemented using a SpanNearQuery with a large slop.

For 3.4, 3.5, 3.6, and 3.7: In order to allow for query-time analysis choices (like stemming, case-sensitivity, etc) there are two possibilities: create multiple fields, one for each choice of analysis option, or add multiple tokens at the same position, one for each combination of analysis options. However, with the second option you would also need to add some information to each token to indicate which analysis setting was used to create it, and Lucene doesn't give you any help there - you have to play hacks like adding a payload or prefixing the term text somehow.
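The second option, with the term text prefixed, can be sketched as a tiny in-memory positional index. This is plain Java rather than Lucene code: `VariantIndexSketch`, the `exact:`/`lower:`/`stem:` prefixes and `toyStem` are all hypothetical, and the stemmer is a crude suffix stripper used only to make the example run, not a real stemming algorithm.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class VariantIndexSketch {
    // Maps a prefixed term such as "stem:improv" to the positions where it occurs.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Toy stemmer (assumption): lower-cases, then strips a trailing "ing" or "e".
    static String toyStem(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.endsWith("ing") && w.length() > 5) {
            return w.substring(0, w.length() - 3);
        }
        if (w.endsWith("e") && w.length() > 3) {
            return w.substring(0, w.length() - 1);
        }
        return w;
    }

    // Indexes each token three times at the SAME position, once per analysis variant.
    void add(List<String> tokens) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            String token = tokens.get(pos);
            post("exact:" + token, pos);
            post("lower:" + token.toLowerCase(Locale.ROOT), pos);
            post("stem:" + toyStem(token), pos);
        }
    }

    private void post(String key, int pos) {
        postings.computeIfAbsent(key, k -> new ArrayList<>()).add(pos);
    }

    // Picks the variant at query time; stemming takes precedence over case handling.
    List<Integer> positions(String word, boolean caseSensitive, boolean stemming) {
        String key;
        if (stemming) {
            key = "stem:" + toyStem(word);
        } else if (caseSensitive) {
            key = "exact:" + word;
        } else {
            key = "lower:" + word.toLowerCase(Locale.ROOT);
        }
        return postings.getOrDefault(key, Collections.emptyList());
    }
}
```

Because all variants share a position, positional queries (phrases, ordering, proximity) keep working whichever variant is chosen at query time. The cost is roughly three terms per token in the index, which is the trade-off against the multiple-fields option.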

Hmm - just noticed that this question was asked 2 years ago and never answered. Well - it's clearly a big project!
