Possible to rank partial matches in Postgres full text search?

Developer https://www.devze.com 2022-12-20 23:25 Source: web
I'm trying to calculate a ts_rank for a full-text match where some of the terms in the query may not be in the ts_vector against which it is being matched. I would like the rank to be higher in a match where more words match. Seems pretty simple?

Because not all of the terms have to match, I have to OR the operands with |, giving a query such as to_tsquery('one|two|three') (with &, all of them would have to match).

The problem is, the rank value seems to be the same no matter how many words match. In other words, it's maxing rather than multiplying the clauses.

select ts_rank('one two three'::tsvector, to_tsquery('one')); gives 0.0607927.

select ts_rank('one two three'::tsvector, to_tsquery('one|two|three|four')); gives the expected lower value of 0.0455945 because 'four' is not in the vector.

But select ts_rank('one two three'::tsvector, to_tsquery('one|two'));

gives 0.0607927 and likewise

select ts_rank('one two three'::tsvector, to_tsquery('one|two|three'));

gives 0.0607927
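The figures quoted above are consistent with the rank being averaged over the number of terms in the query rather than maxed: three matching terms out of four gives exactly 3/4 of 0.0607927. Below is a small sketch of that arithmetic. It assumes every matching term contributes the same amount, which is inferred only from the numbers in this question and not from the PostgreSQL source.

```python
# Assumption: each matching term contributes the same per-term score, and
# the sum is divided by the number of terms in the OR query. This is a
# reading of the numbers quoted in the question, nothing more.
PER_TERM = 0.0607927  # value reported for to_tsquery('one')


def averaged_rank(matched_terms: int, query_terms: int) -> float:
    """Sum of equal per-term scores divided by the number of query terms."""
    return PER_TERM * matched_terms / query_terms


# (matched terms, total query terms) for the four queries in the question
for matched, total in [(1, 1), (2, 2), (3, 3), (3, 4)]:
    print(matched, total, round(averaged_rank(matched, total), 7))
```

Under this reading, adding a matching term raises the sum and the divisor together, which would explain why the value does not move until a non-matching term such as 'four' is added.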

I would like the result of ts_rank to be higher if more terms match.

Possible?

To counter one possible response: I cannot calculate all possible subsequences of the search query as intersections and then union them all in a query because I am going to be working with large queries. I'm sure there are plenty of arguments against this anyway!

Edit: I'm aware of ts_rank_cd but it does not solve the above problem.


Use the smlar extension (Linux only, AFAIK; written by the same guys that brought us text search).

It has functions for calculating TFIDF, cosine, or overlap similarity between arrays. It supports indexing, so it is fast.
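To make the idea concrete, here is a minimal sketch of what overlap and cosine similarity mean when a document and a query are each treated as a set of terms. The function names are made up for illustration. This is plain Python, not smlar's API, and smlar's exact formulas may differ.

```python
import math


def overlap(a, b):
    """Number of distinct terms the two arrays have in common."""
    return len(set(a) & set(b))


def cosine(a, b):
    """Cosine similarity of two arrays treated as sets of terms."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))


doc = ["one", "two", "three"]
for query in (["one"], ["one", "two"], ["one", "two", "three"]):
    print(query, overlap(doc, query), cosine(doc, query))
```

Unlike the ts_rank values in the question, both measures increase as more of the query terms are found in the document.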

Another way would be to "spell-check" the query prior to using it, basically removing any query terms that are not in your corpus.
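A rough sketch of that pre-filtering step follows, assuming the corpus vocabulary is already available as a Python set. The names `filter_terms`, `or_query`, and `vocabulary` are hypothetical. In practice the vocabulary might be collected with something like ts_stat over the tsvector column, and the query terms would need to be normalised the same way as the indexed lexemes.

```python
def filter_terms(query_terms, vocabulary):
    """Keep only query terms that occur in the corpus vocabulary, in order."""
    return [term for term in query_terms if term in vocabulary]


def or_query(terms):
    """Join terms into an OR tsquery string, e.g. 'one | two'."""
    return " | ".join(terms)


# Hypothetical vocabulary; in a real system this would come from the corpus.
vocabulary = {"one", "two", "three"}

kept = filter_terms(["one", "two", "three", "four"], vocabulary)
print(or_query(kept))  # 'four' is dropped before the query is built
```

Dropping 'four' here means the query no longer contains a term that cannot match anything in the corpus.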


The conclusion I have come to is to AND the items together with & for the ranking. In my select query (the one that performs the search) the items are ORed with |. This seems to work.
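A minimal sketch of how the two query strings might be built on the client side. The table name `docs`, the column `tsv`, and the helper `build_queries` are hypothetical, and the terms are assumed to be plain single words with no tsquery operators in them. The SQL is shown only as a parameterised string and is not executed here.

```python
def build_queries(terms):
    """Return (filter_query, rank_query) tsquery strings for the terms."""
    filter_query = " | ".join(terms)  # any term may match (used with @@)
    rank_query = " & ".join(terms)    # all terms ANDed (used for ts_rank)
    return filter_query, rank_query


# Hypothetical table `docs` with a tsvector column `tsv`.
SEARCH_SQL = """
SELECT id, ts_rank(tsv, to_tsquery(%s)) AS rank
FROM docs
WHERE tsv @@ to_tsquery(%s)
ORDER BY rank DESC
"""

filter_query, rank_query = build_queries(["one", "two", "three"])
# Parameter order follows the placeholders: rank query first, then filter.
params = (rank_query, filter_query)
print(params)
```

The OR query decides which rows are returned, and the AND query is used only inside ts_rank to order them.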
