开发者

Distributed LSH (locality sensitive hashing)

开发者 https://www.devze.com 2023-01-21 10:57 出处:网络
I want to build a large scalable database with millions of high dimensional vectors using LSH. Since I have to hold all the data in ram for fast querying, the data must be distributed onto multiple se

I want to build a large scalable database with millions of high dimensional vectors using LSH. Since I have to hold all the data in ram for fast querying, the data must be distributed onto multiple servers to hold all the objects.

A naïve approach would be to spread all objects to different servers and send one query to every server. The server with the best answer properly has the right object.

I'm sure there must be some better solution, where a query don't has to be send to all server nodes and similar objects are grouped together on one server.

What would be a good approach 开发者_如何学运维for distributed LSH tables? Maybe there are even some projects out there?

Thanks for any hint.


First, you want to consider the keys by which the data is to be accessed. It is these keys that you'd want to hash - and, if you know the exact keys you want to access, you can hash them to determine which server to query - eliminating the need to query every server.

Things get harder if you don't know the exact keys (as I suspect to be your situation) - the LSH generates a total ordering for your records - where similar records are likely (but not guaranteed) to have the same hash. I think of this as, for example, a mapping of hyperplanes to the length of their normal vector from the origin... hence, for example, if searching for a similar (but non-identical) hyperplane to one that's between 4 and 5 units from the origin, a good place to start looking is among other hyperplanes between 4 and 5 units from the origin. Hence, if this 'distance from origin' is your locality sensitive hash function, you can shard your data using it, and, in doing so - you could reduce load (while increasing worst case latency) by searching only the shard with a matching 'distance from origin' LCH. With this specific LCH, where similarity is linearly correlated with the hash, it would be possible to get an definitive result while only accessing a subset of the distributed servers. This is not the case for all LSH functions.

IMHO, everything depends upon your LSH function - and that choosing depends upon the specifics of your application.

0

精彩评论

暂无评论...
验证码 换一张
取 消