Can a Guid be a good partition key?

开发者 (Developer) https://www.devze.com 2023-03-22 23:50 Source: web

I have to store many gigabytes of data across multiple machines. The files are uniquely identified by a Guid, and each file can be hosted on only one machine. I was wondering if I could use the Guid as a partition key to determine which machine I should use to store the data. If so, what would my partition function be?

Otherwise, how could I partition my data in such a way that all the machines get a very similar load?

Thanks!

P.S. I am not using SQL Server, Oracle or any other DB. This is all in-house code.
P.P.S. The Guids are generated using the .NET function Guid.NewGuid().

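As James said in his comment, you need something that has a good, uniform distribution. Guids are not guaranteed to have this property: they only promise uniqueness, and some of their bits (the version and variant fields) are fixed. I would recommend a hash, even one as simple as a hash of the Guid itself.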

A SHA-1 hash has a good distribution. I wouldn't recommend even/odd hashing unless you plan on only distributing between 2 machines.
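As a rough illustration of the hash-based approach, the sketch below hashes the Guid with SHA-1 and reduces the result modulo the number of machines. The class name GuidPartitioner, the method GetMachineIndex and the machineCount parameter are made up for this example and are not from the question; treat it as one possible partition function, not the only one.

```csharp
using System;
using System.Security.Cryptography;

static class GuidPartitioner
{
    // Hypothetical helper: maps a Guid to a machine index in [0, machineCount).
    public static int GetMachineIndex(Guid id, int machineCount)
    {
        if (machineCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(machineCount));

        using (var sha1 = SHA1.Create())
        {
            // Hash the 16-byte representation of the Guid (20-byte result).
            byte[] hash = sha1.ComputeHash(id.ToByteArray());

            // Build an unsigned 32-bit value from the first four hash bytes.
            // The bytes are combined by hand so the result does not depend
            // on the endianness of the machine running the code.
            uint value = ((uint)hash[0] << 24)
                       | ((uint)hash[1] << 16)
                       | ((uint)hash[2] << 8)
                       | (uint)hash[3];

            // Reduce to the range of valid machine indexes.
            return (int)(value % (uint)machineCount);
        }
    }
}
```

The same Guid always maps to the same index for a given machineCount, so a file can be located again without any lookup table. Changing machineCount changes the mapping for most Guids, which is worth keeping in mind if machines are added later.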

Because GUIDs are random, you could distribute them by storing the odd GUIDs on one machine and the even GUIDs on the other...

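using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Generate a sample of GUIDs to test the even/odd split.
        var tests = new List<Guid>();

        for (int i = 0; i < 100000; i++)
        {
            tests.Add(Guid.NewGuid());
        }

        // Classify each GUID by the parity of the last byte of its 16-byte form.
        Console.WriteLine("Even: " + tests.Count(g => g.ToByteArray().Last() % 2 == 0));
        Console.WriteLine("Odd : " + tests.Count(g => g.ToByteArray().Last() % 2 == 1));
        Console.ReadKey(true);
    }
}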

This gives a nearly equal distribution.

EDIT

Indeed, this will not work when splitting across more than 2 machines, although you could then split again on another byte being odd or even.
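The idea of splitting again on another byte could be sketched as follows: each extra byte whose parity is checked doubles the number of buckets. The names ParityPartitioner, GetBucket and bitCount are invented for this example. The sketch assumes the Guids are random (version 4) ones, as Guid.NewGuid() produces, and only looks at the last six bytes, which carry no fixed version or variant bits.

```csharp
using System;

static class ParityPartitioner
{
    // Hypothetical helper: combines the parity of the last `bitCount` bytes
    // of the Guid into a bucket index in [0, 2^bitCount).
    public static int GetBucket(Guid id, int bitCount)
    {
        // Only the last six bytes are used, so at most six parity bits.
        if (bitCount < 1 || bitCount > 6)
            throw new ArgumentOutOfRangeException(nameof(bitCount));

        byte[] bytes = id.ToByteArray();
        int bucket = 0;

        for (int i = 0; i < bitCount; i++)
        {
            // Lowest bit of the byte: 0 for even, 1 for odd.
            int parity = bytes[bytes.Length - 1 - i] & 1;
            bucket = (bucket << 1) | parity;
        }

        return bucket;
    }
}
```

With bitCount set to 2 this yields four buckets, with 3 it yields eight, and so on. It only fits machine counts that are powers of two, which is one reason a hash reduced modulo the machine count is the more general choice.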

If you want a round-robin distribution, I would look at a synchronized counter whose value you take modulo the number of machines you have, in the classical round-robin manner.

The synchronized counter could be a field in a database, a single web service, a file on the network, etc.: anything that can be incremented every time a file gets placed.
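A minimal in-process sketch of such a counter is shown below. The class name RoundRobinAssigner and its members are invented for illustration. It keeps the counter in memory and makes it thread-safe with Interlocked, which only covers a single process; with several writers on different machines, the counter would have to live in one of the shared places mentioned above.

```csharp
using System;
using System.Threading;

sealed class RoundRobinAssigner
{
    private readonly int _machineCount;

    // Starts at -1 so that the first increment produces 0.
    private long _counter = -1;

    public RoundRobinAssigner(int machineCount)
    {
        if (machineCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(machineCount));
        _machineCount = machineCount;
    }

    // Returns the index of the machine that should receive the next file.
    public int NextMachine()
    {
        // Atomically take the next ticket, then wrap it around the machine count.
        long ticket = Interlocked.Increment(ref _counter);
        return (int)(ticket % _machineCount);
    }
}
```

Unlike a hash of the Guid, the machine chosen this way cannot be recomputed from the Guid later, so the Guid-to-machine assignment would need to be recorded somewhere when the file is placed.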