I have to store many gigabytes of data across multiple machines. The files are uniquely identified by a Guid, and each file can be hosted on only one machine. I was wondering if I could use the Guid as a partition key to determine which machine I should use to store the data. If so, what would my partition function be?
Otherwise, how could I partition my data in such a way that all the machines get a very similar load?
Thanks!
P.S. I am not using SQL Server, Oracle or any other DB. This is all in-house code. P.P.S. The Guids are generated using the .NET function Guid.NewGuid().
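As James said in his comment, you need something that has a good, uniform distribution. Guids in general do not guarantee this property: some versions embed a timestamp or machine identifier, and even the random ones have fixed version and variant bits. I would recommend a hash, even one as simple as a hash of the Guid itself.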
A SHA-1 hash has a good distribution. I wouldn't recommend even/odd hashing unless you plan on only distributing between 2 machines.
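To make the hashing idea concrete, here is a minimal sketch of a partition function, assuming a fixed number of machines numbered from 0. The names GuidPartitioner and GetMachineIndex are made up for illustration; only the SHA-1 and Guid calls come from .NET.

```csharp
using System;
using System.Security.Cryptography;

public static class GuidPartitioner
{
    // Maps a file's Guid to a machine index in [0, machineCount).
    // Hypothetical helper: the name and signature are illustrative only.
    public static int GetMachineIndex(Guid fileId, int machineCount)
    {
        if (machineCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(machineCount));

        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(fileId.ToByteArray());

            // Build an unsigned 64-bit value from the first 8 bytes of the
            // 20-byte digest. Done by hand so the result does not depend on
            // the endianness of the machine running the code.
            ulong value = 0;
            for (int i = 0; i < 8; i++)
            {
                value = (value << 8) | hash[i];
            }

            return (int)(value % (ulong)machineCount);
        }
    }
}
```

Calling GuidPartitioner.GetMachineIndex(fileId, 8) always returns the same index for the same Guid, so the location can be recomputed at read time without a lookup table. Note that changing machineCount changes the result for most Guids, so this sketch only fits a setup where the number of machines stays fixed.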
Because GUIDs are random you could distribute them by storing the odd GUIDs on one machine and the even GUIDs on the other...
using System;
using System.Collections.Generic;
using System.Linq;

class Program
{
    static void Main(string[] args)
    {
        // Generate a sample of GUIDs.
        var tests = new List<Guid>();
        for (int i = 0; i < 100000; i++)
        {
            tests.Add(Guid.NewGuid());
        }
        // Count the GUIDs by the parity of their last byte.
        Console.WriteLine("Even: " + tests.Count(g => g.ToByteArray().Last() % 2 == 0));
        Console.WriteLine("Odd : " + tests.Count(g => g.ToByteArray().Last() % 2 == 1));
        Console.ReadKey(true);
    }
}
This gives a nearly equal distribution.
EDIT
Indeed, this will not work when splitting across more than 2 machines, although you could then split again on another byte being odd or even.
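As a rough sketch of splitting on a second byte, the parity of the last two bytes can be combined into an index for 4 machines. ParityPartitioner is a made-up name, and the choice of bytes 14 and 15 is an assumption (they hold no version or variant bits in the array returned by ToByteArray()).

```csharp
using System;

public static class ParityPartitioner
{
    // Combines the parity of the last two bytes into an index from 0 to 3.
    // Hypothetical helper: the name and the choice of bytes are illustrative.
    public static int GetMachineIndex(Guid fileId)
    {
        byte[] bytes = fileId.ToByteArray();
        int lowBit = bytes[15] & 1;   // parity of the last byte
        int highBit = bytes[14] & 1;  // parity of the second-to-last byte
        return (highBit << 1) | lowBit;
    }
}
```

Each extra parity bit doubles the number of buckets, so this approach only divides the data evenly when the number of machines is a power of two.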
If you want a round-robin distribution, I would look at a synchronized counter whose value you take modulo the number of machines, in the classic round-robin manner.
The synchronized counter could be a field in a database, a single web service, a file on the network, etc.: anything that can be incremented every time a file is placed.
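Here is a minimal sketch of the counter idea, assuming for simplicity that every file is placed by a single process, so an in-memory counter is enough. RoundRobinAssigner is a made-up name; with several processes, the counter would have to live in one of the shared places listed above.

```csharp
using System;
using System.Threading;

public sealed class RoundRobinAssigner
{
    private readonly int _machineCount;
    private long _counter = -1; // starts at -1 so the first ticket is 0

    // Hypothetical helper: an in-process stand-in for the shared counter.
    public RoundRobinAssigner(int machineCount)
    {
        if (machineCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(machineCount));
        _machineCount = machineCount;
    }

    // Returns the machine index for the next file; safe to call from multiple threads.
    public int NextMachineIndex()
    {
        long ticket = Interlocked.Increment(ref _counter);
        return (int)(ticket % _machineCount);
    }
}
```

Unlike a hash of the Guid, the machine chosen this way cannot be recomputed from the Guid later, so the Guid-to-machine assignment has to be recorded somewhere at the time the file is placed.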