I'm using C# and the AWSSDK library from Amazon to test a few things in SimpleDB. All going well so far.
However, I am trying to come up with a neat way of retrieving all Attributes that are applicable to a Domain. This is proving tricky to do without retrieving an Item; obviously, once I have an Item I can get its list of attributes. But what if I have 100,000 Items in a Domain? Let's say the first 70,000 Items in a "Person" Domain have:
FirstName, LastName, Address
And then I hit an Item that has:
FirstName, LastName, Address, Phone
And then I hit another Item around the 80,000 mark which has:
FirstName, LastName, Email, Phone
In the above example, for the Person Domain, how would I get a list that contains:
FirstName, LastName, Address, Email, Phone
...without performing a ridiculous number of select statements?
Many thanks!
For domains with many items, you should be able to get a highly accurate list of attributes using a random-sampling approach. Here's a rough C# sketch:
using System;
using System.Collections.Generic;
using System.Linq;
using Amazon.SimpleDB;
using Amazon.SimpleDB.Model;

// Assumes a configured client; the flattened Items/NextToken response
// properties below follow recent AWSSDK versions.
var simpleDb = new AmazonSimpleDBClient();

// count(*) responses hold the number in a "Count" attribute on a single item.
static int CountOf(SelectResponse response) =>
    int.Parse(response.Items[0].Attributes.First(a => a.Name == "Count").Value);

int domainCount = CountOf(simpleDb.Select(new SelectRequest
{
    SelectExpression = "select count(*) from Person"
}));

const int sampleSize = 2500;
int avgSkipCount = domainCount / sampleSize;
int processedCount = 0;
string nextToken = null;
var attributeNames = new HashSet<string>();
var random = new Random();

do
{
    // Skip a random number of items: the NextToken returned by a limited
    // count(*) query points just past the items it counted.
    int nextSkipCount = Math.Max(1, random.Next(avgSkipCount * 2));
    var countResponse = simpleDb.Select(new SelectRequest
    {
        NextToken = nextToken,
        SelectExpression = "select count(*) from Person limit " + nextSkipCount
    });
    nextToken = countResponse.NextToken;
    processedCount += CountOf(countResponse);

    // Read the single item at the current position and record its attribute names.
    var getResponse = simpleDb.Select(new SelectRequest
    {
        NextToken = nextToken,
        SelectExpression = "select * from Person limit 1"
    });
    nextToken = getResponse.NextToken;
    processedCount++;

    foreach (var item in getResponse.Items)
        foreach (var attribute in item.Attributes)
            attributeNames.Add(attribute.Name);
} while (processedCount < domainCount);
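When the loop completes, attributeNames holds the sampled union. As a trivial usage check against the example data above:

// Prints: Address, Email, FirstName, LastName, Phone
foreach (var name in attributeNames.OrderBy(n => n))
    Console.WriteLine(name);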
The sampling approach depends on the fact that the NextToken returned from a select count(*) query can be used to skip over records in SimpleDB. Mocky has written an excellent explanation of how this works, and I've explained how to do efficient paging the same way with Simple Savant.
This will give you 99% accuracy with most data sets, which should be good enough for most real-world uses. Statistical theory says that a sample of 2,500 gives effectively the same accuracy regardless of the size of the data set, so this method scales to even millions of items. For example, an attribute present on just 0.2% of items has about a 99% chance of turning up in a 2,500-item sample, since the probability of missing it every time is (1 - 0.002)^2500 ≈ 0.007.
This is obviously not ideal, since it still requires a large number of queries, but you should be able to get away with a much smaller sample size if your data set has a relatively limited number of attribute variations.
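For comparison, if the domain is small enough to scan outright, the exact attribute union is just a paged select over every item. Here's a minimal sketch reusing the simpleDb client from above (2500 is SimpleDB's maximum per-request limit):

// Exact but expensive: walk the whole domain once, page by page.
var allNames = new HashSet<string>();
string token = null;
do
{
    var response = simpleDb.Select(new SelectRequest
    {
        NextToken = token,
        SelectExpression = "select * from Person limit 2500"
    });
    foreach (var item in response.Items)
        foreach (var attribute in item.Attributes)
            allNames.Add(attribute.Name);
    token = response.NextToken;
} while (token != null);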