开发者

Insert thousands entities in a reasonnable time into BigTable

开发者 https://www.devze.com 2023-03-13 02:27 出处:网络
I\'m having some issues when I try to insert the 36k french cities into BigTable. I\'m parsing a CSV file and putting every row into the datastore using this piece of code:

I'm having some issues when I try to insert the 36k french cities into BigTable. I'm parsing a CSV file and putting every row into the datastore using this piece of code:

import csv
from databaseModel import *
from google.appengine.ext.db import GqlQuery

def add_cities():
spamReader = csv.reader(open('datas/cities_utf8.txt', 'rb'), delimiter='\t', quotechar='|')
mylist = []
for i in spamReader:
    region = GqlQuery("SELECT __key__ FROM Region WHERE code=:1", i[2].decode("utf-8"))
    mylist.append(InseeCity(region=region.get(), name=i[11].decode("utf-8"), name_f=strip_accents(i[11].decode("utf-8")).lower()))
db.put(mylist)

It's taking around 5 minutes (!!!) to do it with the local dev server, even 10 when deleting them with db.delete() function. When I try it online calling a test.py page containing add_cities(), the 30s timeout is reached. I'm coming from the MySQL world and I think it's a real shame not to add 36k entities in less than a second. I can be wrong in the way to do it, so I'm refering to you:

Thanks :)


First off, it's the datastore, not Bigtable. The datastore uses bigtable, but it adds a lot more on top of that.

The main reason this is going so slowly is that you're doing a query (on the 'Region' kind) for every record you add. This is inevitably going to slow things down substantially. There's two things you can do to speed things up:

  • Use the code of a Region as its key_name, allowing you to do a faster datastore get instead of a query. In fact, since you only need the region's key for the reference property, you needn't fetch the region at all in that case.
  • Cache the region list in memory, or skip storing it in the datastore at all. By its nature, I'm guessing regions is both a small list and infrequently changing, so there may be no need to store it in the datastore in the first place.

In addition, you should use the mapreduce framework when loading large amounts of data to avoid timeouts. It has built-in support for reading CSVs from blobstore blobs, too.


Use the Task Queue. If you want your dataset to process quickly, have your upload handler create a task for each subset of 500 using an offset value.


FWIW we process large CSV's into datastore using mapreduce, with some initial handling/ validation inside a task. Even tasks have a limit (10 mins) at the moment, but that's probably fine for your data size.

Make sure if you're doing inserts,etc. you batch as much as possible - don't insert individual records, and same for lookups - get_by_keyname allows you to pass in an array of keys. (I believe db put has a limit of 200 records at the moment?)

Mapreduce might be overkill for what you're doing now, but it's definitely worth wrapping your head around, it's a must-have for larger data sets.

Lastly, timing of anything on the SDK is largely pointless - think of it as a debugger more than anything else!

0

精彩评论

暂无评论...
验证码 换一张
取 消