I'm trying to store the list of Amazon EC2 images in the Google datastore, and I want to do this with a cron job inside GAE.
import boto.ec2
from google.appengine.ext import db, webapp

class AmazonEC2uswest(db.Model):
    ami = db.StringProperty(required=True)
    mani = db.StringProperty()
    typ = db.StringProperty()
    arch = db.StringProperty()
    state = db.StringProperty()
    owner = db.StringProperty()

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        aws_access_key_id_admin = "<secret>"
        aws_secret_access_key_admin = "<secret>"
        conn_us_west = boto.ec2.connect_to_region('us-west-1',
                                                  aws_access_key_id=aws_access_key_id_admin,
                                                  aws_secret_access_key=aws_secret_access_key_admin,
                                                  is_secure=False)
        # Fetching the image list is fast; the slow part is the loop below.
        liste_images_us_west = conn_us_west.get_all_images()
        laenge_liste_images_us_west = len(liste_images_us_west)
        for i in range(laenge_liste_images_us_west):
            # One datastore put per image.
            datastore_uswest_AMIs = AmazonEC2uswest(ami=liste_images_us_west[i].id,
                                                    mani=str(liste_images_us_west[i].location),
                                                    typ=liste_images_us_west[i].type,
                                                    arch=liste_images_us_west[i].architecture,
                                                    state=liste_images_us_west[i].state,
                                                    owner=liste_images_us_west[i].ownerId)
            datastore_uswest_AMIs.put()
The problem: getting the list with get_all_images() takes only a few seconds, but writing the data to the Google datastore takes far too much CPU time.
On my IBM T42p (P4M at 2 GHz), that piece of code takes roughly one minute!
Is it possible to optimize my code so that it needs less CPU time?
First possible optimisation: create all the entities in your loop, and then call db.put()
with a list of all of them after you're finished. Something like:
entities = []
for i in range(laenge_liste_images_us_west):
    datastore_uswest_AMIs = AmazonEC2uswest(...)
    entities.append(datastore_uswest_AMIs)
db.put(entities)
or:
db.put([AmazonEC2uswest(...) for image in liste_images_us_west])
If that's still too slow, the right thing to do is probably:
- Get the list of images.
- Divide these up into small batches that can each complete comfortably in under 30 seconds. In your example, which currently takes a minute, you want at least 4 batches, probably more; the number of batches should depend on how many images you get back.
- For each batch, add a task to a task queue, specifying which images to add to the DB. This might be done by specifying all the data, or just a range of images to handle. Which you choose depends on being able to store the data temporarily: there's a limit on how much you can put in a task payload; if you go past that, you could use memcache, or store only the image ids rather than all the fields. Or you could create more tasks, so that the data for each batch stays under the limit.
- In the task handler, process just that batch. If you have all the data then great; otherwise get it again with get_all_images(). Then generate and store just the entities that belong to this batch. (A rough sketch of this follows after the list.)
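Here's a minimal sketch of that idea, reusing your AmazonEC2uswest model and the built-in App Engine task queue API. The StoreAMIsTask handler, the /tasks/store_amis URL and the batch size of 25 are placeholder choices, not anything from your code:

from google.appengine.api import taskqueue
from google.appengine.ext import db, webapp
import boto.ec2

aws_access_key_id_admin = "<secret>"
aws_secret_access_key_admin = "<secret>"
BATCH_SIZE = 25  # assumption: small enough to finish well under 30 seconds

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        conn = boto.ec2.connect_to_region('us-west-1',
                                          aws_access_key_id=aws_access_key_id_admin,
                                          aws_secret_access_key=aws_secret_access_key_admin,
                                          is_secure=False)
        images = conn.get_all_images()
        # Enqueue one task per batch, passing only the image ids so the
        # payload stays small.
        for start in range(0, len(images), BATCH_SIZE):
            batch_ids = [img.id for img in images[start:start + BATCH_SIZE]]
            taskqueue.add(url='/tasks/store_amis',
                          params={'ids': ','.join(batch_ids)})

class StoreAMIsTask(webapp.RequestHandler):
    def post(self):
        ids = self.request.get('ids').split(',')
        conn = boto.ec2.connect_to_region('us-west-1',
                                          aws_access_key_id=aws_access_key_id_admin,
                                          aws_secret_access_key=aws_secret_access_key_admin,
                                          is_secure=False)
        # Fetch the image data again, but only for this batch's ids.
        images = conn.get_all_images(image_ids=ids)
        db.put([AmazonEC2uswest(ami=img.id,
                                mani=str(img.location),
                                typ=img.type,
                                arch=img.architecture,
                                state=img.state,
                                owner=img.ownerId)
                for img in images])

You'd also map /tasks/store_amis to StoreAMIsTask in your WSGIApplication; the cron handler itself then finishes quickly because it does no datastore writes.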
You don't have to use tasks; cron alone could handle it if you can remember how far you got the last time the job ran and continue from there next time. But tasks seem appropriate to me.
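If you do go the cron-only route, one way to remember your position between runs is a tiny progress entity. This is only a sketch; CronProgress and BATCH_SIZE are hypothetical names, and the connection is set up exactly as in your original handler:

BATCH_SIZE = 25  # assumption: a chunk the cron handler can finish comfortably

class CronProgress(db.Model):
    # Keyed by region name; stores how far the previous run got.
    offset = db.IntegerProperty(default=0)

class CronAMIsAmazonUS_WEST(webapp.RequestHandler):
    def get(self):
        conn_us_west = boto.ec2.connect_to_region('us-west-1',
                                                  aws_access_key_id=aws_access_key_id_admin,
                                                  aws_secret_access_key=aws_secret_access_key_admin,
                                                  is_secure=False)
        progress = CronProgress.get_or_insert('us-west-1')
        images = conn_us_west.get_all_images()
        # Only handle one slice of the image list per cron run.
        batch = images[progress.offset:progress.offset + BATCH_SIZE]
        db.put([AmazonEC2uswest(ami=img.id,
                                mani=str(img.location),
                                typ=img.type,
                                arch=img.architecture,
                                state=img.state,
                                owner=img.ownerId)
                for img in batch])
        # Advance the checkpoint, wrapping around once every image is stored.
        progress.offset = (progress.offset + BATCH_SIZE) % max(len(images), 1)
        progress.put()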