开发者

How to do a SimpleDB Backup?

开发者 https://www.devze.com 2023-03-10 01:22 出处:网络
I\'m developing a Facebook application that uses SimpleDB to store its data, but I\'ve realized Amazon does not provide a way to backup that data (at least that I know of)

I'm developing a Facebook application that uses SimpleDB to store its data, but I've realized Amazon does not provide a way to backup that data (at least that I know of)

And SimpleDB is slow. You can get about 4 lists per second, each list of 100 records. Not a good way to backup tons of records.

I found some services in the web that offer to do the backup for you, but I'm not comfortable about giving them my AWS Credentials.

So I though about using threads. Problem is that if you do a select for all the keys in the domain, you need to wait for the next_token value of the first page in order to process the second page and so on.

A solution I was thinking for this was to have a new attribute based on the last 2 digits of the Facebook id. So I'd start a thread with a select for "00", another for "01", and so on, potentially having the possibility of running 100 threads and doing backups much faster (at least in theory). A related solution would be to split that domain into 100 domains (so I can backup each one individually), but that would break some of the selects I need to do. Another solution, probably more PHP friendly, would be to use a cron job to backup lets say 10,000 records and save "next_token", then the next job starts at next_token, etc.

Does anyone have a better solution for this? If its a PHP solution it'd be great, but if it involves something else its welcome anyway.

PS: before you mention it, as far as I know, PHP is still not thread safe. And I'开发者_如何学Gom aware that unless I stop the writes during the backup, there will be some consistency problems, but I'm not too worried about it in this particular case.


The approach of creating a proxy shard attribute certainly works, from experience where I am.

Alternatively, what we have done in the past is to break down the backup into a 2 step process, in order to get as much potential for multi-processing as possible (though this is in java and for the write to the backup file we can rely on synchronization to ensure write-safety - not sure what the deal is on php side).

Basically we have one thread which does a select across the data within a domain, but rather than "SELECT * FROM ...", it is just "SELECT itemName FROM ..." to get the keys to the entries needing backing up. These are then dropped into a queue of item keys which a pool of threads read with the getItem API and write in a thread safe manner to the backup file.

This gave us better throughput on a single domain than spinning on a single thread.

Ultimately though, with numerous domains in our nightly backup we ended up reverting back to doing each domain backup in the single thread and "SELECT * FROM domain" type model, mainly because we already had a shedload of threads going on and the thread overburden started to become an issue on the backup processor, but also because the backup program was starting to get dangerously complex.


I've researched this problem as of October 2012. Three major issues seem to govern choice:

  1. There is no 'native' way to ensure a consistent export or import with SimpleDB. It is your responsibility to understand and manage the implications of this w.r.t. your application code.
  2. No managed backup solution is available from Amazon, but a variety of third-party companies offer something in this space (typically with "backup to S3" as an option).
  3. At some volume of data, you'll need to consider a multi-threaded approach which, again, has important implications re: consistency.

If all you need is to dump data from a single domain and your data volumes are low enough such that single-threaded export makes sense, then here is some Python code I wrote which works great for me. No warranty is expressed or implied, only use this if you understand it:

#simpledb2json.py

import boto
import simplejson as json

AWS_KEY = "YOUR_KEY"
AWS_SECRET = "YOUR_SECRET"

DOMAIN = "YOUR_DOMAIN"


def fetch_items(boto_dom, dom_name, offset=None, limit=300):
    offset_predicate = ""

    if offset:
        offset_predicate = " and itemName() > '" + offset + "'"

    query = "select * from " \
        + "`" + dom_name + "`" \
        + " where itemName() is not null" \
        + offset_predicate \
        + " order by itemName() asc limit " + str(limit)

    rs = boto_dom.select(query)

    # by default, boto does not include the simpledb 'key' or 'name' in the
    # dict, it is a separate property. so we add it:
    result = []
    for r in rs:
        r['_itemName'] = r.name
        result.append(r)

    return result


def _main():
    con = boto.connect_sdb(aws_access_key_id=AWS_KEY, aws_secret_access_key=AWS_SECRET)

    dom = con.get_domain(DOMAIN)

    all_items = []
    offset = None

    while True:
        items = fetch_items(dom, DOMAIN, offset=offset)

        if not items:
            break

        all_items += items

        offset = all_items[-1].name

    print json.dumps(all_items, sort_keys=True, indent=4)

if __name__ == "__main__":
    _main()
0

精彩评论

暂无评论...
验证码 换一张
取 消