What is the simplest way I can hide a sensitive identifier, while providing some equivalent means of identifying the data from outside?
For example, let's say I have a database table with records, and one of the fields is a sensitive ID.
ID
2A
1S
etc...
Then I want to have a second column:
ID PublicID
2A AXXX44328
1S KKKZJSAAS
such that when I am given a PublicID I can always determine what ID it refers to:
H(PublicID) = ID
but nobody else is able to do so.
Also note that I want to be able to reproduce the string in at least two different locations. So if I have two servers/databases, the ID 2A has to map to the string AXXX44328 on each of them independently.
I suspect this is like encryption, but with throwing away the public key?
If your IDs are relatively short (15 bytes or less) then I suggest encrypting them with a block cipher, namely the AES. The AES uses a secret key K, which has length 128, 192 or 256 bits (128 bits are enough). Since AES processes a block of exactly 16 bytes, you have to pad your ID a bit. The "usual" padding (known as "PKCS#5") consists in adding n bytes (n >= 1), all of them having value n, such that the resulting length is appropriate (here, you want a length of 16).
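The padding rule described above can be sketched in a few lines of Python. This is only an illustration: the names pad, unpad and BLOCK_SIZE are made up for this example, and a real crypto library would normally apply the padding for you.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append n bytes of value n (1 <= n <= block_size) to fill the last block."""
    n = block_size - (len(data) % block_size)
    return data + bytes([n]) * n


def unpad(padded: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Strip the padding added by pad(), rejecting malformed input."""
    if not padded or len(padded) % block_size != 0:
        raise ValueError("invalid padded length")
    n = padded[-1]
    if n < 1 or n > block_size or padded[-n:] != bytes([n]) * n:
        raise ValueError("invalid padding")
    return padded[:-n]


padded = pad(b"2A")
print(padded.hex())   # 2 data bytes followed by 14 bytes of value 0x0e
print(unpad(padded))  # b'2A'
```

Note that an input that is already a multiple of 16 bytes gains a full extra block of padding, which is why the single-block case is limited to IDs of 15 bytes or less.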
So the transformation of ID (the sensitive data) into S (the scrambled string which can be shown to the public at large) is: S = AESencrypt_K(pad(ID)). The reverse operation is: ID = unpad(AESdecrypt_K(S)). If ID is 16 bytes or more, then encryption will use several invocations of AES, and there are subtleties with regards to how those invocations are linked together. The keyword is chaining mode and the usual answer is "CBC".
Knowledge of the secret key K (the same K) is needed for both operations. This means that whoever can compute S from ID can also compute ID from S, and vice versa.
Now if you need some entities to be able to compute S from ID without giving them the power to do the reverse operation, then things are more complex. In particular, you must not have a deterministic process: if there is a single S which can be computed from ID then anybody can try an exhaustive search on the possible values of ID until a match with a given S is found. So you have to relax the model, in that a given ID may yield a great number of possible scrambled strings S', such that all those S' may be converted back into ID by someone who has the "right" secret value.
This is what you would get from asymmetric encryption. The usual asymmetric encryption algorithm is RSA. With a 1024-bit RSA key and PKCS#1 v1.5 padding, ID could have a size up to 117 bytes, and S' will be 128 bytes long (the size increase corresponds to the injected random data which makes the process non-deterministic). A 1024-bit key is no longer considered adequate, though; with the 2048-bit keys recommended today, the limits become 245 bytes for ID and 256 bytes for S'. If that is too much, you can get shorter encrypted messages with ElGamal encryption over elliptic curves (down to about 40 bytes or so, for an up-to-20-byte ID), but you may have a hard time finding an existing implementation.
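To make the exhaustive-search argument concrete, here is a small Python sketch. It assumes, purely for illustration, that IDs are two characters drawn from digits and uppercase letters, and it uses an unkeyed SHA-256 as a stand-in for any deterministic transformation that outsiders are able to compute; the function names are hypothetical.

```python
import hashlib
import itertools
import string

ALPHABET = string.digits + string.ascii_uppercase  # assumed ID alphabet


def public_scramble(identifier: str) -> str:
    """Stand-in for any deterministic S = f(ID) that outsiders can compute."""
    return hashlib.sha256(identifier.encode("ascii")).hexdigest()


def brute_force(target: str, length: int = 2):
    """Try every candidate ID of the given length until one maps to target."""
    for chars in itertools.product(ALPHABET, repeat=length):
        candidate = "".join(chars)
        if public_scramble(candidate) == target:
            return candidate
    return None


leaked = public_scramble("2A")
print(brute_force(leaked))  # 2A, found after at most 36**2 = 1296 tries
```

The search space here is tiny, but the same loop works for any ID format that can be enumerated, which is why the forward direction must either need a secret or be randomized.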
It's sufficient to generate a random, unique string of some kind and store it in the database as your public ID. Index the table on the public ID and you can easily retrieve the real ID (and other row values) given the public ID. As the database is private, nobody can work out the ID given the public ID.
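As a rough sketch of that idea, the following Python snippet stores a random public ID next to the real ID in an in-memory SQLite table. The table name, column names and helper functions are invented for this example.

```python
import secrets
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint also creates the index used for lookups by public_id.
conn.execute(
    "CREATE TABLE records (id TEXT PRIMARY KEY, public_id TEXT UNIQUE NOT NULL)"
)


def add_record(real_id: str) -> str:
    """Store real_id together with a freshly generated random public ID."""
    public_id = secrets.token_urlsafe(16)  # 16 random bytes, URL-safe text
    conn.execute(
        "INSERT INTO records (id, public_id) VALUES (?, ?)", (real_id, public_id)
    )
    return public_id


def lookup(public_id: str):
    """Return the real ID for public_id, or None if it is unknown."""
    row = conn.execute(
        "SELECT id FROM records WHERE public_id = ?", (public_id,)
    ).fetchone()
    return row[0] if row else None


token = add_record("2A")
print(lookup(token))  # 2A
```

A purely random value like this cannot be recomputed independently on a second server; the mapping would have to be copied there, so it only fits the question if the two databases can share the table.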
A simple way to generate the random, unique string is to take a hash (SHA-1 for example) of the real ID + some salt value, e.g.
my $public_id = sha1( $salt . $id );
The $salt value should be a long, random string that is generated once, kept on the server and never revealed publicly. It makes it very difficult (nearly impossible) for an attacker to reverse engineer the real ID from the public ID by brute-forcing the hash (which can be quite easy without a salt, if the ID is short and numeric).
The advantage of this approach is that the same $id will always map to the same $public_id, as long as the $salt value stays constant.
If that is not an option, generate a random key and encrypt the real ID with it, and use the encrypted version as a public ID. You can then decrypt this ID later to get the real ID back.
You didn't specify a programming language. Here's an example in PHP, similar to what RJH suggested with SHA1, but uses a proper symmetric encryption algorithm rather than SHA1, eliminating the (even remote) possibility of collisions:
// Note: the mcrypt extension was removed from the PHP core in 7.2.
// On PHP 5.6+ the key must be exactly 16, 24 or 32 bytes long.
define('KEY', 'S4mPhZg3rQga');

// ECB mode ignores the IV, so none is passed.
function encrypt($text) {
    return base64_encode(mcrypt_encrypt(MCRYPT_RIJNDAEL_256, KEY, $text, MCRYPT_MODE_ECB));
}

function decrypt($text) {
    // mcrypt pads the plaintext with NUL bytes; strip them after decryption.
    return rtrim(mcrypt_decrypt(MCRYPT_RIJNDAEL_256, KEY, base64_decode($text), MCRYPT_MODE_ECB), "\0");
}

// example usage:
$C = encrypt('1234');
echo("Public ID: $C\n");
$P = decrypt($C);
echo("Private ID: $P\n");
The value of KEY should be set once, with the same value on both servers, and should never be revealed. You would use encrypt() when displaying data and decrypt() when accepting data from outside. There is no need to actually store the PublicID; you just compute it on the fly.
Since you want to be able to recreate the identifier on two disconnected databases, you'll need to have some kind of shared key.
This is a perfect place for an HMAC. To steal from RFC 2104 by way of Wikipedia:
Let:
H(·) be a cryptographic hash function
K be a secret key padded to the right with extra zeros to the block size of the hash function
m be the message to be authenticated
∥ denote concatenation
⊕ denote exclusive or (XOR)
opad be the outer padding (0x5c5c5c…5c5c, one-block-long hexadecimal constant)
ipad be the inner padding (0x363636…3636, one-block-long hexadecimal constant)
Then HMAC(K,m) is mathematically defined by
HMAC(K,m) = H((K ⊕ opad) ∥ H((K ⊕ ipad) ∥ m)).
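For illustration only, the definition can be transcribed almost literally into Python and checked against the standard library. The function name hmac_from_definition is made up for this sketch, SHA-256 is an arbitrary choice of H, and the step that hashes over-long keys comes from RFC 2104 rather than from the formula quoted above.

```python
import hashlib
import hmac


def hmac_from_definition(key: bytes, message: bytes, hash_name: str = "sha256") -> bytes:
    """Compute H((K xor opad) || H((K xor ipad) || m)) step by step."""
    block_size = hashlib.new(hash_name).block_size
    # RFC 2104: keys longer than one block are hashed first.
    if len(key) > block_size:
        key = hashlib.new(hash_name, key).digest()
    # Pad the key to the right with zeros up to the block size.
    key = key.ljust(block_size, b"\x00")
    k_opad = bytes(b ^ 0x5C for b in key)
    k_ipad = bytes(b ^ 0x36 for b in key)
    inner = hashlib.new(hash_name, k_ipad + message).digest()
    return hashlib.new(hash_name, k_opad + inner).digest()


key = b"abc123secret make me long"
msg = b"This is my unique key #1"
print(hmac_from_definition(key, msg) == hmac.new(key, msg, "sha256").digest())  # True
```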
But, you don't have to implement it yourself! Use your language of choice's standard library. For example, in Python:
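>>> import hmac
>>> # Python 3 needs bytes and an explicit digestmod; 'md5' matches the old
>>> # Python 2 default used for the digest below. Prefer 'sha256' in new code.
>>> h = hmac.new(key=b'abc123secret make me long', msg=b'This is my unique key #1', digestmod='md5')
>>> h.hexdigest()
'c23a224afa917d13fbef58ee14884269'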
Now you have a calculable unique ID. Pre-compute these as the primary keys in your database and look them up as necessary!
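A minimal sketch of the pre-compute-and-look-up step, assuming two servers that share nothing but the key. SHARED_KEY, public_id and build_index are hypothetical names, a plain dict stands in for the database table, and HMAC-SHA256 is one reasonable choice of hash.

```python
import hashlib
import hmac

SHARED_KEY = b"abc123secret make me long"  # hypothetical; same value on every server


def public_id(real_id: str, key: bytes = SHARED_KEY) -> str:
    """Derive the public identifier for real_id with HMAC-SHA256."""
    return hmac.new(key, real_id.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(real_ids):
    """Pre-compute the public ID -> real ID mapping (stand-in for a DB table)."""
    return {public_id(r): r for r in real_ids}


# Each server builds its index independently from its own copy of the IDs.
server_a = build_index(["2A", "1S"])
server_b = build_index(["1S", "2A"])

token = public_id("2A")
print(server_a[token], server_b[token])  # 2A 2A
print(server_a == server_b)              # True
```

Because an HMAC cannot be inverted, going from the public ID back to the real ID always goes through this lookup table rather than through a decryption step.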
As a side note, do NOT use a salted hash (Google: "don't hash secrets") and do NOT use an encrypted version of your data. The former because of length-extension attacks. The latter because you're unnecessarily exposing the data in a manner that relies solely on your key security.
I'd link with more references, but I'm a new user. :-\