ruby base64 encode / decode / unpack('m') troubles_问答_开发者

Having a strange ruby encoding encounter:

ruby-1.9.2-p180 :618 > s = "a8dnsjg8aiw8jq".ljust(16,'=')
 => "a8dnsjg8aiw8jq==" 
ruby-1.9.2-p180 :619 > s.size
 => 16 

ruby-1.9.2-p180 :620 > s.unpack('m0')
ArgumentError: invalid base64
    from (irb):631:in `unpack'

ruby-1.9.2-p180 :621 > s.unpack('m')
 => ["k\xC7g\xB28<j,<\x8E"] 
ruby-1.9.2-p180 :622 > s.unpack('m').first.size
 => 10

ruby-1.9.2-p180 :623 > s.unpack('m').pack('m')
 => "a8dnsjg8aiw8jg==\n" 
ruby-1.9.2-p180 :624 > s.unpack('m').pack('m') == s
 => false

Any idea why this is not symmetric!? And why is 'm0' (decode64_strict) not working at all? The input string is padded out to a multiple of 4 character开发者_如何转开发s in the base64 alphabet. Here it's 14 x 6 bits = 84 bits which is 10 1/2 8-bit bytes, i.e. 11 bytes. But the decoded string seems to drop the last nybble?

Am I missing something obvious or is this a bug? Workaround? cf. http://www.ietf.org/rfc/rfc4648.txt

There is no symmetry because Base64 is not a one-to-one mapping for padded strings. Let's start from the actual decoded content. If you view your decoded string in hex (using e.g. s.unpack('H*') it will be this:

6B C7 67 | B2 38 3C | 6A 2C 3C | 8E

I added the boundaries for each input block to the Base64 algorithm: it takes 3 octets of input and returns 4 characters of output. So our last block contains only one input octet, thus the result will be 4 characters that end in "==" according to the standard.

Let's see what the canonical encoding of that last block would be. In binary representation 8E is 10001110. The RFC tells us to pad the missing bits with zeroes until we reach the required 24 bits:

100011 100000 000000 000000

I made groups of 6 bits, because that's what we need to get the corresponding characters from the Base64 alphabet. The first group (100011) translates to 35 decimal and thus is a j in the Base64 alphabet. The second (100000) is 32 decimal and hence a 'g'. The two remaining characters are to be padded as "==" according to the rules. So the canonical encoding is

jg==

If you look at jq== now, in binary this will be

100011 101010 000000 000000

So the difference is in the second group. But since we already know that only the first 8 bits are of interest to us (the "==" tells us so -> we only will retrieve one decoded octet from these four characters) we actually only care for the first two bits of the second group, because the 6 bits of group 1 and the 2 first bits of group 2 form our decoded octet. 100011 10 together form again our initial 8E byte value. The remaining 16 bits are irrelevant to us and can thus be discarded.

This also implies why the notion of "strict" Base64 encoding makes sense: non-strict decoding will discard any garbage at the end whereas strict decoding will check for the remaining bits to be zero in the final group of 6's. That's why your non-canonical encoding will be rejected by strict decoding rules.

The RFC you've linked says plainly that the final quad of form xx== corresponds to one octet of the input sequence. You cannot make 16 bits of information (two arbitrary octets) out of 12, so rounding up is invalid here.

Your string is rejected in the strict mode, because jq== cannot appear as a result of a correct Base64 encoding process. Input sequence which length is not multiple of 3 is zero-padded, and your string has non-zero bits where they cannot appear:

   j      q      =      =
|100011|101010|000000|000000|
|10001110|10100000|00000000|
          ^^^

From section 3.5 Canonical Encoding of RFC4648:

For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders…

and

In some environments, the alteration is critical and therefore decoders MAY chose to reject an encoding if the pad bits have not been set to zero.

Your last four bytes (jq==) decode to these binary values:

100011 101010
------ --****

The underlined bits are used to form the last encoded byte (hex 8E). The remaining bits (with asterisks under them) are supposed to be zero (which would be encoded jg==, not jq==).

The m unpacking is being forgiving about the padding bits that should be zero but are not. The m0 unpacking is not so forgiving, as it is allowed to be (see “MAY” in the quoted bit from the RFC). Packing the unpacked result is not symmetric because your encoded value is non-canonical, but the the pack method produces a canonical encoding (pad bits equal zero).

Thanks for the good explanations on b64. I've upvoted you all and accepted @emboss's response.

However, this is not the answer I was looking for. To better state the question, it would be,

How to pad a string of b64 characters so that it can be decoded to zero-padded 8-bit bytes by unpack('m0')?

From your explanations I now see that this will work for our purposes:

ruby-1.9.2-p180 :858 >   s = "a8dnsjg8aiw8jq".ljust(16,'A')
 => "a8dnsjg8aiw8jqAA" 
ruby-1.9.2-p180 :859 > s.unpack('m0')
 => ["k\xC7g\xB28<j,<\x8E\xA0\x00"] 
ruby-1.9.2-p180 :861 > s.unpack('m0').pack('m0') == s
 => true

The only problem then being that the decoded string length is not preserved, but we can work around that.