Base64#decode64 does bad things (if you are expecting UTF-8 encoded strings)

While debugging a problem with broken bytes being shown in the browser where we were decoding Base64 strings, I stumbled upon how Base64#decode64 tags the encoding of the strings it returns.


The following session shows that a Unicode-heavy string is encoded correctly as Base64. However, the string that comes back from decoding is tagged as ASCII-8BIT. If you are expecting a UTF-8 string, you simply don't get one.

$ pry
[1] pry(main)> require "base64"
=> true
[2] pry(main)> Base64.encode64("öäöä")
=> "w7bDpMO2w6Q=\n"
[3] pry(main)> Base64.decode64("w7bDpMO2w6Q=\n")
=> "\xC3\xB6\xC3\xA4\xC3\xB6\xC3\xA4"
[4] pry(main)> Base64.decode64("w7bDpMO2w6Q=\n").encoding
=> #<Encoding:ASCII-8BIT>
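
The bytes themselves survive the round trip; only the encoding tag on the returned string differs. A minimal sketch that checks this is below. The variable names are illustrative and not part of the original session, and the string is written with \u escapes so the example does not depend on the source file's encoding.

```ruby
require "base64"

# The same four characters as in the session above, written with \u escapes.
# A literal containing \u escapes is always a UTF-8 string.
original = "\u00F6\u00E4\u00F6\u00E4"
decoded  = Base64.decode64(Base64.encode64(original))

same_bytes    = decoded.bytes == original.bytes        # raw bytes, ignoring the tag
same_encoding = decoded.encoding == original.encoding  # ASCII-8BIT vs UTF-8
equal_strings = decoded == original                    # String#== also considers encoding

puts "same bytes:    #{same_bytes}"
puts "same encoding: #{same_encoding}"
puts "equal strings: #{equal_strings}"
```

The last check is the one that tends to bite in practice: two non-ASCII strings with identical bytes but different encodings do not compare equal, so the decoded value will not match the original UTF-8 string until it is retagged.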


To find the root cause of this, you need to dig into pack.c in the Ruby source code, which implements both Array#pack and String#unpack. The snippet below shows how the encoding of a packed result is chosen for each directive:

switch (type) {
      case 'U':
        /* if encoding is US-ASCII, upgrade to UTF-8 */
        if (enc_info == 1) enc_info = 2;
        break;
      case 'm': case 'M': case 'u':
        /* keep US-ASCII (do nothing) */
        break;
      default:
        /* fall back to BINARY */
        enc_info = 0;
        break;
}

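The important line is the case for 'm', the directive used for Base64. Packed (encoded) output is plain ASCII, so it keeps US-ASCII. Going the other way, unpacking with 'm' can produce arbitrary bytes, so the result is tagged as binary (ASCII-8BIT). This matters because the source for Base64#decode64 is as follows: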

def decode64(str)
  str.unpack("m").first
end
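
To see this outside the library, the sketch below calls String#unpack and Array#pack with the "m" directive directly and compares the results with Base64.decode64. It assumes decode64 delegates to unpack with "m", as described here; the variable names are illustrative.

```ruby
require "base64"

encoded = "w7bDpMO2w6Q=\n"

via_base64 = Base64.decode64(encoded)
via_unpack = encoded.unpack("m").first   # "m" is the Base64 directive

# Packing the decoded bytes again yields the Base64 text, tagged US-ASCII.
repacked = [via_unpack].pack("m")

puts via_unpack.encoding                 # encoding tag of the unpacked bytes
puts repacked.encoding                   # encoding tag of the packed text
puts via_base64 == via_unpack            # same bytes, same tag
```

Both directions go through the same directive, but they tag their results differently: the packed text is ASCII-only, while the unpacked bytes carry no text encoding at all.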

In this case, the string is tagged as ASCII-8BIT (binary) when it is unpacked, which may or may not be desirable. To be sure that you get back a UTF-8 string, force the encoding by chaining force_encoding onto your Base64#decode64 call. Ex:

[5] pry(main)> Base64.decode64("w7bDpMO2w6Q=\n").force_encoding('UTF-8').encode
=> "öäöä"
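
Note that force_encoding only relabels the bytes; it does not check that they are valid UTF-8. If the payload might not be text, a small wrapper can verify it. The sketch below is one way to do that; decode64_utf8 is a hypothetical helper name, not part of the Base64 module.

```ruby
require "base64"

# Hypothetical helper (not part of Base64): decode, retag as UTF-8,
# and fail loudly if the bytes are not valid UTF-8.
def decode64_utf8(str)
  decoded = Base64.decode64(str).force_encoding(Encoding::UTF_8)
  unless decoded.valid_encoding?
    raise ArgumentError, "decoded bytes are not valid UTF-8"
  end
  decoded
end

text = decode64_utf8("w7bDpMO2w6Q=\n")
puts text.encoding   # the tag after retagging
puts text.length     # counted in characters, not bytes
```

Raising on invalid input is a design choice made for this example; depending on the application, replacing bad bytes with String#scrub or returning the binary string untouched may be more appropriate.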