• Kirill Smelkov's avatar
    encoder: Fix protocol 0 UNICODE emission for invalid UTF-8 · 9ee383b6
    Kirill Smelkov authored
    In 9daf6a2a (Fix UNICODE decoding) I fixed protocol 0 UNICODE decoding
    by implementing "raw-unicode-escape" decoder and switching unpickler to
    use that to match 1-to-1 what Python unpickler does.
    
    In 57f875fd (encoder: Fix protocol 0 UNICODE emission) I further fixed
    protocol 0 UNICODE encoding by implementing "raw-unicode-escape" encoder
    and switching pickler to use that, trying to match 1-to-1 what Python
    pickler does.
    
    However there is a difference in between Python unicode and unicode on
    Go side: in Python unicode is immutable array of UCS code points. On Go
    side, unicode is immutable array of bytes, that, similarly to Go string
    are treated as being mostly UTF-8. We did not pick e.g. []rune to
    represent unicode on Go side because []rune is not immutable and so
    cannot be used as map keys.
    
    So Go unicode can be either valid UTF-8 or invalid UTF-8. For valid
    UTF-8 case everything is working as intended: our raw-unicode-escape
    decoder, because it works in terms of UCS, can produce only valid UTF-8,
    while our raw-unicode-escape encoder also matches 1-to-1 what python
    does because for valid UTF-8 utf8.DecodeRuneInString gives good rune and
    the encoder further works in terms of UCS.
    
    However for the case of invalid UTF-8, there is a difference in between
    what Python and Go raw-unicode-escape encoders do: for Python this case
    is simply impossible because there input is []UCS. For the Go case, in
    57f875fd I extended the encoder to do:
    
    	// invalid UTF-8 -> emit byte as is
    	case r == utf8.RuneError:
    		out = append(out, s[0])
    
    which turned out to be not very thoughtful and wrong because
    original raw-unicode-escape also emits UCS < 0x100 as those plain bytes
    and so if we also emit invalid UTF-8 as is then the following two inputs
    would be encoded into the same representation "\x93":
    
    	unicode("\x93")     // invalid UTF-8
    	unicode("\xc2\x93)  // UTF-8 of \u93
    
    which would break `decode/encode = identity` invariant and corrupt the
    data because when loaded back it will be "\xc2\x93" instead of original "\x93".
    
    -> Fix it by rejecting to encode such invalid UTF-8 via protocol 0 UNICODE.
    
    Unfortunately rejection is the only reasonable choice because
    raw-unicode-escape codec does not allow \xAA to be present in the output
    stream and so there is simply no way to represent arbitrary bytes there.
    
    Better to give explicit error instead of corrupting the data.
    
    For protocols ≥ 1 arbitrary unicode - both valid and invalid UTF-8 - can
    still be loaded and saved because *BINUNICODE opcodes come with
    bytestring argument and there is no need to decode/encode those
    bytestrings.
    9ee383b6