    golang_str: TODO UTF-8bk · c0a53847
    Kirill Smelkov authored
    bstr and ustr currently claim that:
      - bstr → ustr → bstr
        is always identity even if bytes data is not valid UTF-8,  and
      - ustr → bstr → ustr
        is always identity even if bytes data is not valid UTF-8.
    This is indeed true for any bytes data.
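    The bytes-side guarantee can be illustrated with the standard
    library alone. This is only a sketch: it uses Python 3's builtin
    'surrogateescape' error handler as a stand-in for the bstr/ustr
    conversion, and the helper name bytes_roundtrip is hypothetical,
    not part of pygolang.

```python
# Sketch with the stdlib codec only; pygolang's bstr/ustr are NOT used here.
# bytes_roundtrip is a hypothetical helper name introduced for illustration.

def bytes_roundtrip(data: bytes) -> bytes:
    """Decode bytes to str and encode back, escaping invalid UTF-8 bytes."""
    text = data.decode('utf-8', 'surrogateescape')   # invalid byte 0xNN -> U+DCNN
    return text.encode('utf-8', 'surrogateescape')   # U+DCNN -> byte 0xNN


samples = [
    b'hello',                  # plain ASCII
    'мир'.encode('utf-8'),     # valid multi-byte UTF-8
    b'\xff\xfe',               # bytes that are never valid in UTF-8
    b'\xed\xb0\x80',           # what would be the UTF-8 form of the surrogate U+DC00
    bytes(range(256)),         # every possible byte value
]

for s in samples:
    assert bytes_roundtrip(s) == s
```

    Every sample, including the invalid sequences, comes back unchanged,
    which is the bytes → str → bytes identity described above.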
    But for some (incorrect) unicode, the conversion from ustr → bstr might
    currently fail as the following example demonstrates:
        # py3
        In [1]: x = u'\udc00'
        In [2]: x.encode('utf-8')
        UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
        In [3]: x.encode('utf-8', 'surrogateescape')
        UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
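    The failure above comes from the range that 'surrogateescape'
    covers: on encoding it only accepts U+DC80..U+DCFF (the escapes of
    bytes 0x80..0xFF), so any other lone surrogate is rejected. A small
    sketch using only the standard library follows; can_encode is a
    hypothetical helper name, and the behaviour shown is that of
    Python 3's builtin codec, not of bstr/ustr themselves.

```python
# Sketch: which lone surrogates the stdlib 'surrogateescape' handler can encode.
# can_encode is a hypothetical helper introduced for illustration.

def can_encode(ch: str) -> bool:
    """Return True if ch encodes to UTF-8 with the surrogateescape handler."""
    try:
        ch.encode('utf-8', 'surrogateescape')
        return True
    except UnicodeEncodeError:
        return False


# Inside the escape range: maps back to a single raw byte.
assert can_encode('\udc80')
assert can_encode('\udcff')
assert '\udc80'.encode('utf-8', 'surrogateescape') == b'\x80'

# Outside the escape range: encoding fails, as in the transcript above.
assert not can_encode('\udc00')   # low surrogate below U+DC80
assert not can_encode('\udc7f')   # last code point below the escape range
assert not can_encode('\ud800')   # high surrogate
```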
    I know how to fix this by adjusting the UTF-8b(*) encoding process a bit,
    but I currently lack the time to do it.
    -> Let's place a corresponding TODO entry.
    Please note, once again, that for arbitrary bytes input the conversion
    bstr → ustr → bstr already always succeeds and works correctly. And it is
    this particular conversion that is most relevant in practice.
    (*) aka surrogateescape in python speak. See
    for original explanation from 2000.
_golang_str.pyx 71.4 KB