    golang_str: TODO UTF-8bk · c0a53847
    Kirill Smelkov authored
    bstr and ustr currently claim that:
    
      - bstr → ustr → bstr
        is always identity even if bytes data is not valid UTF-8, and
    
      - ustr → bstr → ustr
        is always identity even if bytes data is not valid UTF-8.
    
    This is indeed true for any bytes data.
    
    But for some (incorrect) unicode, the conversion from ustr → bstr might
    currently fail as the following example demonstrates:
    
        # py3
        In [1]: x = u'\udc00'
    
        In [2]: x.encode('utf-8')
        UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
    
        In [3]: x.encode('utf-8', 'surrogateescape')
        UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
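
    Why '\udc00' in particular fails can be seen from a minimal sketch that
    uses only the stock CPython codec, not bstr/ustr themselves. The helper
    name can_escape is hypothetical and exists only for this illustration.
    It probes every surrogate code point and collects the ones that the
    surrogateescape handler accepts on encode:

```python
# Plain CPython behaviour; pygolang's bstr/ustr are not involved here.
def can_escape(ch):
    """Report whether ch survives .encode('utf-8', 'surrogateescape')."""
    try:
        ch.encode('utf-8', 'surrogateescape')
        return True
    except UnicodeEncodeError:
        return False

# Probe the whole surrogate block U+D800..U+DFFF.
escapable = [cp for cp in range(0xD800, 0xE000) if can_escape(chr(cp))]

# Only U+DC80..U+DCFF are accepted: they stand for the raw bytes 0x80..0xFF.
print(hex(escapable[0]), hex(escapable[-1]), len(escapable))
```

    Every other surrogate, including U+DC00 from the example above, is
    rejected by the handler.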
    
    I know how to fix this by adjusting the UTF-8b(*) encoding process a bit,
    but I currently lack the time to do it.
    
    -> Let's place a corresponding TODO entry.
    
    Please note, once again, that for arbitrary bytes input the conversion
    from bstr → ustr → bstr always succeeds and works ok already. And it is
    this particular conversion that is most relevant in practice.
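
    The bytes direction can be illustrated with the standard surrogateescape
    handler. This is only a sketch of the analogous property in plain
    Python, assuming the stock codec behaves like the bstr → ustr → bstr
    path in this respect; the roundtrip helper and the sample list are
    made up for the illustration:

```python
import os

def roundtrip(data: bytes) -> bytes:
    """bytes -> str -> bytes through UTF-8 with surrogateescape."""
    text = data.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogateescape')

samples = [bytes([b]) for b in range(256)]   # every single byte value
samples += [
    b'\xff\xfe',                             # never valid in UTF-8
    b'\xed\xb0\x80',                         # UTF-8-style encoding of U+DC00
    'мир'.encode('utf-8')[:-1],              # truncated multi-byte sequence
    os.urandom(64),                          # arbitrary binary data
]

results = [roundtrip(s) == s for s in samples]
print(all(results))
```

    Invalid bytes are mapped to surrogates in U+DC80..U+DCFF on decode and
    mapped back on encode, so the original bytes are recovered exactly.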
    
    (*) aka surrogateescape in Python speak. See
    http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
    for the original explanation from 2000.
_golang_str.pyx 71.4 KB