golang/_golang_str.pyx · c0a53847c85c5d7628047efacb33e73045278379 · Kirill Smelkov / pygolang

Kirill Smelkov authored Oct 09, 2022

bstr and ustr currently claim that:

  - bstr → ustr → bstr
    is always identity even if bytes data is not valid UTF-8,  and

  - ustr → bstr → ustr
    is always identity even if bytes data is not valid UTF-8.

This is indeed true for any bytes data.

But for some (incorrect) unicode, for example a lone surrogate that does
not stand for an escaped byte, the conversion ustr → bstr might currently
fail, as the following example demonstrates:

    # py3
    In [1]: x = u'\udc00'

    In [2]: x.encode('utf-8')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

    In [3]: x.encode('utf-8', 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
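
The failure above follows from how the stdlib surrogateescape handler works:
it only maps the surrogates U+DC80..U+DCFF back to the raw bytes 0x80..0xFF,
and U+DC00 lies outside that range. A minimal sketch of this, using plain
Python 3 str/bytes rather than pygolang's bstr/ustr (the helper name
surrogateescape_encodable is hypothetical and used for illustration only):

```python
def surrogateescape_encodable(ch):
    """Report whether a one-character str encodes with utf-8 + surrogateescape."""
    try:
        ch.encode('utf-8', 'surrogateescape')
        return True
    except UnicodeEncodeError:
        return False

# Surrogates U+DC80..U+DCFF stand for the raw bytes 0x80..0xFF and encode fine.
escaped_ok = all(surrogateescape_encodable(chr(cp))
                 for cp in range(0xDC80, 0xDD00))

# Every other surrogate (U+D800..U+DC7F and U+DD00..U+DFFF) is rejected,
# which includes U+DC00 from the example above.
others = list(range(0xD800, 0xDC80)) + list(range(0xDD00, 0xE000))
others_rejected = not any(surrogateescape_encodable(chr(cp)) for cp in others)

print(escaped_ok, others_rejected)
```
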

I know how to fix this by adjusting the UTF-8b(*) encoding process a bit,
but I currently lack the time to do it.

-> Let's place the corresponding TODO entry.

Please note, once again, that for arbitrary bytes input the conversion
bstr → ustr → bstr always succeeds and already works correctly. And it is
this particular conversion that is most relevant in practice.
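
The bytes round trip can be illustrated with the stdlib surrogateescape
handler. This is only a sketch that uses plain bytes/str as a stand-in for
bstr → ustr → bstr; it does not call pygolang's own implementation, and the
roundtrip helper and the sample inputs are made up for the illustration:

```python
def roundtrip(data):
    # bytes -> str -> bytes through utf-8 + surrogateescape (stdlib only).
    text = data.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogateescape')

samples = [
    b'hello',                       # plain ASCII
    b'\xd0\xbc\xd0\xb8\xd1\x80',    # valid UTF-8 (Cyrillic text)
    b'\xff\xfe',                    # not valid UTF-8
    b'\xd0',                        # truncated 2-byte sequence
    b'\xed\xb0\x80',                # UTF-8-style bytes for U+DC00, invalid in strict UTF-8
    bytes(range(256)),              # every byte value once
]

# Each sample comes back unchanged, whether or not it is valid UTF-8.
all_ok = all(roundtrip(s) == s for s in samples)
print(all_ok)
```
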

(*) aka surrogateescape in Python speak. See
http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
for the original explanation from 2000.

