Commit c0a53847 authored by Kirill Smelkov

golang_str: TODO UTF-8bk

bstr and ustr currently claim that:

  - bstr → ustr → bstr
    is always identity even if bytes data is not valid UTF-8,  and

  - ustr → bstr → ustr
    is always identity even if bytes data is not valid UTF-8.

This indeed holds for any bytes data.

But for some (invalid) unicode, namely surrogate codepoints outside of the
U+DC80..U+DCFF range, the ustr → bstr conversion currently fails, as the
following example demonstrates:

    # py3
    In [1]: x = u'\udc00'

    In [2]: x.encode('utf-8')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

    In [3]: x.encode('utf-8', 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
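
The extent of the problem can be checked with a short sketch. It uses only
the stock Python 3 codecs, not golang_str itself, and the helper name
`roundtrips` is hypothetical. It collects every surrogate codepoint that
survives unicode → bytes → unicode with surrogateescape:

```python
def roundtrips(u):
    """Return True if u survives unicode -> bytes -> unicode via surrogateescape."""
    try:
        b = u.encode('utf-8', 'surrogateescape')
    except UnicodeEncodeError:
        # The error handler refuses surrogates it did not produce itself.
        return False
    return b.decode('utf-8', 'surrogateescape') == u

# Collect every surrogate codepoint (U+D800..U+DFFF) that round-trips.
ok = [cp for cp in range(0xD800, 0xE000) if roundtrips(chr(cp))]
print(hex(ok[0]), hex(ok[-1]), len(ok))   # 0xdc80 0xdcff 128
```

Only the 128 codepoints that surrogateescape uses to represent raw bytes
0x80..0xFF round-trip; all other surrogates raise UnicodeEncodeError.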

I know how to fix this by adjusting the UTF-8b(*) encoding process a bit,
but I currently lack the time to do it.

-> Let's place a corresponding TODO entry.

Please note, once again, that for arbitrary bytes input the conversion
from bstr → ustr → bstr always succeeds and works ok already. And it is
this particular conversion that is most relevant in practice.
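
As an illustration of that guarantee, here is a minimal sketch that again
relies on the stock Python 3 surrogateescape handler rather than on
golang_str (the helper name `bytes_roundtrip` and the sample inputs are
hypothetical):

```python
import os

def bytes_roundtrip(b):
    """bytes -> unicode -> bytes with surrogateescape on both legs."""
    u = b.decode('utf-8', 'surrogateescape')
    return u.encode('utf-8', 'surrogateescape')

samples = [
    b'hello',                        # plain ASCII
    b'\xd0\xbc\xd0\xb8\xd1\x80',     # valid UTF-8
    b'\xff\xfe',                     # bytes that can never appear in UTF-8
    b'\xed\xb0\x80',                 # would-be encoding of U+DC00, rejected by the decoder
    os.urandom(64),                  # arbitrary binary data
]

# Every sample comes back byte-for-byte identical.
assert all(bytes_roundtrip(b) == b for b in samples)
```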

(*) aka surrogateescape in Python speak. See
http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
for the original explanation from 2000.
parent 023907ee
@@ -1731,6 +1731,23 @@ cdef extern from "Python.h":
# ---- UTF-8 encode/decode ----
# TODO(kirr) adjust UTF-8 encode/decode surrogateescape(*) a bit so that not
# only bytes -> unicode -> bytes is always identity for any bytes (this is
# already true), but also unicode -> bytes -> unicode is always identity
# for all unicode codepoints.
#
# The latter currently fails for all surrogate codepoints outside of U+DC80..U+DCFF range:
#
# In [1]: x = u'\udc00'
#
# In [2]: x.encode('utf-8')
# UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
#
# In [3]: x.encode('utf-8', 'surrogateescape')
# UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
#
# (*) aka UTF-8b (see http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html)
from six import unichr # py2: unichr py3: chr
from six import int2byte as bchr # py2: chr py3: lambda x: bytes((x,))