golang_str: TODO UTF-8bk

bstr and ustr currently claim that:

- bstr → ustr → bstr is always identity even if bytes data is not valid UTF-8, and
- ustr → bstr → ustr is always identity even if bytes data is not valid UTF-8.

This is indeed true for any bytes data. But for some (incorrect) unicode, the conversion from ustr → bstr might currently fail, as the following example demonstrates:

    # py3
    In [1]: x = u'\udc00'

    In [2]: x.encode('utf-8')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

    In [3]: x.encode('utf-8', 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

I know how to fix this by adjusting the UTF-8b(*) encoding process a bit, but I currently lack the time to do it.

-> Let's place a corresponding TODO entry.

Please note, once again, that for arbitrary bytes input the conversion bstr → ustr → bstr always succeeds and already works correctly. And it is this particular conversion that is most relevant in practice.

(*) aka surrogateescape in Python speak. See http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html for the original explanation from 2000.
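The asymmetry described above can be reproduced with plain Python 3 and the standard `surrogateescape` error handler, without pygolang. The following is only an illustrative sketch: the helper names `roundtrips_from_bytes` and `roundtrips_from_unicode` are hypothetical and are not part of pygolang, and the built-in codec is assumed to behave the same way as the bstr/ustr conversion for these inputs.

```python
def roundtrips_from_bytes(data: bytes) -> bool:
    """Check bytes -> unicode -> bytes identity via UTF-8 + surrogateescape."""
    text = data.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogateescape') == data


def roundtrips_from_unicode(text: str) -> bool:
    """Check unicode -> bytes -> unicode identity via UTF-8 + surrogateescape.

    Returns False if the encode step raises, or if decoding gives back a
    different string.
    """
    try:
        data = text.encode('utf-8', 'surrogateescape')
    except UnicodeEncodeError:
        return False
    return data.decode('utf-8', 'surrogateescape') == text


if __name__ == '__main__':
    # Arbitrary bytes, including invalid UTF-8, survive the round trip.
    print(roundtrips_from_bytes(b'\xff\xfehello\x80'))
    # A surrogate inside U+DC80..U+DCFF maps back to a single raw byte.
    print(roundtrips_from_unicode('\udc80'))
    # A surrogate outside that range cannot be encoded at all.
    print(roundtrips_from_unicode('\udc00'))
```

The first direction holds for every byte string because each undecodable byte is mapped to a surrogate in U+DC80..U+DCFF and mapped back on encode. The second direction is the one the TODO is about.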

Commit c0a53847 authored Oct 09, 2022 by Kirill Smelkov

Showing with 17 additions and 0 deletions

golang/_golang_str.pyx +17 -0

--- a/golang/_golang_str.pyx
+++ b/golang/_golang_str.pyx
@@ -1731,6 +1731,23 @@ cdef extern from "Python.h":

 # ---- UTF-8 encode/decode ----

+# TODO(kirr) adjust UTF-8 encode/decode surrogateescape(*) a bit so that not
+# only bytes -> unicode -> bytes is always identity for any bytes (this is
+# already true), but also that unicode -> bytes -> unicode is always identity
+# for all unicode codepoints.
+#
+# The latter currently fails for all surrogate codepoints outside of U+DC80..U+DCFF range:
+#
+#   In [1]: x = u'\udc00'
+#
+#   In [2]: x.encode('utf-8')
+#   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
+#
+#   In [3]: x.encode('utf-8', 'surrogateescape')
+#   UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
+#
+# (*) aka UTF-8b (see http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html)
+
 from six import unichr                      # py2: unichr       py3: chr
 from six import int2byte as bchr            # py2: chr          py3: lambda x: bytes((x,))