Uniform UTF8-based approach to strings
Context: together with Jérome we have been struggling to port Zodbtools to Python3 for several years. Despite several incremental attempts[1,2,3] we are not there yet, with the main difficulty being the backward compatibility breakage that Python3 introduced for bytes and unicode. This spring, after I tried once again to finish the port and could not reach a satisfactory result, I finally decided to do something about this at the root of the cause: at the level of strings, where backward compatibility was broken, with the idea to fix everything once and for all.

In 2018, in "Python 3 Losses: Nexedi Perspective"[4] and the associated "cost overview"[5], Jean-Paul highlighted the backward compatibility breakage of strings in Python 3 as the major problem. In 2019 we had some conversations with Jérome about this topic as well[6,7]. In 2020 I started to approach it with `b` and `u`, which provide always-working conversion between bytes and unicode[8], and via limited usage of custom bytes- and unicode-like types that are interoperable with both bytes and unicode simultaneously[9]. Today, with this work, I'm finally exposing those types for general usage, so that the bytes/unicode problem can be handled automatically. The overview of the functionality is provided below:

---- 8< ----

Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings with the idea to make working with byte- and unicode-strings easy and transparently interoperable:

- `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` (*).
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.

The conversion, in both encoding and decoding, never fails and never loses information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity even if bytes data is not valid UTF-8.

Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.
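To make the round-trip property above concrete, here is a minimal sketch in plain Python 3 that uses the standard `surrogateescape` error handler as a stand-in. It is only an analogy chosen for illustration: it does not use Pygolang, it is not claimed to be how `bstr`/`ustr` are implemented, and the helper names `to_text` and `to_bytes` are hypothetical.

```python
def to_text(data: bytes) -> str:
    # Decode as UTF-8. Bytes that are not valid UTF-8 are mapped to lone
    # surrogates (U+DC80..U+DCFF) instead of raising UnicodeDecodeError.
    return data.decode('utf-8', 'surrogateescape')


def to_bytes(text: str) -> bytes:
    # Encode as UTF-8, turning those escape surrogates back into the
    # original raw bytes.
    return text.encode('utf-8', 'surrogateescape')


valid = 'привет'.encode('utf-8')
broken = valid + b'\xff\xfe'          # trailing bytes are not valid UTF-8

# bytes -> text -> bytes is identity for both valid and invalid input
assert to_bytes(to_text(valid)) == valid
assert to_bytes(to_text(broken)) == broken
```

Note that this stdlib stand-in only demonstrates the bytes → text → bytes direction; the text above states that Pygolang's own conversion is an identity in both directions.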
Semantically `bstr` is an array of bytes, while `ustr` is an array of unicode characters. Accessing their elements by `[index]` and iterating them yields a byte and a unicode character correspondingly (+). However it is possible to yield unicode characters when iterating `bstr` via `uiter`, and to yield byte characters when iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of the time, and `ustr` only needs to be used for random access to string characters. See [Strings, bytes, runes and characters in Go](https://blog.golang.org/strings) for an overview of this approach.

Operations in between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while operations in between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce to `ustr`. When the coercion happens, `bytes` and `bytearray`, similarly to `bstr`, are also treated as UTF8-encoded strings.

`bstr` and `ustr` are meant to be drop-in replacements for the standard `str`/`unicode` classes. They support all methods of `str`/`unicode`, and in particular their constructors accept arbitrary objects and either convert or stringify them. For cases when no stringification is desired, and one only wants to convert `bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object with `buffer` interface (%), to a Pygolang string, `b` and `u` provide a way to make sure an object is either `bstr` or `ustr` correspondingly.

Usage example:

```py
from golang import b, u, uiter

s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
s += ' мир'          # s is b('привет мир')
for c in uiter(s):   # c will iterate through
    ...              #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]

# the following gives b('привет мир труд май')
b('привет %s %s %s') % (u'мир',                  # raw unicode
                        u'труд'.encode('utf-8'), # raw bytes
                        u('май'))                # ustr

def f(s):
    s = u(s)         # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
    ...              # (^) the decoding never fails nor loses information.
```

(*) `unicode` on Python2, `str` on Python3.
(+) the ordinal of such a byte or unicode character can be obtained via regular `ord`. For completeness `bbyte` and `uchr` are also provided for constructing a 1-byte `bstr` and a 1-character `ustr` from an ordinal.

(%) data in a buffer, similarly to `bytes` and `bytearray`, is treated as a UTF8-encoded string. Notice that only explicit conversion through `b` and `u` accepts objects with buffer interface. Automatic coercion does not.

---- 8< ----

With this, e.g. zodbtools is finally ported to Python3 easily[10].

One note is that we change `b` and `u` to return `bstr`/`ustr` instead of `bytes`/`unicode`. This is a change in behaviour, but I hope it won't break anything. The reason for this is that the now-returned `bstr` and `ustr` are meant to be drop-in replacements for the standard string types, and that there are not many existing `b` and `u` users. We just need to make sure that the places that already use `b` and `u` continue to work. Those include Zodbtools, Nxdtest[11], and lonet[12], which should continue to work ok.

@klaus, you once said that you use `b` and `u` somewhere as well. Please do not hesitate to let me know if this change causes any issues for you, and we will, hopefully, find a solution.

Kirill

/cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya
/reviewed-and-discussed-on !21

[1] zodbtools!12
[2] zodbtools!13
[3] zodbtools!16
[4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
[6] zodbtools!8 (comment 73726)
[7] zodbtools!13 (comment 81646)
[8] bcb95cd5
[9] edc7aaab
[10] zodbtools@9861c136
[11] https://lab.nexedi.com/nexedi/nxdtest
[12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py
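As an illustration of the byte/character distinction from the overview (indexing a byte-string yields bytes, while `uiter` yields unicode characters), here is a small sketch in plain Python 3. It works on ordinary `bytes` rather than on Pygolang types, and `iter_chars` is a hypothetical helper that only mimics what the description says `uiter` does; the `surrogateescape` handler is an assumption made so that decoding never raises.

```python
def iter_chars(data: bytes):
    """Yield unicode characters from UTF-8 encoded bytes, one at a time."""
    # surrogateescape keeps invalid bytes as escape code points instead of
    # raising, so iteration works on arbitrary byte data.
    for ch in data.decode('utf-8', 'surrogateescape'):
        yield ch


data = 'привет мир'.encode('utf-8')

# Viewed as an array of bytes: 9 Cyrillic letters take 2 bytes each, plus 1 space.
assert len(data) == 19
assert data[0] == 0xd0                 # first byte of UTF-8 encoded 'п'

# Viewed as a sequence of characters.
chars = list(iter_chars(data))
assert len(chars) == 10
assert chars[0] == 'п'
```

This is the same split the overview describes: keep data in its byte representation and iterate characters on demand, switching to a character-indexed representation only when random access to characters is needed.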
New files in this change (mode 0 → 100644):

- golang/_strconv.pxd
- golang/_strconv.pyx
- golang/runtime.cpp
- golang/runtime.h
- golang/runtime/platform.h
- golang/strconv.pxd
- golang/unicode/__init__.py
- golang/unicode/_utf8.pxd
- golang/unicode/utf8.h
- golang/unicode/utf8.pxd