Uniform UTF8-based approach to strings (!21) · Merge Requests · nexedi / pygolang

Uniform UTF8-based approach to strings

Context: together with Jérome we have been struggling to port Zodbtools to
Python3 for several years. Despite several incremental attempts[1,2,3]
we are not there yet, the main difficulty being the backward compatibility breakage
that Python3 introduced for bytes and unicode. During my last attempt this spring, after
I tried once again to finish the port and could not reach a satisfactory
result, I finally decided to do something about this at the root of the
cause: at the level of strings, where backward compatibility was broken, with
the idea to fix everything once and for all.

In 2018, in "Python 3 Losses: Nexedi Perspective"[4] and the associated "cost
overview"[5], Jean-Paul highlighted the backward compatibility breakage of
strings introduced by Python 3 as the major problem.

In 2019 we had some conversations with Jérome about this topic as well[6,7].

In 2020 I started to approach it with `b` and `u`, which provide
always-working conversion between bytes and unicode[8], and via limited
use of custom bytes- and unicode-like types that are interoperable with both
bytes and unicode simultaneously[9].

Today, with this work, I'm finally exposing those types for general usage, so
that the bytes/unicode problem can be handled automatically. An overview of the
functionality is provided below:

---- 8< ----

Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings with
the idea to make working with byte- and unicode-strings easy and transparently
interoperable:

- `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` (*).
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.

The conversion, in both encoding and decoding, never fails and never loses
information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
even if bytes data is not valid UTF-8.
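
The round-trip property for bytes can be illustrated with the Python 3 standard library alone. The sketch below uses the stdlib `surrogateescape` error handler as a stand-in; that this resembles what Pygolang does internally is an assumption, and the snippet does not use `bstr`/`ustr` themselves.

```python
# Sketch with the Python 3 standard library only; not Pygolang's implementation.
raw = b'\xd0\xbf\xd1\x80\xff\xfe'   # 'пр' in UTF-8 followed by two invalid bytes

# Invalid bytes are escaped to lone surrogates instead of raising an error ...
text = raw.decode('utf-8', 'surrogateescape')
# ... and are restored to the original bytes when encoding back.
back = text.encode('utf-8', 'surrogateescape')

assert back == raw          # bytes → text → bytes is identity
assert text[:2] == 'пр'     # the valid part decodes normally

# By contrast, strict decoding rejects the same data.
try:
    raw.decode('utf-8')
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False
assert strict_failed
```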

Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.

Semantically `bstr` is an array of bytes, while `ustr` is an array of
unicode characters. Accessing their elements by `[index]` and iterating them yields a byte and a
unicode character correspondingly (+). ~~Iterating them, however, yields unicode
characters for both `bstr` and `ustr`~~. However, it is possible to yield unicode characters when iterating `bstr` via `uiter`, and to yield byte characters when
iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of the
time, and `ustr` only needs to be used for random access to string characters.
See [Strings, bytes, runes and characters in Go](https://blog.golang.org/strings) for an overview of this approach.
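
To make the difference between byte-wise and character-wise iteration concrete, here is a minimal Python 3 sketch over plain `bytes` and `str`. The helpers `uiter_sketch` and `biter_sketch` are hypothetical names invented for this illustration; they only mimic what `uiter` and `biter` are described to do and are not Pygolang's code.

```python
# Hypothetical stand-ins for uiter/biter built on plain bytes/str (Python 3).

def uiter_sketch(data):
    """Yield unicode characters from UTF-8 bytes; invalid bytes are escaped, not rejected."""
    for ch in data.decode('utf-8', 'surrogateescape'):
        yield ch

def biter_sketch(text):
    """Yield 1-byte bytes objects of the UTF-8 encoding of text."""
    for i in text.encode('utf-8', 'surrogateescape'):
        yield bytes((i,))

s = 'мир'.encode('utf-8')

# Byte view: each Cyrillic letter occupies two bytes in UTF-8.
assert len(s) == 6

# Character view over the same data.
assert list(uiter_sketch(s)) == ['м', 'и', 'р']

# Byte view over text.
assert list(biter_sketch('мир')) == [s[i:i+1] for i in range(len(s))]
```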

Operations between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
operations between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
to `ustr`.  When the coercion happens, `bytes` and `bytearray`, similarly to
`bstr`, are also treated as UTF8-encoded strings.
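
The coercion rule can be sketched with two toy classes. `ToyB` and `ToyU` are hypothetical stand-ins written for this illustration, support only `+`, and assume that the left-hand operand decides the result type; they are not Pygolang's `bstr`/`ustr`.

```python
# Toy illustration of the coercion rule (Python 3); not Pygolang's classes.

def _as_bytes(x):
    # str is encoded as UTF-8; bytes/bytearray are taken as already UTF-8 encoded.
    if isinstance(x, str):
        return x.encode('utf-8', 'surrogateescape')
    if isinstance(x, (bytes, bytearray)):
        return bytes(x)
    raise TypeError('cannot coerce %s' % type(x).__name__)

def _as_text(x):
    # bytes/bytearray are decoded as UTF-8; str is taken as is.
    if isinstance(x, str):
        return str(x)
    if isinstance(x, (bytes, bytearray)):
        return bytes(x).decode('utf-8', 'surrogateescape')
    raise TypeError('cannot coerce %s' % type(x).__name__)

class ToyB(bytes):
    def __add__(self, other):
        return ToyB(bytes(self) + _as_bytes(other))

class ToyU(str):
    def __add__(self, other):
        return ToyU(str(self) + _as_text(other))

x = ToyB('привет'.encode('utf-8')) + ' мир'     # byte-string + text  → byte-string
y = ToyU('привет') + ' мир'.encode('utf-8')     # text + bytes        → text
z = x + y                                       # byte-string + text-string → byte-string

assert type(x) is ToyB and x == 'привет мир'.encode('utf-8')
assert type(y) is ToyU and y == 'привет мир'
assert type(z) is ToyB
```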

`bstr` and `ustr` are meant to be drop-in replacements for the standard
`str`/`unicode` classes. They support all methods of `str`/`unicode` and, in
particular, their constructors accept arbitrary objects and either convert or stringify them. For
cases when no stringification is desired, and one only wants to convert
`bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object with `buffer`
interface (%), to a Pygolang string, `b` and `u` provide a way to make sure an
object is either `bstr` or `ustr` correspondingly.
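
The distinction between stringifying and converting can be sketched in plain Python 3. `ensure_text_sketch` is a hypothetical helper written for this illustration: it accepts text and buffer-like objects only, and rejecting everything else with `TypeError` is an assumption of the sketch rather than documented behaviour of `u`.

```python
# Hypothetical "convert only, never stringify" helper (Python 3); not Pygolang's u().

def ensure_text_sketch(obj):
    """Return obj as str, decoding buffer-like objects as UTF-8 without ever failing on bad bytes."""
    if isinstance(obj, str):
        return obj
    try:
        raw = bytes(memoryview(obj))    # works for bytes, bytearray and other buffers
    except TypeError:
        raise TypeError('not a string or buffer: %s' % type(obj).__name__) from None
    return raw.decode('utf-8', 'surrogateescape')

# Conversion: string-like inputs all end up as the same text.
assert ensure_text_sketch('мир') == 'мир'
assert ensure_text_sketch('мир'.encode('utf-8')) == 'мир'
assert ensure_text_sketch(bytearray('мир'.encode('utf-8'))) == 'мир'
assert ensure_text_sketch(memoryview(b'abc')) == 'abc'

# Stringification: a constructor such as str() accepts arbitrary objects ...
assert str(123) == '123'

# ... while the convert-only helper refuses them.
try:
    ensure_text_sketch(123)
except TypeError:
    rejected = True
else:
    rejected = False
assert rejected
```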

Usage example:

```py
from golang import b, u, uiter   # string helpers provided by Pygolang

s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
s += ' мир'          # s is b('привет мир')
for c in uiter(s):   # c will iterate through
    ...              #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]

# the following gives b('привет мир труд май')
b('привет %s %s %s') % (u'мир',                  # raw unicode
                        u'труд'.encode('utf-8'), # raw bytes
                        u('май'))                # ustr

def f(s):
    s = u(s)          # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
    ...               # (^) the decoding never fails nor loses information.
```

(*) `unicode` on Python2, `str` on Python3.  
(+) the ordinal of such a byte or unicode character can be obtained via regular `ord`.  
&nbsp; &nbsp;   For completeness `bbyte` and `uchr` are also provided for constructing a 1-byte `bstr` and a 1-character `ustr` from an ordinal.  
(%) data in a buffer, similarly to `bytes` and `bytearray`, is treated as a UTF8-encoded string.  
&nbsp; &nbsp;   Notice that only explicit conversion through `b` and `u` accepts objects with buffer interface. Automatic coercion does not.

---- 8< ----

With this, zodbtools, for example, is finally ported to Python3 easily[10].

One note is that we change `b` and `u` to return `bstr`/`ustr` instead of
`bytes`/`unicode`. This is a change in behaviour, but I hope it won't break
anything. The reason for this is that the now-returned `bstr` and `ustr` are meant
to be drop-in replacements for standard string types, and that there are not
many existing `b` and `u` users. We just need to make sure that the places
that already use `b` and `u` continue to work. Those include Zodbtools,
Nxdtest[11], and lonet[12], which should continue to work ok.

@klaus, you once said that you use `b` and `u` somewhere as well. Please do not
hesitate to let me know if this change causes any issues for you, and we will,
hopefully, try to find a solution.

Kirill

/cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya

[1] https://lab.nexedi.com/nexedi/zodbtools/merge_requests/12  
[2] https://lab.nexedi.com/nexedi/zodbtools/merge_requests/13  
[3] https://lab.nexedi.com/nexedi/zodbtools/merge_requests/16  
[4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1  
[5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20  
[6] https://lab.nexedi.com/nexedi/zodbtools/merge_requests/8#note_73726  
[7] https://lab.nexedi.com/nexedi/zodbtools/merge_requests/13#note_81646  
[8] https://lab.nexedi.com/nexedi/pygolang/commit/bcb95cd5  
[9] https://lab.nexedi.com/nexedi/pygolang/commit/edc7aaab  
[10] https://lab.nexedi.com/nexedi/zodbtools/commit/9861c136  
[11] https://lab.nexedi.com/nexedi/nxdtest  
[12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py

--------

**EDIT** 2024-05-07: Adjusted `iter(bstr)` to yield bytes instead of unicode characters as explained in https://lab.nexedi.com/nexedi/pygolang/-/merge_requests/21#note_206044.

Edited Dec 22, 2024 by Kirill Smelkov
