Uniform UTF8-based approach to strings
Context: together with Jérome we've been struggling with porting Zodbtools to Python3 for several years. Despite several incremental attempts[1,2,3] we are not there yet, the main difficulty being the backward compatibility breakage that Python3 introduced for bytes and unicode. During my last attempt this spring, after I tried once again to finish this port and could not reach a satisfactory result, I finally decided to do something about this at the root of the cause: at the level of strings - where backward compatibility was broken - with the idea to fix everything once and for all.
In 2018, in "Python 3 Losses: Nexedi Perspective"[4] and the associated "cost overview"[5], Jean-Paul highlighted the strings backward compatibility breakage introduced by Python 3 as the major problem.
In 2019 we had some conversations with Jérome about this topic as well[6,7].
In 2020 I started to approach it with b and u that provide always-working conversion in between bytes and unicode[8], and via limited usage of custom bytes- and unicode-like types that are interoperable with both bytes and unicode simultaneously[9].
Today, with this work, I'm finally exposing those types for general usage, so that the bytes/unicode problem can be handled automatically. An overview of the functionality is provided below:
---- 8< ----
Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings with the idea to make working with byte- and unicode-strings easy and transparently interoperable:
- bstr is byte-string: it is based on bytes and can automatically convert to/from unicode (*).
- ustr is unicode-string: it is based on unicode and can automatically convert to/from bytes.
The conversion, in both encoding and decoding, never fails and never loses information: bstr→ustr→bstr and ustr→bstr→ustr are always identity even if bytes data is not valid UTF-8.
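The bytes side of this round-trip property can be illustrated with the standard library alone. The sketch below is not Pygolang's implementation: it assumes Python 3 and uses the built-in surrogateescape error handler, and the helper name roundtrip is hypothetical. It only shows that arbitrary bytes, including invalid UTF-8, can survive a decode/encode cycle unchanged.

```python
def roundtrip(data: bytes) -> bytes:
    """Decode bytes as UTF-8 and encode them back without losing information.

    Hypothetical helper for illustration only. Bytes that are not valid
    UTF-8 are carried through the text form as lone surrogates by the
    'surrogateescape' error handler.
    """
    text = data.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogateescape')


valid = 'привет'.encode('utf-8')
invalid = b'\xff\xfeabc\x80'          # not valid UTF-8

assert roundtrip(valid) == valid
assert roundtrip(invalid) == invalid  # decoding neither failed nor dropped bytes
```

Plain surrogateescape guarantees identity only in the bytes→text→bytes direction; the text above states that bstr and ustr provide it in both directions.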
Both bstr and ustr represent strings. They are two different representations of the same entity.
Semantically bstr is array of bytes, while ustr is array of unicode-characters. Accessing their elements by [index] and iterating them yield byte and unicode character correspondingly (+). However it is possible to yield unicode character when iterating bstr via uiter, and to yield byte character when iterating ustr via biter. In practice bstr + uiter is enough 99% of the time, and ustr only needs to be used for random access to string characters.
See Strings, bytes, runes and characters in Go for overview of this approach.
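To make the byte-versus-character distinction concrete, here is a plain Python 3 sketch. The helpers iter_chars and iter_bytes are hypothetical stand-ins that mimic the roles described for uiter and biter; they are not Pygolang's implementation.

```python
def iter_chars(data: bytes):
    """Yield unicode characters from UTF-8 encoded bytes (the role of uiter)."""
    for ch in data.decode('utf-8', 'surrogateescape'):
        yield ch


def iter_bytes(text: str):
    """Yield 1-byte strings from the UTF-8 encoding of text (the role of biter)."""
    for byte in text.encode('utf-8', 'surrogateescape'):
        yield bytes((byte,))


data = 'мир'.encode('utf-8')

assert len(data) == 6                               # 3 characters, 2 bytes each
assert list(iter_chars(data)) == ['м', 'и', 'р']    # character view of the bytes
assert list(iter_bytes('м')) == [b'\xd0', b'\xbc']  # byte view of the text
```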
Operations in between bstr and ustr/unicode/bytes/bytearray coerce to bstr, while operations in between ustr and bstr/bytes/bytearray/unicode coerce to ustr. When the coercion happens, bytes and bytearray, similarly to bstr, are also treated as UTF8-encoded strings.
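The coercion rule can be pictured as follows. This is a simplified sketch in plain Python 3, with hypothetical helpers concat_as_bytes and concat_as_text standing in for what an operation with a bstr or a ustr on the left would do with a mixed-type operand. The real types handle this inside their own operators.

```python
def _to_bytes(x) -> bytes:
    """Convert a string-like operand to UTF-8 bytes."""
    if isinstance(x, (bytes, bytearray)):
        return bytes(x)
    if isinstance(x, str):
        return x.encode('utf-8', 'surrogateescape')
    raise TypeError('unsupported operand type: %s' % type(x).__name__)


def _to_text(x) -> str:
    """Convert a string-like operand to text, treating bytes as UTF-8."""
    if isinstance(x, str):
        return x
    if isinstance(x, (bytes, bytearray)):
        return bytes(x).decode('utf-8', 'surrogateescape')
    raise TypeError('unsupported operand type: %s' % type(x).__name__)


def concat_as_bytes(left, right) -> bytes:
    """Byte-string on the left: the result is coerced to bytes."""
    return _to_bytes(left) + _to_bytes(right)


def concat_as_text(left, right) -> str:
    """Unicode-string on the left: the result is coerced to text."""
    return _to_text(left) + _to_text(right)


hello = 'привет'.encode('utf-8')
assert concat_as_bytes(hello, ' мир') == 'привет мир'.encode('utf-8')
assert concat_as_text('привет', bytearray(' мир'.encode('utf-8'))) == 'привет мир'
```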
bstr and ustr are meant to be drop-in replacements for standard str/unicode classes. They support all methods of str/unicode and in particular their constructors accept arbitrary objects and either convert or stringify them. For cases when no stringification is desired, and one only wants to convert bstr/ustr/unicode/bytes/bytearray, or an object with buffer interface (%), to Pygolang string, b and u provide a way to make sure an object is either bstr or ustr correspondingly.
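The difference between "stringify anything" and "convert only string-like objects" can be sketched in plain Python 3. The helper ensure_text below is hypothetical and only mimics the role described for u. Rejecting other objects with TypeError is an assumption of this sketch, not something stated above.

```python
def ensure_text(obj) -> str:
    """Return obj as text if it is already string-like; never stringify."""
    if isinstance(obj, str):
        return obj
    if isinstance(obj, (bytes, bytearray)):
        return bytes(obj).decode('utf-8', 'surrogateescape')
    # Assumption of this sketch: anything else is rejected.
    raise TypeError('not a string-like object: %s' % type(obj).__name__)


assert str(123) == '123'             # a constructor-style call stringifies anything
assert ensure_text(b'abc') == 'abc'  # string-like input is only converted

try:
    ensure_text(123)
except TypeError:
    rejected = True
else:
    rejected = False
assert rejected                      # a non-string object is not stringified
```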
Usage example:
    from golang import b, u, uiter

    s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
    s += ' мир'          # s is b('привет мир')

    for c in uiter(s):   # c will iterate through
        ...              #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]

    # the following gives b('привет мир труд май')
    b('привет %s %s %s') % (u'мир',                  # raw unicode
                            u'труд'.encode('utf-8'), # raw bytes
                            u('май'))                # ustr

    def f(s):
        s = u(s)         # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
        ...              # (^) the decoding never fails nor loses information.
(*) unicode on Python2, str on Python3.
(+) ordinal of such byte and unicode character can be obtained via regular ord. For completeness bbyte and uchr are also provided for constructing 1-byte bstr and 1-character ustr from ordinal.
(%) data in buffer, similarly to bytes and bytearray, is treated as UTF8-encoded string. Notice that only explicit conversion through b and u accepts objects with buffer interface. Automatic coercion does not.
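For readers unfamiliar with the buffer interface, the following plain Python 3 sketch shows what treating buffer data as a UTF8-encoded string amounts to. The helper text_from_buffer is hypothetical and is not how b and u are implemented. It relies only on the standard memoryview type.

```python
import array


def text_from_buffer(obj) -> str:
    """Decode the raw bytes exposed by a buffer-protocol object as UTF-8."""
    raw = memoryview(obj).tobytes()
    return raw.decode('utf-8', 'surrogateescape')


# memoryview and array.array both expose the buffer interface
assert text_from_buffer(memoryview('мир'.encode('utf-8'))) == 'мир'
assert text_from_buffer(array.array('B', [0x68, 0x69])) == 'hi'
```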
---- 8< ----
With this e.g. zodbtools is finally ported to Python3 easily[10].
One note is that we change b and u to return bstr/ustr instead of bytes/unicode. This is a change in behaviour, but I hope it won't break anything. The reason for this is that now-returned bstr and ustr are meant to be drop-in replacements for standard string types, and that there are not many existing b and u users. We just need to make sure that the places that already use b and u continue to work. Those include Zodbtools, Nxdtest[11], and lonet[12], which should continue to work ok.
@klaus, you once said that you use b and u somewhere as well. Please do not hesitate to let me know if this change causes any issues for you, and we will, hopefully, try to find a solution.
Kirill
/cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya
[1] zodbtools!12 (closed)
[2] zodbtools!13 (merged)
[3] zodbtools!16 (merged)
[4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
[6] zodbtools!8 (comment 73726)
[7] zodbtools!13 (comment 81646)
[8] bcb95cd5
[9] edc7aaab
[10] zodbtools@9861c136
[11] https://lab.nexedi.com/nexedi/nxdtest
[12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py
EDIT 2024-05-07: Adjusted iter(bstr) to yield bytes instead of unicode characters as explained in !21 (comment 206044).