golang_str: Start exposing Pygolang string types publicly

In 2020 in edc7aaab (golang: Teach qq to be usable with both bytes and
str format whatever type qq argument is) I added custom bytes- and
unicode- like types for qq to return instead of str with the idea for
qq's result to be interoperable with both bytes and unicode. Citing that
patch:

    qq is used to quote strings or byte-strings. The following example
    illustrates the problem we are currently hitting in zodbtools with
    Python3:

        >>> "hello %s" % qq("мир")
        'hello "мир"'

        >>> b"hello %s" % qq("мир")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

        >>> "hello %s" % qq(b("мир"))
        'hello "мир"'

        >>> b"hello %s" % qq(b("мир"))
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

    i.e. one way or another if type of format string and what qq returns
    do not match it creates a TypeError.

    We want qq(obj) to be useable with both string and bytestring format.

    For that let's teach qq to return special str- and bytes- derived
    types that know how to automatically convert to str->bytes and
    bytes->str via b/u correspondingly. This way formatting works
    whatever types combination it was for format and for qq, and the
    whole result has the same type as format.

    For now we teach only qq to use new types and don't generally expose
    _str and _unicode to be returned by b and u yet. However we might do
    so in the future after incrementally gaining a bit more experience.

So two years later I gained that experience and found that having a
string type that can interoperate with both bytes and unicode is
generally useful. It is useful for practical backward compatibility with
Python2 and for simplicity of programming, avoiding a constant stream of
encode/decode noise.

Thus the day to expose Pygolang string types for general use has come.

This patch does the first small step: it exposes bytes- and unicode-
like types (now named bstr and ustr) publicly. It switches b and u to
return bstr and ustr correspondingly instead of bytes and unicode. This
is a change in behaviour, but hopefully it should not break anything as
there are not many b/u users currently and bstr and ustr are intended to
be drop-in replacements for standard string types.

Next patches will enhance bstr/ustr step by step to actually be drop-in
replacements for standard string types.

See nexedi/zodbtools!13 (comment 81646) for preliminary discussion from
2019.

See also "Python 3 Losses: Nexedi Perspective"[1] and associated "cost
overview"[2] for related presentation by Jean-Paul from 2018.

[1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
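
The idea cited from edc7aaab (str- and bytes- derived types that convert
to the other kind on demand) can be sketched in plain Python 3 roughly as
below. This is only an illustration and not the Pygolang implementation:
the names BytesLike, TextLike and quote are hypothetical, quote is a toy
stand-in for qq, and strict UTF-8 is assumed throughout.

```python
# Sketch only: hypothetical names, Python 3, strict UTF-8 assumed.

class BytesLike(bytes):
    """bytes whose str() is the UTF-8 decoded text instead of "b'...'"."""
    def __str__(self):
        return self.decode("utf-8")


class TextLike(str):
    """str that b"%s" formatting and bytes() encode as UTF-8."""
    def __bytes__(self):
        return self.encode("utf-8")


def quote(s):
    """Toy stand-in for qq: wrap s in double quotes, keeping its kind."""
    if isinstance(s, bytes):
        return BytesLike(b'"' + s + b'"')
    return TextLike('"' + s + '"')


# All four combinations of format type and argument type now work,
# and the result always has the type of the format string.
text_from_text   = "hello %s" % quote("мир")
text_from_bytes  = "hello %s" % quote("мир".encode("utf-8"))
bytes_from_text  = b"hello %s" % quote("мир")
bytes_from_bytes = b"hello %s" % quote("мир".encode("utf-8"))
```

The sketch relies on two documented hooks: `"%s"` formatting calls
`str()` on its argument, and `b"%s"` formatting accepts any object that
implements `__bytes__`.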

Commit 1f99393d authored Oct 05, 2022 by Kirill Smelkov
8 changed files
--- a/README.rst
+++ b/README.rst
@@ -10,7 +10,7 @@ Package `golang` provides Go-like features for Python:
 - `func` allows to define methods separate from class.
 - `defer` allows to schedule a cleanup from the main control flow.
 - `error` and package `errors` provide error chaining.
-- `b` and `u` provide way to make sure an object is either bytes or unicode.
+- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
 - `gimport` allows to import python modules by full path in a Go workspace.

 Package `golang.pyx` provides__ similar features for Cython/nogil.
@@ -229,19 +229,32 @@ __ https://www.python.org/dev/peps/pep-3134/
 Strings
 -------

-`b` and `u` provide way to make sure an object is either bytes or unicode.
-`b(obj)` converts str/unicode/bytes obj to UTF-8 encoded bytestring, while
-`u(obj)` converts str/unicode/bytes obj to unicode string. For example::
+Pygolang, similarly to Go, provides uniform UTF8-based approach to strings with
+the idea to make working with byte- and unicode- strings easy and transparently
+interoperable:

-   b("привет мир")   # -> gives bytes corresponding to UTF-8 encoding of "привет мир".
+- `bstr` is byte-string: it is based on `bytes` and can automatically convert to `unicode` [*]_.
+- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to `bytes`.
+
+The conversion, in both encoding and decoding, never fails and never looses
+information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
+even if bytes data is not valid UTF-8.
+
+`bstr`/`ustr` constructors will accept arbitrary objects and either convert or stringify them. For
+cases when no stringification is desired, and one only wants to convert
+`bstr`/`ustr` / `unicode`/`bytes`
+to Pygolang string, `b` and `u` provide way to make sure an
+object is either `bstr` or `ustr` correspondingly.
+
+Usage example::
+
+   s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.

   def f(s):
-      s = u(s)       # make sure s is unicode, decoding as UTF-8(*) if it was bytes.
-      ...            # (*) but see below about lack of decode errors.
+      s = u(s)          # make sure s is ustr, decoding as UTF-8(*) if it was bstr or bytes.
+      ...               # (*) the decoding never fails nor looses information.

-The conversion in both encoding and decoding never fails and never looses
-information: `b(u(·))` and `u(b(·))` are always identity for bytes and unicode
-correspondingly, even if bytes input is not valid UTF-8.
+.. [*] `unicode` on Python2, `str` on Python3.


 Import
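
The lossless round-trip stated in the README hunk above (`bstr→ustr→bstr`
is identity even for bytes that are not valid UTF-8) can be illustrated on
Python 3 with the standard `surrogateescape` error handler. This is an
analogy under stated assumptions, not a description of Pygolang internals:
this diff does not show how `b`/`u` treat invalid UTF-8, and the helpers
`to_text` and `to_bytes` are hypothetical names.

```python
# Sketch only: hypothetical helpers built on the stdlib "surrogateescape"
# error handler; this is not how Pygolang's b/u are known to be implemented.

def to_text(data):
    """Decode bytes as UTF-8; undecodable bytes become lone surrogates."""
    return data.decode("utf-8", "surrogateescape")


def to_bytes(text):
    """Encode text as UTF-8; escaped surrogates turn back into raw bytes."""
    return text.encode("utf-8", "surrogateescape")


valid = "привет".encode("utf-8")
broken = b"\xff\xfeabc\xd0"          # not valid UTF-8

# bytes -> text -> bytes is identity for both valid and invalid input.
valid_roundtrip = to_bytes(to_text(valid))
broken_roundtrip = to_bytes(to_text(broken))

# text -> bytes -> text is identity for ordinary text.
text_roundtrip = to_text(to_bytes("привет"))
```

With this handler decoding never raises, so any bytes can take part in
text formatting and still be recovered unchanged.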

--- a/golang/__init__.py
+++ b/golang/__init__.py
@@ -24,7 +24,7 @@
 - `func` allows to define methods separate from class.
 - `defer` allows to schedule a cleanup from the main control flow.
 - `error` and package `errors` provide error chaining.
-- `b` and `u` provide way to make sure an object is either bytes or unicode.
+- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
 - `gimport` allows to import python modules by full path in a Go workspace.

 See README for thorough overview.
@@ -36,7 +36,7 @@ from __future__ import print_function, absolute_import
 __version__ = "0.1"

 __all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
-           'recover', 'func', 'error', 'b', 'u', 'gimport']
+           'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'gimport']

 from golang._gopath import gimport  # make gimport available from golang
 import inspect, sys
@@ -316,7 +316,9 @@ from ._golang import    \
    pypanic     as panic,   \
    pyerror     as error,   \
    pyb         as b,       \
-    pyu         as u
+    pybstr      as bstr,    \
+    pyu         as u,       \
+    pyustr      as ustr

 # import golang.strconv into _golang from here to workaround cyclic golang ↔ strconv dependency
 def _():

--- a/golang/_golang.pxd
+++ b/golang/_golang.pxd
@@ -43,6 +43,7 @@ In addition to Cython/nogil API, golang.pyx provides runtime for golang.py:
 - Python-level channels are represented by pychan + pyselect.
 - Python-level error is represented by pyerror.
 - Python-level panic is represented by pypanic.
+- Python-level strings are represented by pybstr and pyustr.
 """



--- a/golang/_golang_str.pyx
+++ b/golang/_golang_str.pyx
@@ -28,7 +28,7 @@ from libc.stdint cimport uint8_t

 pystrconv = None  # = golang.strconv imported at runtime (see __init__.py)

-def pyb(s): # -> bytes
+def pyb(s): # -> bstr
    """b converts str/unicode/bytes s to UTF-8 encoded bytestring.

       Bytes input is preserved as-is:
@@ -42,8 +42,11 @@ def pyb(s): # -> bytes

       TypeError is raised if type(s) is not one of the above.

-       See also: u.
+       See also: u, bstr/ustr.
    """
+    if type(s) is pybstr:
+        return s
+
    if isinstance(s, bytes):                    # py2: str      py3: bytes
        pass
    elif isinstance(s, unicode):                # py2: unicode  py3: str
@@ -51,9 +54,9 @@ def pyb(s): # -> bytes
    else:
        raise TypeError("b: invalid type %s" % type(s))

-    return s
+    return pybstr(s)

-def pyu(s): # -> unicode
+def pyu(s): # -> ustr
    """u converts str/unicode/bytes s to unicode string.

       Unicode input is preserved as-is:
@@ -69,8 +72,11 @@ def pyu(s): # -> unicode

       TypeError is raised if type(s) is not one of the above.

-       See also: b.
+       See also: b, bstr/ustr.
    """
+    if type(s) is pyustr:
+        return s
+
    if isinstance(s, unicode):                  # py2: unicode  py3: str
        pass
    elif isinstance(s, bytes):                  # py2: str      py3: bytes
@@ -78,22 +84,22 @@ def pyu(s): # -> unicode
    else:
        raise TypeError("u: invalid type %s" % type(s))

-    return s
+    return pyustr(s)


-# __pystr converts obj to str of current python:
+# __pystr converts obj to ~str of current python:
 #
-#   - to bytes,   via b, if running on py2, or
-#   - to unicode, via u, if running on py3.
+#   - to ~bytes,   via b, if running on py2, or
+#   - to ~unicode, via u, if running on py3.
 #
 # It is handy to use __pystr when implementing __str__ methods.
 #
 # NOTE __pystr is currently considered to be internal function and should not
 # be used by code outside of pygolang.
 #
-# XXX we should be able to use _pystr, but py3's str verify that it must have
+# XXX we should be able to use pybstr, but py3's str verify that it must have
 # Py_TPFLAGS_UNICODE_SUBCLASS in its type flags.
-cdef __pystr(object obj):
+cdef __pystr(object obj): # -> ~str
    if PY_MAJOR_VERSION >= 3:
        return pyu(obj)
    else:
@@ -101,8 +107,8 @@ cdef __pystr(object obj):


 # XXX cannot `cdef class`: github.com/cython/cython/issues/711
-class _pystr(bytes):
-    """_str is like bytes but can be automatically converted to Python unicode
+class pybstr(bytes):
+    """bstr is like bytes but can be automatically converted to Python unicode
    string via UTF-8 decoding.

    The decoding never fails nor looses information - see u for details.
@@ -123,8 +129,8 @@ class _pystr(bytes):
            return self


-cdef class _pyunicode(unicode):
-    """_unicode is like unicode(py2)|str(py3) but can be automatically converted
+cdef class pyustr(unicode):
+    """ustr is like unicode(py2)|str(py3) but can be automatically converted
    to bytes via UTF-8 encoding.

    The encoding always succeeds - see b for details.
@@ -139,11 +145,11 @@ cdef class _pyunicode(unicode):
        else:
            return pyb(self)

-# initialize .tp_print for _pystr so that this type could be printed.
+# initialize .tp_print for pybstr so that this type could be printed.
 # If we don't - printing it will result in `RuntimeError: print recursion`
 # because str of this type never reaches real bytes or unicode.
 # Do it only on python2, because python3 does not use tp_print at all.
-# NOTE _pyunicode does not need this because on py2 str(_pyunicode) returns _pystr.
+# NOTE pyustr does not need this because on py2 str(pyustr) returns pybstr.
 IF PY2:
    # NOTE Cython does not define tp_print for PyTypeObject - do it ourselves
    from libc.stdio cimport FILE
@@ -153,12 +159,12 @@ IF PY2:
            printfunc tp_print
        cdef PyTypeObject *Py_TYPE(object)

-    cdef int _pystr_tp_print(PyObject *obj, FILE *f, int nesting) except -1:
+    cdef int _pybstr_tp_print(PyObject *obj, FILE *f, int nesting) except -1:
        o = <bytes>obj
-        o = bytes(buffer(o))  # change tp_type to bytes instead of _pystr
+        o = bytes(buffer(o))  # change tp_type to bytes instead of pybstr
        return Py_TYPE(o).tp_print(<PyObject*>o, f, nesting)

-    Py_TYPE(_pystr()).tp_print = _pystr_tp_print
+    Py_TYPE(pybstr()).tp_print = _pybstr_tp_print


 # qq is substitute for %q, which is missing in python.
@@ -179,9 +185,9 @@ def pyqq(obj):
    # a-la str type (unicode on py3, bytes on py2), that can be transparently
    # converted to unicode or bytes as needed.
    if PY_MAJOR_VERSION >= 3:
-        qobj = _pyunicode(pyu(qobj))
+        qobj = pyu(qobj)
    else:
-        qobj = _pystr(pyb(qobj))
+        qobj = pyb(qobj)

    return qobj


--- a/golang/golang_str_test.py
+++ b/golang/golang_str_test.py
@@ -111,7 +111,7 @@ def test_strings():
    assert isinstance(_, unicode)
    assert u(_) is _

-# verify print for _pystr and _pyunicode
+# verify print for bstr/ustr.
 def test_strings_print():
    outok = readfile(dir_testprog + "/golang_test_str.txt")
    retcode, stdout, stderr = _pyrun(["golang_test_str.py"],

--- a/golang/testprog/golang_test_str.py
+++ b/golang/testprog/golang_test_str.py
@@ -18,7 +18,7 @@
 #
 # See COPYING file for full licensing terms.
 # See https://www.nexedi.com/licensing for rationale and options.
-"""This program helps to verify _pystr and _pyunicode.
+"""This program helps to verify b, u and underlying bstr and ustr.

 It complements golang_str_test.test_strings_print.
 """
@@ -31,6 +31,8 @@ from golang.gcompat import qq
 def main():
    sb = b("привет b")
    su = u("привет u")
+    print("print(b):", sb)
+    print("print(u):", su)
    print("print(qq(b)):", qq(sb))
    print("print(qq(u)):", qq(su))


--- a/golang/testprog/golang_test_str.txt
+++ b/golang/testprog/golang_test_str.txt
+print(b): привет b
+print(u): привет u
 print(qq(b)): "привет b"
 print(qq(u)): "привет u"
--- a/gpython/gpython_test.py
+++ b/gpython/gpython_test.py
 # -*- coding: utf-8 -*-
-# Copyright (C) 2019-2021  Nexedi SA and Contributors.
+# Copyright (C) 2019-2022  Nexedi SA and Contributors.
 #                          Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -71,6 +71,8 @@ def test_golang_builtins():
    assert error  is golang.error
    assert b      is golang.b
    assert u      is golang.u
+    assert bstr   is golang.bstr
+    assert ustr   is golang.ustr

    # indirectly verify golang.__all__
    for k in golang.__all__: