Commit 1f99393d authored by Kirill Smelkov

golang_str: Start exposing Pygolang string types publicly

In 2020, in edc7aaab (golang: Teach qq to be usable with both bytes and
str format whatever type qq argument is), I added custom bytes- and
unicode-like types for qq to return instead of str, so that qq's result
would be interoperable with both bytes and unicode. Citing that patch:

    qq is used to quote strings or byte-strings. The following example
    illustrates the problem we are currently hitting in zodbtools with
    Python3:

        >>> "hello %s" % qq("мир")
        'hello "мир"'

        >>> b"hello %s" % qq("мир")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

        >>> "hello %s" % qq(b("мир"))
        'hello "мир"'

        >>> b"hello %s" % qq(b("мир"))
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

    i.e. one way or another if type of format string and what qq returns do not
    match it creates a TypeError.

    We want qq(obj) to be useable with both string and bytestring format.

    For that let's teach qq to return special str- and bytes- derived types that
    know how to automatically convert to str->bytes and bytes->str via b/u
    correspondingly. This way formatting works whatever types combination it was
    for format and for qq, and the whole result has the same type as format.

    For now we teach only qq to use new types and don't generally expose
    _str and _unicode to be returned by b and u yet. However we might do so
    in the future after incrementally gaining a bit more experience.
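
The mechanism described in the cited patch can be illustrated with a minimal
Python 3 sketch. The class names `BytesLike` and `UnicodeLike` are hypothetical
stand-ins, not Pygolang's actual types, and the standard `surrogateescape`
error handler is used here only as an assumption for "conversion that does not
fail on invalid UTF-8"; the cited patch does not say which codec Pygolang uses.

```python
# Hypothetical stand-ins for illustration; not Pygolang's implementation.

class BytesLike(bytes):
    """bytes subclass that a str format sees as decoded text."""
    def __str__(self):
        # "%s" in a str format calls str(obj), which lands here.
        return self.decode("utf-8", "surrogateescape")


class UnicodeLike(str):
    """str subclass that a bytes format sees as UTF-8 encoded data."""
    def __bytes__(self):
        # "%s" in a bytes format looks up __bytes__ on non-bytes arguments.
        return self.encode("utf-8", "surrogateescape")


# All four format/argument combinations work, and the result always
# has the type of the format string.
text_from_bytes = "hello %s" % BytesLike("мир".encode("utf-8"))
text_from_text = "hello %s" % UnicodeLike("мир")
data_from_bytes = b"hello %s" % BytesLike("мир".encode("utf-8"))
data_from_text = b"hello %s" % UnicodeLike("мир")
```

The sketch relies only on the standard `__str__` and `__bytes__` hooks, which
is why no explicit encode/decode appears at the call sites.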

Two years later I have gained that experience and found that a string
type that can interoperate with both bytes and unicode is generally
useful: it gives practical backward compatibility with Python2 and
simplifies programming by avoiding a constant stream of encode/decode
noise. Thus the day has come to expose Pygolang string types for general
use.

This patch takes the first small step: it publicly exposes the bytes- and
unicode-like types, now named bstr and ustr, and switches b and u to
return bstr and ustr instead of bytes and unicode. This is a change in
behaviour, but it should hopefully not break anything, as there are
currently not many b/u users and bstr and ustr are intended to be
drop-in replacements for the standard string types.

Subsequent patches will enhance bstr/ustr step by step until they really
are drop-in replacements for the standard string types.
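
As a rough illustration of the new b/u behaviour, here is a pure-Python 3
sketch that mirrors the control flow of pyb/pyu in this patch. All names
(`BStrSketch`, `UStrSketch`, `b_sketch`, `u_sketch`) are hypothetical, and
`surrogateescape` is an assumed stand-in for Pygolang's own conversion, so only
the bytes round trip is demonstrated.

```python
# Hypothetical stand-ins for illustration; not Pygolang's implementation.

class BStrSketch(bytes):
    """Stand-in for bstr."""


class UStrSketch(str):
    """Stand-in for ustr."""


def b_sketch(s):
    """Convert str/bytes s to BStrSketch; already-converted input is returned as-is."""
    if type(s) is BStrSketch:
        return s
    if isinstance(s, bytes):
        pass
    elif isinstance(s, str):
        # Assumption: surrogateescape stands in for the real encoder.
        s = s.encode("utf-8", "surrogateescape")
    else:
        raise TypeError("b: invalid type %s" % type(s))
    return BStrSketch(s)


def u_sketch(s):
    """Convert str/bytes s to UStrSketch; already-converted input is returned as-is."""
    if type(s) is UStrSketch:
        return s
    if isinstance(s, str):
        pass
    elif isinstance(s, bytes):
        # Assumption: surrogateescape stands in for the real decoder.
        s = s.decode("utf-8", "surrogateescape")
    else:
        raise TypeError("u: invalid type %s" % type(s))
    return UStrSketch(s)


# Bytes that are not valid UTF-8 still survive bytes -> text -> bytes.
raw = b"\xff\xfe " + "привет".encode("utf-8")
roundtrip = b_sketch(u_sketch(raw))
```

Returning the argument unchanged when it already has the target type keeps
repeated b/u calls cheap and preserves object identity.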

See nexedi/zodbtools!13 (comment 81646)
for preliminary discussion from 2019.

See also "Python 3 Losses: Nexedi Perspective"[1] and the associated "cost
overview"[2] for a related presentation by Jean-Paul from 2018.

[1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
parent ffb40903
......@@ -10,7 +10,7 @@ Package `golang` provides Go-like features for Python:
- `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining.
- `b` and `u` provide way to make sure an object is either bytes or unicode.
- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace.
Package `golang.pyx` provides__ similar features for Cython/nogil.
......@@ -229,19 +229,32 @@ __ https://www.python.org/dev/peps/pep-3134/
Strings
-------
`b` and `u` provide way to make sure an object is either bytes or unicode.
`b(obj)` converts str/unicode/bytes obj to UTF-8 encoded bytestring, while
`u(obj)` converts str/unicode/bytes obj to unicode string. For example::
Pygolang, similarly to Go, provides uniform UTF8-based approach to strings with
the idea to make working with byte- and unicode- strings easy and transparently
interoperable:
b("привет мир") # -> gives bytes corresponding to UTF-8 encoding of "привет мир".
- `bstr` is byte-string: it is based on `bytes` and can automatically convert to `unicode` [*]_.
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to `bytes`.
The conversion, in both encoding and decoding, never fails and never looses
information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
even if bytes data is not valid UTF-8.
`bstr`/`ustr` constructors will accept arbitrary objects and either convert or stringify them. For
cases when no stringification is desired, and one only wants to convert
`bstr`/`ustr` / `unicode`/`bytes`
to Pygolang string, `b` and `u` provide way to make sure an
object is either `bstr` or `ustr` correspondingly.
Usage example::
s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.
def f(s):
s = u(s) # make sure s is unicode, decoding as UTF-8(*) if it was bytes.
... # (*) but see below about lack of decode errors.
s = u(s) # make sure s is ustr, decoding as UTF-8(*) if it was bstr or bytes.
... # (*) the decoding never fails nor looses information.
The conversion in both encoding and decoding never fails and never looses
information: `b(u(·))` and `u(b(·))` are always identity for bytes and unicode
correspondingly, even if bytes input is not valid UTF-8.
.. [*] `unicode` on Python2, `str` on Python3.
Import
......
......@@ -24,7 +24,7 @@
- `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining.
- `b` and `u` provide way to make sure an object is either bytes or unicode.
- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace.
See README for thorough overview.
......@@ -36,7 +36,7 @@ from __future__ import print_function, absolute_import
__version__ = "0.1"
__all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
'recover', 'func', 'error', 'b', 'u', 'gimport']
'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'gimport']
from golang._gopath import gimport # make gimport available from golang
import inspect, sys
......@@ -316,7 +316,9 @@ from ._golang import \
pypanic as panic, \
pyerror as error, \
pyb as b, \
pyu as u
pybstr as bstr, \
pyu as u, \
pyustr as ustr
# import golang.strconv into _golang from here to workaround cyclic golang ↔ strconv dependency
def _():
......
......@@ -43,6 +43,7 @@ In addition to Cython/nogil API, golang.pyx provides runtime for golang.py:
- Python-level channels are represented by pychan + pyselect.
- Python-level error is represented by pyerror.
- Python-level panic is represented by pypanic.
- Python-level strings are represented by pybstr and pyustr.
"""
......
......@@ -28,7 +28,7 @@ from libc.stdint cimport uint8_t
pystrconv = None # = golang.strconv imported at runtime (see __init__.py)
def pyb(s): # -> bytes
def pyb(s): # -> bstr
"""b converts str/unicode/bytes s to UTF-8 encoded bytestring.
Bytes input is preserved as-is:
......@@ -42,8 +42,11 @@ def pyb(s): # -> bytes
TypeError is raised if type(s) is not one of the above.
See also: u.
See also: u, bstr/ustr.
"""
if type(s) is pybstr:
return s
if isinstance(s, bytes): # py2: str py3: bytes
pass
elif isinstance(s, unicode): # py2: unicode py3: str
......@@ -51,9 +54,9 @@ def pyb(s): # -> bytes
else:
raise TypeError("b: invalid type %s" % type(s))
return s
return pybstr(s)
def pyu(s): # -> unicode
def pyu(s): # -> ustr
"""u converts str/unicode/bytes s to unicode string.
Unicode input is preserved as-is:
......@@ -69,8 +72,11 @@ def pyu(s): # -> unicode
TypeError is raised if type(s) is not one of the above.
See also: b.
See also: b, bstr/ustr.
"""
if type(s) is pyustr:
return s
if isinstance(s, unicode): # py2: unicode py3: str
pass
elif isinstance(s, bytes): # py2: str py3: bytes
......@@ -78,22 +84,22 @@ def pyu(s): # -> unicode
else:
raise TypeError("u: invalid type %s" % type(s))
return s
return pyustr(s)
# __pystr converts obj to str of current python:
# __pystr converts obj to ~str of current python:
#
# - to bytes, via b, if running on py2, or
# - to unicode, via u, if running on py3.
# - to ~bytes, via b, if running on py2, or
# - to ~unicode, via u, if running on py3.
#
# It is handy to use __pystr when implementing __str__ methods.
#
# NOTE __pystr is currently considered to be internal function and should not
# be used by code outside of pygolang.
#
# XXX we should be able to use _pystr, but py3's str verify that it must have
# XXX we should be able to use pybstr, but py3's str verify that it must have
# Py_TPFLAGS_UNICODE_SUBCLASS in its type flags.
cdef __pystr(object obj):
cdef __pystr(object obj): # -> ~str
if PY_MAJOR_VERSION >= 3:
return pyu(obj)
else:
......@@ -101,8 +107,8 @@ cdef __pystr(object obj):
# XXX cannot `cdef class`: github.com/cython/cython/issues/711
class _pystr(bytes):
"""_str is like bytes but can be automatically converted to Python unicode
class pybstr(bytes):
"""bstr is like bytes but can be automatically converted to Python unicode
string via UTF-8 decoding.
The decoding never fails nor looses information - see u for details.
......@@ -123,8 +129,8 @@ class _pystr(bytes):
return self
cdef class _pyunicode(unicode):
"""_unicode is like unicode(py2)|str(py3) but can be automatically converted
cdef class pyustr(unicode):
"""ustr is like unicode(py2)|str(py3) but can be automatically converted
to bytes via UTF-8 encoding.
The encoding always succeeds - see b for details.
......@@ -139,11 +145,11 @@ cdef class _pyunicode(unicode):
else:
return pyb(self)
# initialize .tp_print for _pystr so that this type could be printed.
# initialize .tp_print for pybstr so that this type could be printed.
# If we don't - printing it will result in `RuntimeError: print recursion`
# because str of this type never reaches real bytes or unicode.
# Do it only on python2, because python3 does not use tp_print at all.
# NOTE _pyunicode does not need this because on py2 str(_pyunicode) returns _pystr.
# NOTE pyustr does not need this because on py2 str(pyustr) returns pybstr.
IF PY2:
# NOTE Cython does not define tp_print for PyTypeObject - do it ourselves
from libc.stdio cimport FILE
......@@ -153,12 +159,12 @@ IF PY2:
printfunc tp_print
cdef PyTypeObject *Py_TYPE(object)
cdef int _pystr_tp_print(PyObject *obj, FILE *f, int nesting) except -1:
cdef int _pybstr_tp_print(PyObject *obj, FILE *f, int nesting) except -1:
o = <bytes>obj
o = bytes(buffer(o)) # change tp_type to bytes instead of _pystr
o = bytes(buffer(o)) # change tp_type to bytes instead of pybstr
return Py_TYPE(o).tp_print(<PyObject*>o, f, nesting)
Py_TYPE(_pystr()).tp_print = _pystr_tp_print
Py_TYPE(pybstr()).tp_print = _pybstr_tp_print
# qq is substitute for %q, which is missing in python.
......@@ -179,9 +185,9 @@ def pyqq(obj):
# a-la str type (unicode on py3, bytes on py2), that can be transparently
# converted to unicode or bytes as needed.
if PY_MAJOR_VERSION >= 3:
qobj = _pyunicode(pyu(qobj))
qobj = pyu(qobj)
else:
qobj = _pystr(pyb(qobj))
qobj = pyb(qobj)
return qobj
......
......@@ -111,7 +111,7 @@ def test_strings():
assert isinstance(_, unicode)
assert u(_) is _
# verify print for _pystr and _pyunicode
# verify print for bstr/ustr.
def test_strings_print():
outok = readfile(dir_testprog + "/golang_test_str.txt")
retcode, stdout, stderr = _pyrun(["golang_test_str.py"],
......
......@@ -18,7 +18,7 @@
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""This program helps to verify _pystr and _pyunicode.
"""This program helps to verify b, u and underlying bstr and ustr.
It complements golang_str_test.test_strings_print.
"""
......@@ -31,6 +31,8 @@ from golang.gcompat import qq
def main():
sb = b("привет b")
su = u("привет u")
print("print(b):", sb)
print("print(u):", su)
print("print(qq(b)):", qq(sb))
print("print(qq(u)):", qq(su))
......
print(b): привет b
print(u): привет u
print(qq(b)): "привет b"
print(qq(u)): "привет u"
# -*- coding: utf-8 -*-
# Copyright (C) 2019-2021 Nexedi SA and Contributors.
# Copyright (C) 2019-2022 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -71,6 +71,8 @@ def test_golang_builtins():
assert error is golang.error
assert b is golang.b
assert u is golang.u
assert bstr is golang.bstr
assert ustr is golang.ustr
# indirectly verify golang.__all__
for k in golang.__all__:
......