Commits · 3b7128c1c5b28ae65562c2175a38e9d7d600d9e3 · Kirill Smelkov / pygolang

16 Nov, 2022 2 commits
- X patch builtin str/unicode · 3b7128c1
  Kirill Smelkov authored Oct 23, 2022
```
Needs cython@ed7c54e9.
```
  3b7128c1
- X bstr: bytes -> xbytes, unicode -> xunicode · 513be11e
  Kirill Smelkov authored Oct 23, 2022
```
Builtin types bytes and unicode will be patched.
xbytes and xunicode will refer to original unpatched types.
```
  513be11e
25 Oct, 2022 1 commit

golang_str: Fix bstr/ustr slice access on py2 · 300d7dfa

Kirill Smelkov authored Oct 25, 2022

In the patch "golang_str: bstr/ustr index access" we added __getitem__
implementation for bstr/ustr and thorough corresponding tests to cover
all access cases: [i], [i:j] and [i:j:k].

The tests, however, are run via pytest which does AST rewriting and, as
it turned out, always invokes __getitem__ for the [i:j] case, even on py2.
This differs from plain python2 behaviour, which invokes __getslice__ for
the [i:j] case if the __getslice__ slot is present.

Since on py2 both str and unicode provide __getslice__ implementation,
and bstr/ustr inherit from those types, they also inherit __getslice__.
And oops, then on py2 e.g. bstr[i:j] was returning str instead of bstr:

    In [1]: bs = b('αβγ')

    In [2]: bs
    Out[2]: b('αβγ')

    In [3]: bs[0]
    Out[3]: b(b'\xce')

    In [4]: bs[0:1]
    Out[4]: '\xce'              <-- NOTE not b(...)

    In [5]: type(_)
    Out[5]: str                 <-- NOTE not bstr

-> Fix it by explicitly whiting out __getslice__ slot for bstr and ustr.

300d7dfa

09 Oct, 2022 24 commits

golang_str: Cosmetics · 859a55eb
Kirill Smelkov authored Oct 09, 2022

859a55eb

golang_str: TODO UTF-8bk · c0a53847

Kirill Smelkov authored Oct 09, 2022

bstr and ustr currently claim, that:

  - bstr → ustr → bstr
    is always identity even if bytes data is not valid UTF-8,  and

  - ustr → bstr → ustr
    is always identity even if bytes data is not valid UTF-8.

This is indeed true for any bytes data.

But for some (incorrect) unicode, the conversion from ustr → bstr might
currently fail as the following example demonstrates:

    # py3
    In [1]: x = u'\udc00'

    In [2]: x.encode('utf-8')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

    In [3]: x.encode('utf-8', 'surrogateescape')
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed

I know how to fix this by adjusting UTF-8b(*) encoding process a bit,
but I currently lack time to do it.

-> Let's place corresponding todo entry.

Please note, once again, that for arbitrary bytes input the conversion
from bstr → ustr → bstr always succeeds and works ok already. And it is
this particular conversion that is most relevant in practice.

(*) aka surrogateescape in python speak. See
http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
for original explanation from 2000.
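
The round-trip property and the failing case can be reproduced with a minimal
sketch that uses only the standard `surrogateescape` error handler of Python 3,
not pygolang's own encoder (variable names are illustrative):

```python
# Arbitrary bytes, including bytes that are not valid UTF-8.
data = b'\xce\xb2 \xff\xfe'

# bytes -> unicode: every undecodable byte 0xNN becomes lone surrogate U+DCNN.
text = data.decode('utf-8', 'surrogateescape')

# unicode -> bytes: those surrogates are turned back into the original bytes.
roundtrip = text.encode('utf-8', 'surrogateescape')

# U+DC00 lies outside U+DC80..U+DCFF, so the standard handler cannot encode it.
try:
    u'\udc00'.encode('utf-8', 'surrogateescape')
    low_surrogate_ok = True
except UnicodeEncodeError:
    low_surrogate_ok = False
```

Here `roundtrip` equals `data`, while `low_surrogate_ok` stays False, which is
the unicode → bytes gap the TODO entry is about.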

c0a53847

golang_str: bstr/ustr encode/decode · 023907ee

Kirill Smelkov authored Oct 09, 2022

So far we've overridden almost all string methods, that bstr/ustr
inherited from bytes and unicode. However 2 of the methods remained
intact until now: unicode.encode() and bytes.decode(). Let's override
them too for completeness:

- we want ustr.encode() to follow signature of unicode.encode and for ustr.encode('utf-8') to return bstr.
- for consistency we also want ustr.encode() to return the same type
  regardless of which encoding/errors pair is used in the arguments.
- => ustr.encode() always returns bstr.
- we want bstr.decode() to follow signature of bytes.decode and for bstr.decode('utf-8') to return ustr.
- for consistency we also want bstr.decode() to return the same type
  regardless of which encoding/errors pair is used in the arguments.
- => bstr.decode() always returns ustr.

So  ustr.encode() -> bstr  and  bstr.decode() -> ustr.

Let's implement this carrying out encoding/decoding process internally
similarly to regular bytes and unicode and wrapping the result into
corresponding pygolang type at the end.
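
A minimal sketch of that wrapping idea, assuming Python 3 and using hypothetical
stand-in classes `UStr` and `BStr` instead of the real ustr/bstr:

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr."""

    def decode(self, encoding='utf-8', errors='strict'):
        # Decode like regular bytes, then wrap the result.
        return UStr(bytes.decode(self, encoding, errors))


class UStr(str):
    """Hypothetical stand-in for ustr."""

    def encode(self, encoding='utf-8', errors='strict'):
        # Encode like regular str, then wrap the result.
        return BStr(str.encode(self, encoding, errors))


us = UStr(u'мир')
bs = us.encode('cp1251')      # non-UTF-8 encoding, still wrapped
back = bs.decode('cp1251')
```

Whatever encoding/errors pair is passed, the result type stays the same:
`BStr` from `encode` and `UStr` from `decode`.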

023907ee

golang_str: bstr/ustr .format() support · 0985c583

Kirill Smelkov authored Oct 09, 2022

Similarly to %-formatting, let's add support for .format(). This is
easier to do because we can leverage string.Formatter and hook into the
process by proper subclassing. We do not need to implement parsing and
only need to customize handling of 's' and 'r' specifiers.

For testing we mostly reuse existing tests for %-formatting by amending
them a bit to exercise both %-formatting and format-formatting at the
same time: by converting %-format specification into corresponding
{}-format specification and verifying formatting result for that to be
as expected.

Some explicit tests for {}-style .format() are also added.
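
As an illustration of hooking into the standard formatter by subclassing, here
is a minimal sketch for Python 3. The class name `Utf8Formatter` and its
behaviour are assumptions made for the example and are not the pygolang
implementation; only the 's' side is customized here.

```python
import string


class Utf8Formatter(string.Formatter):
    """Hypothetical formatter that shows bytes as UTF-8 text for {} and {!s}."""

    @staticmethod
    def _text(value):
        # Treat bytes-like values as UTF-8 encoded strings.
        if isinstance(value, (bytes, bytearray)):
            return bytes(value).decode('utf-8', 'surrogateescape')
        return value

    def convert_field(self, value, conversion):
        if conversion == 's':
            return str(self._text(value))
        # None, 'r' and 'a' are handled by the standard implementation.
        return super().convert_field(value, conversion)

    def format_field(self, value, format_spec):
        return super().format_field(self._text(value), format_spec)


fmt = Utf8Formatter()
plain = fmt.format('hello {}', b'\xce\xb2')
conv = fmt.format('{!s:>3}|{!r}', b'\xce\xb2', b'\xce\xb2')
```

Parsing of the format string is left entirely to `string.Formatter`; only the
two per-field hooks are overridden.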

0985c583

golang_str: bstr/ustr %-formatting · 390fd810

Kirill Smelkov authored Oct 09, 2022

Teach bstr/ustr to do % formatting similarly to how unicode does, but
with treating bytes as UTF8-encoded strings - all in line with
general idea for bstr/ustr to treat bytes as strings.

The following approach is used to implement this:

1. both bstr and ustr format via bytes-based _bprintf.
2. we parse the format string and handle every formatting specifier separately:
3. for formats besides %s/%r we use bytes.__mod__ directly.

4. for %s we stringify corresponding argument specially with all, potentially
   internal, bytes instances treated as UTF8-encoded strings:

      '%s' % b'\xce\xb2'      ->  "β"
      '%s' % [b'\xce\xb2']    ->  "['β']"

5. for %r, similarly to %s, we prepare repr of corresponding argument
   specially with all, potentially internal, bytes instances also treated as
   UTF8-encoded strings:

      '%r' % b'\xce\xb2'      ->  "b'β'"
      '%r' % [b'\xce\xb2']    ->  "[b'β']"

For "2" we implement %-format parsing ourselves. test_strings_mod
has good coverage for this phase to make sure we get it right and behaving
exactly the same way as standard Python does.

For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
from under bstr.__mod__(). See _bstringify for details.

For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
See _bstringify_repr for details.

I initially tried to avoid parsing format specification myself and
wanted to reuse original bytes.__mod__ and just adjust its behaviour
a bit somehow. This did not work quite right as the following comment
explains:

    # Rejected alternative: try to format; if we get "TypeError: %b requires a
    # bytes-like object ..." retry with that argument converted to bstr.
    #
    # Rejected because e.g. for  `'%(x)s %(x)r' % {'x': obj}`  we need to use
    # access number instead of key 'x' to determine which accesses to
    # bstringify. We could do that, but unfortunately on Python2 the access
    # number is not easily predictable because string could be upgraded to
    # unicode in the midst of being formatted and so some access keys will be
    # accessed more than once.
    #
    # Another reason for rejection: b'%r' and u'%r' handle arguments
    # differently - on b %r is aliased to %a.

That's why full %-format parsing and handling is implemented in this
patch. Once again to make sure its behaviour is really the same compared
to Python's builtin %-formatting, we have good test coverage for both
%-format parsing itself, and for actual formatting of many various cases.

See test_strings_mod for details.
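
To give a feeling for the "parse and handle every specifier separately" idea,
here is a deliberately simplified sketch for Python 3. It is not `_bprintf`:
the helper name `utf8_mod` is hypothetical, it formats into str rather than
bytes, it supports only positional arguments, and it handles neither `*`
width nor `%(key)s` mappings.

```python
import re

# One conversion specifier: % [flags] [width] [.precision] conversion
_SPEC = re.compile(r'%[#0\- +]*\d*(?:\.\d+)?([diouxXeEfFgGcrsa%])')


def utf8_mod(fmt, args):
    """Format str `fmt` with `args`, showing bytes given to %s as UTF-8 text."""
    if not isinstance(args, tuple):
        args = (args,)
    pending = list(args)
    out = []
    pos = 0
    for m in _SPEC.finditer(fmt):
        out.append(fmt[pos:m.start()])      # literal text before the specifier
        pos = m.end()
        conv = m.group(1)
        if conv == '%':                     # '%%' -> literal percent sign
            out.append('%')
            continue
        arg = pending.pop(0)
        if conv == 's' and isinstance(arg, (bytes, bytearray)):
            arg = bytes(arg).decode('utf-8', 'surrogateescape')
        # Delegate the actual formatting of this one specifier to Python.
        out.append(m.group(0) % (arg,))
    out.append(fmt[pos:])
    return ''.join(out)
```

For example `utf8_mod('%s=%d', (b'\xce\xb2', 3))` gives `'β=3'`, with flags and
width still handled by the builtin formatting of each individual specifier.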

390fd810

golang_str: Teach bstr/ustr to stringify bytes as UTF-8 bytestrings even inside containers · ddf6958b

Kirill Smelkov authored Oct 08, 2022

bstr/ustr constructors either convert or stringify their argument. For
example bstr(u'α') gives b('α') while bstr(1) gives b('1'). And if the
argument is bytes, bstr treats it as UTF-8 encoded bytestring:

    >>> x = u'β'.encode()
    >>> x
    b'\xce\xb2'
    >>> bstr(x)
    b('β')

however if that same bytes argument is placed inside container - e.g. inside
list - currently it is not stringified as bytestring:

    >>> bstr([x])
    b("[b'\\xce\\xb2']")	<-- NOTE not b("['β']")

which is not consistent with our intended approach that bstr/ustr treat
bytes in their arguments as UTF-8 encoded strings.

This happens because when a list is stringified, list.__str__
implementation goes through its items and invokes __repr__ of each
item. And in general a container might be arbitrarily deep, e.g. dict
-> list -> list -> bytes, and even when stringifying that deep dict, we
want to handle that leaf bytes as UTF-8 encoded string.

There are many containers in Python - lists, tuples, dicts,
collections.OrderedDict, collections.UserDict, collections.namedtuple,
collections.defaultdict, etc, and also there are many user-defined
containers - including implemented at C level - which we can not even
know all in advance.

It means that we cannot do some, probably deep/recursive typechecking,
inside bstringify and implement kind of parallel stringification of
arbitrary complex structure with adjustment to stringification of bytes.
We cannot also create object clone - for stringification - with bytes
instances replaced with str (e.g. via DeepReplacer - see recent previous
patch), and then stringify the clone. That would generally be incorrect,
because in this approach we cannot know whether an object is being
stringified as it is, or whether it is being used internally for data
storage and is not stringified directly. In the latter case if we
replace bytes with unicode, it might break internal invariant of custom
container class and break its logic.

What we can do however, is to hook into bytes.__repr__ implementations,
and to detect - if this implementation is called from under bstringify -
then we know we should adjust it and treat this bytes as bytestring.
Else - use original bytes.__repr__ implementation. This way we can handle
arbitrary complex data structures.

Hereby patch implements that approach for bytes, unicode on py2, and for
bytearray. See added comments that start with

    # patch bytes.{__repr__,__str__} and ...

for details.
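
The "detect whether we are called from under bstringify" idea can be sketched
in pure Python as follows. Builtin bytes.__repr__ cannot be replaced from pure
Python, so this sketch uses a hypothetical `Blob` class whose __repr__ we are
free to define; `stringify` and `_state` are likewise illustrative names.

```python
import threading
from contextlib import contextmanager

_state = threading.local()


@contextmanager
def _stringifying():
    # Mark the current thread as being inside stringify().
    prev = getattr(_state, 'active', False)
    _state.active = True
    try:
        yield
    finally:
        _state.active = prev


class Blob:
    """Hypothetical stand-in for bytes with a context-sensitive repr."""

    def __init__(self, data):
        self.data = bytes(data)

    def __repr__(self):
        if getattr(_state, 'active', False):
            # Called from under stringify(): show data as UTF-8 text.
            return repr(self.data.decode('utf-8', 'surrogateescape'))
        # Regular context: keep the ordinary repr.
        return 'Blob(%r)' % (self.data,)


def stringify(obj):
    with _stringifying():
        return str(obj)
```

With this, `stringify([Blob(b'\xce\xb2')])` gives `"['β']"` however deep the
container nesting is, while plain `str()` of the same list is unaffected.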

After this patch stringification of bytes inside containers treats them
as UTF-8 bytestrings:

    >>> bstr([x])
    b("['β']")

ddf6958b

golang_str: bstr/ustr string methods · ff24be3d

Kirill Smelkov authored Oct 07, 2022

Take all str/unicode methods, such as .capitalize(), .split(), .join(),
etc, and implement them for bstr/ustr. For example bstr.split() behaves
like unicode.split(), but returns list of bstr instead of list of
unicode. And similarly for all other methods.

Organize testing of this via verifying every method behaviour on both
unicode and bstr/ustr. If the results match modulo deep replacement of
unicode with bstr/ustr - everything is ok.

ff24be3d

golang_str: tests: Deep replacer · 2c20c055

Kirill Smelkov authored Oct 07, 2022

deepReplace returns object's clone with replacing all internal objects
selected by predicate via provided replacement function. We will use
this functionality in the following patches to organize testing of
bstr/ustr methods: a method would be first invoked on regular str, and
then on bstr/ustr and the result will be compared against each other.
The results are usually different, because e.g. u'a b c'.split() returns
[u'a', u'b', u'c'] while b('a b c').split() should return
[b('a'), b('b'), b('c')]. We want to make sure that the second result is
exactly the first result with all instances of unicode replaced by bstr.
That's where deep replacer will be used.

The deep replacement itself is implemented via pickle reduce/rebuild
protocol: we unassemble and reconstruct objects. And while an object is
unassembled, we try to apply the replacement recursively. Since this is
not so trivial functionality, it itself also comes with a test.
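
For illustration only, here is a much simplified deep replacer. Instead of the
pickle reduce/rebuild protocol described above, it recurses over a few builtin
container types, so it does not handle arbitrary objects. The names
`deep_replace`, `B` and `is_plain_str` are hypothetical.

```python
def deep_replace(obj, pred, repl):
    """Return a copy of obj where every inner object selected by pred
    is replaced with repl(that object)."""
    if pred(obj):
        return repl(obj)
    if isinstance(obj, dict):
        return {deep_replace(k, pred, repl): deep_replace(v, pred, repl)
                for k, v in obj.items()}
    if isinstance(obj, (list, tuple, set, frozenset)):
        return type(obj)(deep_replace(x, pred, repl) for x in obj)
    return obj


class B(str):
    """Hypothetical marker type standing in for bstr."""


def is_plain_str(x):
    return type(x) is str


expected = deep_replace(u'a b c'.split(), is_plain_str, B)
nested = deep_replace({'k': [('x', 1)]}, is_plain_str, B)
```

`expected` has the same items as `u'a b c'.split()`, but each item is now of
type `B`, which is the kind of comparison the method tests rely on.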

2c20c055

golang_str: Fix bstr.tp_print(flags=print_repr) · 510cf8d1

Kirill Smelkov authored Oct 07, 2022

On py2 objects are printed via their .tp_print slot with flags=0, which
requests repr (contrary to Py_PRINT_RAW which requests to print str -
https://docs.python.org/2.7/c-api/object.html#c.PyObject_Print)

We were not handling repr'ing inside our tp_print implementation, and
as the result e.g. b('мир') was printed on interactive console as
'\xd0\xbc\xd0\xb8\xd1\x80' instead of b('мир').

Fix it.

510cf8d1

golang_str: bstr/ustr repr · 386844d3

Kirill Smelkov authored Oct 07, 2022

Teach bstr/ustr to provide repr of themselves: it goes as b(...) and
u(...) where ... stands for human-readable repr of contained data.
Human-readable means that non-ascii printable unicode characters are
shown as-is instead of escaping them, for example:

    >>> x = u'αβγ'
    >>> x
    'αβγ'
    >>> y = b(x)
    >>> y
    b('αβγ')				<-- NOTE not b(b'\xce\xb1\xce\xb2\xce\xb3')
    >>> x.encode('utf-8')
    b'\xce\xb1\xce\xb2\xce\xb3'

386844d3

strconv, golang_str: Switch quote, unquote and qq to always return bstr · 604a7765

Kirill Smelkov authored Oct 07, 2022

bstr is becoming the default pygolang string type. And it can be mixed
ok with all bytes/unicode and ustr. Previously e.g. strconv.quote was
checking which kind of type its input was and was trying to return the
result of the same type. Now this becomes unnecessary since bstr is
intended to be used universally and interoperable with all other string
types.

604a7765

golang_str: bstr/ustr support for + and * · bbbb58f0

Kirill Smelkov authored Oct 07, 2022

Add support for +, *, += and *= operators to bstr and ustr.

For * rhs should be integer and the result, similarly to std strings, is
lhs repeated rhs times.

For + the other argument could be any supported string - bstr/ustr /
unicode/bytes/bytearray. And the result is always bstr or ustr:

    u()   +     *     ->  u()
    b()   +     *     ->  b()
    u''   +  u()/b()  ->  u()
    u''   +  u''      ->  u''
    b''   +  u()/b()  ->  b()
    b''   +      b''  ->  b''
    barr  +  u()/b()  ->  barr

in particular if lhs is bstr or ustr, the result will remain exactly of
original lhs type. This should be handy when one has e.g. bstr at hand
and wants to incrementally append something to it.

And if lhs is bytes/unicode, but we append bstr/ustr to it, we "upgrade"
the result to bstr/ustr correspondingly. Only if lhs is bytearray does
the result stay bytearray, because it is logical for the object being
appended to to remain mutable if it was mutable in the beginning.

As before, bytearray.__add__ and friends need to be patched a bit for
bytearray not to reject ustr.
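
The rules for bstr on the left and for bytes/unicode on the left can be
sketched in pure Python 3 like this. `BStr` is a hypothetical stand-in, not the
real bstr, and the bytearray-on-the-left case is not covered by the sketch.

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr: bytes that mix with str on + and *."""

    @staticmethod
    def _coerce(x):
        # Convert any supported string type to plain bytes, else None.
        if isinstance(x, str):
            return x.encode('utf-8', 'surrogateescape')
        if isinstance(x, (bytes, bytearray)):
            return bytes(x)
        return None

    def __add__(self, other):
        rhs = self._coerce(other)
        if rhs is None:
            return NotImplemented
        return BStr(bytes(self) + rhs)

    def __radd__(self, other):
        lhs = self._coerce(other)
        if lhs is None:
            return NotImplemented
        return BStr(lhs + bytes(self))

    def __mul__(self, n):
        return BStr(bytes.__mul__(self, n))

    __rmul__ = __mul__


acc = BStr(b'a')
acc += u'β'                  # lhs is BStr -> result stays BStr
upgraded = b'x' + BStr(b'y') # lhs is bytes -> result is "upgraded" to BStr
```

Since plain bytes and str implement + only as sequence concatenation, Python
gives `BStr.__radd__` a chance when a `BStr` is the right operand, which is what
makes the "upgrade" work in the sketch without patching builtin types.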

bbbb58f0

golang_str: bstr/ustr pickle support · ebd18f3f

Kirill Smelkov authored Oct 07, 2022

Without explicitly overriding __reduce_ex__ pickling was failing for
protocols < 2:

    _________________________ test_strings_pickle __________________________

        def test_strings_pickle():
            bs = b("мир")
            us = u("май")

            #from pickletools import dis
            for proto in range(0, pickle.HIGHEST_PROTOCOL):
    >           p_bs = pickle.dumps(bs, proto)

    golang/golang_str_test.py:282:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

    self = b'\xd0\xbc\xd0\xb8\xd1\x80', proto = 0

        def _reduce_ex(self, proto):
    >       assert proto < 2
    E       RecursionError: maximum recursion depth exceeded in comparison

    /usr/lib/python3.9/copyreg.py:56: RecursionError

See added comments for details.

ebd18f3f

golang_str: bstr/ustr iteration · a72c1c1a

Kirill Smelkov authored Oct 07, 2022

Even though bstr is semantically array of bytes, while ustr is array of
unicode characters, iterating them _both_ yields unicode characters.
This goes in line with Go approach described in "Strings, bytes, runes
and characters in Go"[1] and allows for both ustr _and_ bstr to be used
as strings in unicode world.
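
The difference from plain py3 bytes iteration can be illustrated with a small
sketch; `iter_chars` is a hypothetical helper, not a pygolang function:

```python
def iter_chars(data):
    """Yield unicode characters of UTF-8 bytestring `data`."""
    for ch in bytes(data).decode('utf-8', 'surrogateescape'):
        yield ch


raw = u'αβγ'.encode('utf-8')
as_ints = list(raw)                 # plain py3 bytes iteration: integers
as_chars = list(iter_chars(raw))    # Go-like iteration: unicode characters
```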

Even though this diverges (just a bit) from str/py2 behaviour, and
diverges more from bytes/py3 behaviour, I have not hit any problem in
practice due to this divergence. In other words the semantics of
bytestring used in Go - to iterate them as unicode characters - is
sound. For the reference it is the authors of Go who originally invented
UTF-8 - see [2] for details.

See also [3] for our discussion with Jérome on this topic.

[1] https://blog.golang.org/strings
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
[3] nexedi/zodbtools!13 (comment 81646)

a72c1c1a

golang_str: bstr/ustr index access · 04be919b

Kirill Smelkov authored Oct 07, 2022

Implement access to bstr/ustr by [index] and by slice. Result of such
[index] access - similarly to standard str - returns the same bstr/ustr
type with one character:

  - ustr[i] returns ustr with one unicode character taken from i'th character of original string, while
  - bstr[i] returns bstr with one byte taken from i'th byte of original bytestring.

This follows str/unicode semantics on both py2/py3, bytes semantics on
py2, but diverges from bytes semantics on py3. I originally tried to
follow bytes/py3 semantics - for bstr to return an integer instead of
1-byte character, but later found several compatibility breakages due to
it. I contemplated this divergence for a long time and finally
decided to follow string semantics for both ustr and bstr. This
preserves backward compatibility with Python2 and also allows for bstr
to be practically drop-in replacement for str type.

To get an ordinal corresponding to retrieved character, one can use
standard `ord`, e.g. as in `ord(bstr[i])`. This will always return an
integer for all bstr/ustr/str/unicode. Similarly to standard `chr` and
`unichr`, we also provide two utility functions - `uchr` and `bbyte` to
create 1-character and 1-byte ustr/bstr correspondingly.
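
A minimal Python 3 sketch of the chosen index semantics, with a hypothetical
`BStr` class standing in for bstr:

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr with string-like index access."""

    def __getitem__(self, idx):
        x = bytes.__getitem__(self, idx)
        if isinstance(idx, slice):
            return BStr(x)              # [i:j] and [i:j:k] -> BStr
        return BStr(bytes((x,)))        # [i] -> 1-byte BStr, not an integer


bs = BStr(u'αβγ'.encode('utf-8'))
first = bs[0]
code = ord(first)
```

`first` is a 1-byte `BStr` and `ord` turns it back into an integer, so code
that needs the ordinal keeps working on all string types.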

04be919b

golang_str: Add test for memoryview(bstr) · 105d03d4

Kirill Smelkov authored Oct 07, 2022

Verify that it works as expected, and that memoryview(ustr) is rejected,
because ustr is semantically array of unicode characters, not bytes.

No change to the code - just add tests for current status which is
already working as expected.

105d03d4

golang_str: Teach b/u to accept objects with buffer interface · d7e55bb0

Kirill Smelkov authored Oct 07, 2022

And to convert them to bstr/ustr decoding buffer data as if it was
bytes. This is needed if e.g. we have data in mmap or numpy.ndarray, and
want to convert the data to string. The conversion is always explicit via
explicit call to b/u. And for bstr/ustr constructors, we preserve their
behaviour to match unicode constructor not to convert automatically, but
instead to stringify the object, e.g. as shown below:

    In [1]: bdata = b'hello 123'

    In [2]: mview = memoryview(bdata)

    In [3]: str(mview)
    Out[3]: '<memory at 0x7fb226b26700>'	# NOTE _not_ b'hello 123'
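
The distinction between stringifying a buffer object and explicitly converting
its data can be sketched with standard Python 3 only; `text_from_buffer` is a
hypothetical helper that plays the role of an explicit u() call:

```python
def text_from_buffer(obj):
    """Decode data of any object with buffer interface as UTF-8 text."""
    return memoryview(obj).tobytes().decode('utf-8', 'surrogateescape')


bdata = b'hello 123'
mview = memoryview(bdata)

stringified = str(mview)                # '<memory at 0x...>'
converted = text_from_buffer(mview)     # 'hello 123'
from_bytearray = text_from_buffer(bytearray(b'hi'))
```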

d7e55bb0

golang_str: Treat bytearray also as bytestring, just mutable · e4d5cb21

Kirill Smelkov authored Oct 07, 2022

bytearray was introduced in Python as a mutable version of bytes. It has
all string methods (e.g. .capitalize(), .islower(), etc), and it also
supports % formatting. In other words it has all attributes of being a
byte-string, with the only difference from bytes in that bytearray is
mutable. This makes bytearray handy to have when a string is
being incrementally constructed step by step without hitting overhead of
many bytes objects creation/destruction.

So, since bytearray is also a bytestring, similarly to bytes, let's add
support to interoperate with bytearray to bstr and ustr:

- b/u and bstr/ustr now accept bytearray as argument and treat it as bytestring.
- bytearray() constructor, similarly to bytes() and unicode()
  constructors, now also accepts bstr/ustr and creates bytearray object
  corresponding to byte-stream of input.

For the latter point to work we need to patch bytearray.__init__() a bit,
since, contrary to bytes.__init__(), it does not pay attention to
whether provided argument has __bytes__ method or not.
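
The asymmetry between the two constructors can be observed on Python 3 with a
hypothetical `Text` class that provides only __bytes__:

```python
class Text:
    """Hypothetical string-like object that only provides __bytes__."""

    def __init__(self, s):
        self.s = s

    def __bytes__(self):
        return self.s.encode('utf-8')


t = Text(u'β')
as_bytes = bytes(t)                 # bytes() honours __bytes__

try:
    bytearray(t)                    # bytearray() does not look at __bytes__
    accepted = True
except TypeError:
    accepted = False

# Without patching, going through bytes() explicitly is a workaround.
as_bytearray = bytearray(bytes(t))
```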

e4d5cb21

golang_str: Implement bstr/ustr constructors · 781802d4

Kirill Smelkov authored Oct 06, 2022

Both bstr and ustr constructors mimic constructor of unicode(= str on py3) -
an object is either stringified, or decoded if it provides buffer
interface, or the constructor is invoked with optional encoding and
errors argument:

    # py2
    class unicode(basestring)
     |  unicode(object='') -> unicode object
     |  unicode(string[, encoding[, errors]]) -> unicode object

    # py3
    class str(object)
     |  str(object='') -> str
     |  str(bytes_or_buffer[, encoding[, errors]]) -> str

Stringification of all bstr/ustr / unicode/bytes is handled
automatically with the meaning to convert to created type via b or u.

We follow unicode semantic for both ustr _and_ bstr, because bstr/ustr
are intended to be used as strings.

781802d4

golang_str: Teach bstr/ustr to compare wrt any string with automatic coercion · 54c2a3cf

Kirill Smelkov authored Oct 05, 2022

So that e.g. `bstr == <any string type>` works. We want `bstr == ustr`
to work because we intend those types to be interoperable. We also want
e.g. `bstr == "a_string"` to work because we want bstr to be
interoperable with standard strings. In general we want to have full
automatic interoperability with all string types, so that e.g. `bstr == X`
works for X being any of bstr, ustr, unicode, bytes (and later bytearray).

For now we add support only for comparison operators. But later, we
will be adding support for e.g. +, string methods, etc - and in all
those operations we will be following the same approach: to have
automatic interoperability with all string types out of the box.

The text added to README reflects this.

The patch to unicode.tp_richcompare on py2 illustrates our approach to
adjust builtin types when absolutely needed. In this particular case
original builtin unicode.__eq__(unicode, bstr) is always returning False
for non-ASCII bstr even despite bstr having .__unicode__() method. Our
adjustment is non-intrusive - we adjust unicode behaviour only wrt bstr
and it stays exactly the same as before wrt all other types.

We anyway do that with care and add a test that verifies that behaviour
of what we patched stays unaffected when used outside of bstr/ustr
context.
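
On Python 3 the comparison side of this can be sketched without patching any
builtin type, because str.__eq__ returns NotImplemented for non-str operands
and Python then tries the reflected comparison. `BStr` below is a hypothetical
stand-in for bstr:

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr that compares equal to any string type."""

    @staticmethod
    def _coerce(x):
        if isinstance(x, str):
            return x.encode('utf-8', 'surrogateescape')
        if isinstance(x, (bytes, bytearray)):
            return bytes(x)
        return None

    def __eq__(self, other):
        b = self._coerce(other)
        if b is None:
            return NotImplemented
        return bytes.__eq__(self, b)

    def __ne__(self, other):
        eq = self.__eq__(other)
        return eq if eq is NotImplemented else not eq

    # Defining __eq__ would otherwise reset __hash__ to None.
    __hash__ = bytes.__hash__


bs = BStr(u'β'.encode('utf-8'))
```

Both `bs == u'β'` and `u'β' == bs` are true here, and comparison with
non-string objects such as integers still falls back to the default result.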

54c2a3cf

golang_str: Infrastructure to patch builtin types · 34667355

Kirill Smelkov authored Oct 06, 2022

_patch_slot(typ, slotname, func) installs func into typ's
dict[slotname]. For example in the next patch we will need to adjust
unicode.__eq__ on py2 so that it no longer rejects bstr by always assuming
that `unicode == bstr` is False. We will do it via patching unicode.__eq__
to first check whether rhs is bstr and to handle that case with our code,
while falling back to original unicode.__eq__ for all other types.

34667355

golang_str: Refresh b/u and bstr/ustr docstrings · 88b21b40

Kirill Smelkov authored Oct 05, 2022

Document explicitly which types b/u accept and how they are handled.
Change bstr/ustr docstrings to also be more explicit.

Documentation changes only.

88b21b40

golang_str: Make bytes(bstr) -> bstr, unicode(ustr) -> ustr · b7cda092

Kirill Smelkov authored Oct 05, 2022

In other words casting to bytes/unicode preserves pygolang string to
remain pygolang string.

Without the changes to bstr/ustr added test fails as e.g.

    >       assert bytes  (bs) is bs
    E       AssertionError: assert b'\xd0\xbc\xd0\xb8\xd1\x80' is b'\xd0\xbc\xd0\xb8\xd1\x80'
    E        +  where b'\xd0\xbc\xd0\xb8\xd1\x80' = bytes(b'\xd0\xbc\xd0\xb8\xd1\x80')

in other words bytes(bstr) was creating a copy and changing type to bytes.
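
On Python 3 the identity-preserving behaviour can be sketched with the
__bytes__ and __str__ hooks. `BStr`, `UStr` and `PlainB` are hypothetical
classes, and the real bstr/ustr may achieve this differently:

```python
class BStr(bytes):
    def __bytes__(self):
        return self             # bytes(BStr) -> the very same object


class UStr(str):
    def __str__(self):
        return self             # str(UStr) -> the very same object


class PlainB(bytes):
    pass                        # no override: bytes() copies into plain bytes


bs = BStr(b'\xd0\xbc\xd0\xb8\xd1\x80')
us = UStr(u'мир')
```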

b7cda092

golang_str: Extend tests a bit · 85c4615d

Kirill Smelkov authored Oct 05, 2022

Extend current coverage for b/u tests more explicitly verifying
resulting type (`type(·) is ...` instead of `isinstance(·, ...)`),
verifying unicode(bstr)->ustr and bytes(ustr)->bstr, and str() of both
bstr and ustr.

Move the check for "no custom attributes" from test_qq to generic
test_strings_basic, because now verified string types are publicly
accessible, not only via qq.

Small cosmetics in benchmarks - by reusing hereby introduced xbytes()
utility.

No change for the code itself - the tests just add verification to
current status.

85c4615d

08 Oct, 2022 1 commit

golang_str: Start exposing Pygolang string types publicly · 1f99393d

Kirill Smelkov authored Oct 05, 2022

In 2020 in edc7aaab (golang: Teach qq to be usable with both bytes and
str format whatever type qq argument is) I added custom bytes- and
unicode- like types for qq to return instead of str with the idea for
qq's result to be interoperable with both bytes and unicode. Citing that patch:

    qq is used to quote strings or byte-strings. The following example
    illustrates the problem we are currently hitting in zodbtools with
    Python3:

        >>> "hello %s" % qq("мир")
        'hello "мир"'

        >>> b"hello %s" % qq("мир")
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

        >>> "hello %s" % qq(b("мир"))
        'hello "мир"'

        >>> b"hello %s" % qq(b("мир"))
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'

    i.e. one way or another if type of format string and what qq returns do not
    match it creates a TypeError.

    We want qq(obj) to be useable with both string and bytestring format.

    For that let's teach qq to return special str- and bytes- derived types that
    know how to automatically convert to str->bytes and bytes->str via b/u
    correspondingly. This way formatting works whatever types combination it was
    for format and for qq, and the whole result has the same type as format.

    For now we teach only qq to use new types and don't generally expose
    _str and _unicode to be returned by b and u yet. However we might do so
    in the future after incrementally gaining a bit more experience.

So two years later I gained that experience and found that having string
type, that can interoperate with both bytes and unicode, is generally
useful. It is useful for practical backward compatibility with Python2
and for simplicity of programming avoiding constant stream of
encode/decode noise. Thus the day to expose Pygolang string types for
general use has come.
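
The basic mechanism that lets one object be used with both str and bytes
formats can be sketched on Python 3 as a str subclass that also provides
__bytes__. `QStr` is a hypothetical name, and the real types do more than
this:

```python
class QStr(str):
    """Hypothetical str subclass that can also be formatted into bytes."""

    def __bytes__(self):
        return self.encode('utf-8')


q = QStr(u'"мир"')
as_text = u'hello %s' % q       # str format  -> str result
as_bytes = b'hello %s' % q      # bytes format -> bytes result, via __bytes__
```

In both cases the result has the type of the format string, and no TypeError
is raised.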

This patch does the first small step: it exposes bytes- and unicode-
like types (now named as bstr and ustr) publicly. It switches b and u to
return bstr and ustr correspondingly instead of bytes and unicode. This
is a change in behaviour, but hopefully it should not break anything as
there are not many b/u users currently and bstr and ustr are intended to
be drop-in replacements for standard string types.

Next patches will enhance bstr/ustr step by step so that they actually
become drop-in replacements for standard string types.

See nexedi/zodbtools!13 (comment 81646)
for preliminary discussion from 2019.

See also "Python 3 Losses: Nexedi Perspective"[1] and associated "cost
overview"[2] for related presentation by Jean-Paul from 2018.

[1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20

1f99393d

05 Oct, 2022 2 commits

py.bench: Automatically discover benchmarks in test files · ffb40903

Kirill Smelkov authored Oct 04, 2022

Since the beginning (9bf03d9c "py.bench: New command to benchmark python
code similarly to `go test -bench`") py.bench was automatically
discovering benchmarks in bench_*.py files only. This was inherited from
wendelin.core which keeps its benchmarks in those files.

However in pygolang, following Go convention(*), we already have several
benchmarks that reside together with tests in same *_test.py files. And
currently just running py.bench does not discover them.

-> Let's fix this and teach py.bench to automatically discover
benchmarks in the test files by default as well.

Pytest's default is to look for tests in test_*.py and *_test.py (+).
Add those patterns and also keep bench_*.py for backward compatibility.
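
The effect of the three patterns on file selection can be illustrated with the
standard fnmatch module. This only demonstrates the glob patterns themselves;
how py.bench passes them to pytest is not shown here, and `is_candidate` is a
hypothetical helper:

```python
from fnmatch import fnmatch

# Pytest's default test-file patterns plus the original benchmark pattern.
PATTERNS = ('test_*.py', '*_test.py', 'bench_*.py')


def is_candidate(filename):
    """Tell whether a file name matches any of the discovery patterns."""
    return any(fnmatch(filename, p) for p in PATTERNS)
```

For example `golang_str_test.py` and `bench_foo.py` are candidates, while
`strconv.py` is not.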

Before this patch running py.bench inside pygolang repository does not
run any benchmark at all. After the patch py.bench runs all the
benchmarks by default:

    (z-dev) kirr@deca:~/src/tools/go/pygolang$ py.bench
    ========================= test session starts ==========================
    platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
    rootdir: /home/kirr/src/tools/go/pygolang
    plugins: timeout-1.4.2, profiling-1.7.0, mock-2.0.0
    collected 18 items

    pymod: golang/golang_str_test.py
    Benchmarkstddecode              2000000 0.756 µs/op
    Benchmarkudecode                20000   74.359 µs/op
    Benchmarkstdencode              3000000 0.327 µs/op
    Benchmarkbencode                40000   32.613 µs/op

    pymod: golang/golang_test.py
    Benchmarkpyx_select_nogil       500000  2.051 µs/op
    Benchmarkpyx_go_nogil           90000   12.177 µs/op
    Benchmarkpyx_chan_nogil         600000  1.826 µs/op
    Benchmarkgo                     80000   13.267 µs/op
    Benchmarkchan                   500000  2.076 µs/op
    Benchmarkselect                 300000  3.835 µs/op
    Benchmarkdef                    30000000        0.035 µs/op
    Benchmarkfunc_def               40000   29.387 µs/op
    Benchmarkcall                   30000000        0.043 µs/op
    Benchmarkfunc_call              2000000 0.819 µs/op
    Benchmarktry_finally            20000000        0.096 µs/op
    Benchmarkdefer                  600000  1.755 µs/op

    pymod: golang/sync_test.py
    Benchmarkworkgroup_empty        40000   25.807 µs/op
    Benchmarkworkgroup_raise        40000   31.637 µs/op                     [100%]

    =========================== warnings summary ===========================

(*) see https://pkg.go.dev/cmd/go#hdr-Test_packages
(+) see https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_files

/reviewed-by @jerome
/reviewed-on nexedi/pygolang!20

ffb40903

golang_str: Speedup utf-8 decoding a bit on py2 · 9cb7b210

Kirill Smelkov authored Oct 04, 2022

We recently moved our custom UTF-8 encoding/decoding routines to Cython.
Now we can start taking speedup advantage on C level to make our own
UTF-8 decoder a bit less horribly slow on py2:

    name       old time/op  new time/op  delta
    stddecode   752ns ± 0%   743ns ± 0%   -1.19%  (p=0.000 n=9+10)
    udecode     216µs ± 0%    75µs ± 0%  -65.19%  (p=0.000 n=9+10)
    stdencode   328ns ± 2%   327ns ± 1%     ~     (p=0.252 n=10+9)
    bencode    34.1µs ± 1%  32.1µs ± 1%   -5.92%  (p=0.000 n=10+10)

So it is ~ 3x speedup for u(), but still significantly slower compared
to std unicode.decode('utf-8').

Only the low-hanging fruit is picked here: _utf8_decode_rune is made a bit
faster, since it sits in the innermost loop. In the future
_utf8_decode_surrogateescape might be reworked as well to avoid
constructing resulting unicode via py-level list of py-unicode character
objects. And similarly for _utf8_encode_surrogateescape.

On py3 the performance of std and u/b decode/encode is approximately the same.

/trusted-by @jerome
/reviewed-on nexedi/pygolang!19

9cb7b210

04 Oct, 2022 4 commits

golang_str,strconv: Fix decoding of rune-error · 598eb479

Kirill Smelkov authored Oct 03, 2022

Error rune (U+FFFD) is returned by _utf8_decode_rune to indicate an
error in decoding. But the error rune itself is a valid unicode codepoint:

   >>> x = u"�"
   >>> x
   u'\ufffd'
   >>> x.encode('utf-8')
   '\xef\xbf\xbd'

This way only (r=_rune_error, size=1) should be treated by the caller as
utf8 decoding error.

But e.g. strconv.quote was not careful to also inspect the size, and this way
was quoting � into just "\xef" instead of "\xef\xbf\xbd".
_utf8_decode_surrogateescape was also subject to similar error.

-> Fix it.

Without the fix, e.g. the added test for strconv.quote fails as:

    >           assert quote(tin) == tquoted
    E           assert '"\xef"' == '"�"'
    E             - "\xef"
    E             + "�"
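
The size check described above can be sketched as follows. This is a
minimal Python 3 illustration, not the code from the patch: decode_rune
and quote_bytes are hypothetical stand-ins for _utf8_decode_rune and
strconv.quote, and the quoting is simplified (no escaping of quotes or
control characters).

```python
RUNE_ERROR = 0xFFFD  # U+FFFD: a valid codepoint that also signals a decoding error


def decode_rune(data):
    """Decode the first UTF-8 rune of bytes `data` -> (rune, size).

    Hypothetical stand-in for _utf8_decode_rune.
    Invalid input is reported as (RUNE_ERROR, 1).
    """
    if len(data) == 0:
        return RUNE_ERROR, 0
    for size in (1, 2, 3, 4):
        try:
            ch = data[:size].decode('utf-8')
        except UnicodeDecodeError:
            continue  # prefix is truncated or invalid, try a longer one
        return ord(ch), size
    return RUNE_ERROR, 1


def quote_bytes(data):
    """Simplified quoting: only invalid bytes are escaped as \\xNN."""
    out = []
    while data:
        r, size = decode_rune(data)
        if r == RUNE_ERROR and size == 1:
            # a real decoding error: escape exactly one byte
            out.append('\\x%02x' % data[0])
        else:
            # a valid rune, possibly U+FFFD itself encoded in 3 bytes
            out.append(chr(r))
        data = data[size:]
    return '"' + ''.join(out) + '"'
```

Checking only r == RUNE_ERROR, without the size, would treat the valid
3-byte encoding of U+FFFD as an error and escape just its first byte,
which is the failure shown in the test output above.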

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18

598eb479

golang_str: Move py3/py2 conditioning into _utf8_{encode,decode}_surrogateescape · ea5abe71

Kirill Smelkov authored Oct 03, 2022

So that those routines could be just called and do what is expected
without the caller caring whether it is py2 or py3. We will soon need to
use those routines from several callsites, and having that py2/py3
conditioning being spread over all usage places would be inconvenient.
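
The idea of keeping the version check inside the routines can be
sketched as below. This is an illustration under assumptions, not the
code from the patch: the function names are hypothetical, and the py2
branch is only stubbed out, since py2 has no built-in 'surrogateescape'
error handler and needs a custom decoder/encoder there.

```python
import sys

PY3 = sys.version_info[0] >= 3


def utf8_decode_surrogateescape(data):
    """bytes -> unicode; invalid bytes become lone surrogates U+DC80..U+DCFF."""
    if PY3:
        return data.decode('utf-8', 'surrogateescape')
    # py2: a hand-written decoder would go here (not shown in this sketch)
    raise NotImplementedError("custom decoder needed on py2")


def utf8_encode_surrogateescape(text):
    """unicode -> bytes; lone surrogates U+DC80..U+DCFF become the original bytes."""
    if PY3:
        return text.encode('utf-8', 'surrogateescape')
    # py2: a hand-written encoder would go here (not shown in this sketch)
    raise NotImplementedError("custom encoder needed on py2")
```

Callers then just call the two functions and never look at the Python
version themselves.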

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18

ea5abe71

strconv: Move functionality related to UTF8 encode/decode into _golang_str · 50b8cb7e

Kirill Smelkov authored Oct 03, 2022

- Move _utf8_decode_rune, _utf8_decode_surrogateescape, _utf8_encode_surrogateescape out from strconv into _golang_str
- Factor _bstr/_ustr code into pyb/pyu. _bstr/_ustr become plain wrappers over pyb/pyu.
- Work around the resulting golang ↔ strconv dependency with an at-runtime import.

Moved routines belong to the main part of golang strings processing
-> their home should be in _golang_str.pyx

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18

50b8cb7e

golang: Move strings-related code to _golang_str "submodule" · e72a459f

Kirill Smelkov authored Oct 03, 2022

We are going to significantly extend py-strings related functionality soon
- to the point where the amount of strings-related code will be
approximately the same as the amount of all other python-related code
inside the golang module.

-> First move everything related to py strings to dedicated
_golang_str.pyx as a preparatory step.

Keep that new file included from _golang.pyx instead of being real new
module, because we want strings functionality to be provided by golang
main namespace itself, and to ease internal code interdependencies.

Plain code movement.

/reviewed-by @jerome
/reviewed-at nexedi/pygolang!18

e72a459f

26 Jan, 2022 6 commits

pygolang v0.1 · 7b72d418
Kirill Smelkov authored Jan 26, 2022

7b72d418

golang: Fix print(_pystr) · 08dc5d10

Kirill Smelkov authored Jan 24, 2022

On Python2 without .tp_print printing _pystr crashes as:

    pygolang$ ./golang/testprog/golang_test_str.py
    Traceback (most recent call last):
      File "./golang/testprog/golang_test_str.py", line 39, in <module>
        main()
      File "./golang/testprog/golang_test_str.py", line 34, in main
        print("print(qq(b)):", qq(sb))
    RuntimeError: print recursion

See added comments for details.

08dc5d10

os += ReadFile · 2a35ef5b

Kirill Smelkov authored Jan 26, 2022

Add convenient utility to read whole file and return its content
similarly to Go. The code is taken from wendelin.core:

https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.cpp#L246-281
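
For illustration only, the Go-like calling convention (whole content
plus error returned together) might look like this when transcribed to
Python. The added utility itself is not shown in this message;
read_file below is a hypothetical analogue, not its API.

```python
import os
import tempfile


def read_file(path):
    """Read the whole file -> (data, err), in the style of Go's os.ReadFile.

    Hypothetical Python analogue; err is None on success.
    """
    try:
        with open(path, 'rb') as f:
            return f.read(), None
    except OSError as e:
        return b'', e


# usage: write a temporary file, read it back, then read it after removal
fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)
data, err = read_file(path)                # data is the whole content, err is None
os.remove(path)
missing, missing_err = read_file(path)     # the file is gone: error is returned, not raised
```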

2a35ef5b

Nogil signals · e18adbab

Kirill Smelkov authored Jan 24, 2022

Provide os/signal package that can be used to setup signal delivery to nogil
channels. This way for user code signal handling becomes regular handling of a
signalling channel instead of being something special or limited to only-main
python thread. The rationale for why we need it is explained below:

There are several problems with regular python's stdlib signal module:

1. Python2 does not call signal handler from under blocked lock.acquire.
This means that if the main thread is blocked waiting on a semaphore,
signal delivery will be delayed indefinitely, similarly to e.g. problem
described in nexedi/nxdtest!14 (comment 147527)
where raising KeyboardInterrupt is delayed after SIGINT for many,
potentially unbounded, seconds until ~semaphore wait finishes.

Note that Python3 does not have this problem wrt stdlib locks and
semaphores, but read below for the next point.

2. all pygolang communication operations (channels send/recv, sync.Mutex,
sync.RWMutex, sync.Sema, sync.WaitGroup, sync.WorkGroup, ...) run with
GIL released, but if blocked do not handle EINTR and do not schedule
python signal handler to run (on main thread).

Even if we could theoretically adjust this behaviour of pygolang at python
level to match Python3, there are also C++ and pyx/nogil worlds. And we want gil
and nogil worlds to interoperate (see https://pypi.org/project/pygolang/#cython-nogil-api),
so that e.g. if completely nogil code happens to run on the main thread,
signal handling is still possible, even if that signal handling was setup at
python level.

With signals delivered to nogil channels both nogil world and python
world can setup signal handlers and be notified of them regardless
of whether main python thread is currently blocked in nogil wait or not.
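
The "signal handling becomes regular handling of a channel" idea can be
illustrated with a stdlib-only analogy. This is not the os/signal
package added here: notify is a hypothetical helper, a queue stands in
for the channel, and the sketch still relies on a python-level handler
run on the main thread, so it keeps the very limitations listed above.
It only shows what the consumer side looks like. POSIX only (SIGUSR1).

```python
import os
import queue
import signal


def notify(q, *signums):
    """Route the given signals into queue q (hypothetical helper).

    Must be called from the main thread, as required by signal.signal.
    """
    def _handler(signum, frame):
        # SimpleQueue.put is reentrant, so it is safe to call from a handler
        q.put(signum)
    for s in signums:
        signal.signal(s, _handler)


sigq = queue.SimpleQueue()
notify(sigq, signal.SIGUSR1)

os.kill(os.getpid(), signal.SIGUSR1)   # send a signal to ourselves
received = sigq.get(timeout=5)         # consume it like any other message
```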

/reviewed-on nexedi/pygolang!17

e18adbab

golang: Provide __pystr internally · ce507f4e

Kirill Smelkov authored Jan 24, 2022

To convert an object to str of current python.
It will be handy to use __pystr when implementing __str__ methods.

/reviewed-on nexedi/pygolang!17

ce507f4e

Nogil IO · 4690460b

Kirill Smelkov authored Jan 24, 2022

Provide C++ package "os" with File, Pipe, etc similarly to what is
provided on Go side. The package works through IO methods provided by
runtimes.

We need IO facility because os/signal package will need to use
pipe in cooperative IO mode in its receiving-loop goroutine.

os.h and os.cpp are based on drafts from wendelin.core:

https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.h
https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.cpp

/reviewed-on !17

4690460b