1. 09 Oct, 2022 17 commits
    • golang_str: tests: Deep replacer · 2c20c055
      Kirill Smelkov authored
      deepReplace returns a clone of an object in which all internal objects
      selected by a predicate are replaced via the provided replacement
      function. We will use this functionality in the following patches to
      organize testing of bstr/ustr methods: a method will first be invoked on
      regular str, then on bstr/ustr, and the results will be compared against
      each other.
      The results are usually different, because e.g. u'a b c'.split() returns
      [u'a', u'b', u'c'] while b('a b c').split() should return
      [b('a'), b('b'), b('c')]. We want to make sure that the second result is
      exactly the first result with all instances of unicode replaced by bstr.
      That's where deep replacer will be used.
      
      The deep replacement itself is implemented via the pickle reduce/rebuild
      protocol: we disassemble and reconstruct objects, and while an object is
      disassembled, we apply the replacement recursively. Since this
      functionality is not trivial, it also comes with its own test.
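
      The reduce/rebuild idea can be illustrated with a much simplified sketch.
      This is not pygolang's deepReplace: the name deep_replace, the handling
      of attribute names, and the lack of support for cycles, __slots__ and
      custom __setstate__ are all assumptions of this example.

```python
def deep_replace(obj, predicate, replace):
    """Return a clone of obj with every inner object selected by predicate
    replaced by replace(inner). Simplified: no cycles, no __slots__."""
    def rec(x):
        return deep_replace(x, predicate, replace)

    if predicate(obj):
        return replace(obj)
    # Exact builtin containers are rebuilt element by element.
    if type(obj) in (list, tuple, set, frozenset):
        return type(obj)(rec(x) for x in obj)
    if type(obj) is dict:
        return {rec(k): rec(v) for k, v in obj.items()}
    # Atoms and callables (classes, functions) are returned as-is.
    if isinstance(obj, (str, bytes, int, float, complex, type(None))) or callable(obj):
        return obj

    # Everything else: disassemble via the pickle reduce protocol,
    # replace inside the pieces, and rebuild.
    rv = obj.__reduce_ex__(2)
    if isinstance(rv, str):             # reference to a global object
        return obj
    rv = tuple(rv) + (None,) * (5 - len(rv))
    func, args, state, listitems, dictitems = rv[:5]
    new = func(*[rec(a) for a in args])
    if state is not None:
        # Assumes dict state; attribute names are kept, only values are replaced.
        new.__dict__.update({k: rec(v) for k, v in state.items()})
    if listitems is not None:
        new.extend(rec(x) for x in listitems)
    if dictitems is not None:
        for k, v in dictitems:
            new[rec(k)] = rec(v)
    return new


class Point(object):
    def __init__(self, name, tags):
        self.name = name
        self.tags = tags

p = Point('a', ['b', ('c', {'d': 'e'})])
q = deep_replace(p, lambda x: isinstance(x, str), lambda s: s.encode('utf-8'))
```

      Here q is a new Point whose str attributes, including the nested ones,
      are replaced by their UTF-8 encoded bytes, while p stays untouched.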
      2c20c055
    • golang_str: Fix bstr.tp_print(flags=print_repr) · 510cf8d1
      Kirill Smelkov authored
      On py2 objects are printed via their .tp_print slot, with flags=0
      requesting repr (contrary to Py_PRINT_RAW, which requests to print str -
      https://docs.python.org/2.7/c-api/object.html#c.PyObject_Print)
      
      We were not handling repr'ing inside our tp_print implementation, and
      as a result e.g. b('мир') was printed on the interactive console as
      '\xd0\xbc\xd0\xb8\xd1\x80' instead of b('мир').
      
      Fix it.
      510cf8d1
    • golang_str: bstr/ustr repr · 386844d3
      Kirill Smelkov authored
      Teach bstr/ustr to provide repr of themselves: it goes as b(...) and
      u(...) where ... stands for the human-readable repr of the contained data.
      Human-readable means that printable non-ASCII unicode characters are
      shown as-is instead of being escaped, for example:
      
          >>> x = u'αβγ'
          >>> x
          'αβγ'
          >>> y = b(x)
          >>> y
          b('αβγ')				<-- NOTE not b(b'\xce\xb1\xce\xb2\xce\xb3')
          >>> x.encode('utf-8')
          b'\xce\xb1\xce\xb2\xce\xb3'
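
      A minimal Python 3 sketch of such a repr, assuming a toy bytes subclass
      named BStr (hypothetical, not pygolang's bstr) and assuming that invalid
      UTF-8 is kept via surrogateescape:

```python
class BStr(bytes):
    """Toy stand-in for bstr: bytes with a human-readable repr."""

    def __repr__(self):
        # Decode as UTF-8 and reuse the repr of str, which on Python 3 shows
        # printable non-ASCII characters as-is. Undecodable bytes become lone
        # surrogates, which the repr of str escapes as \udcXX.
        text = self.decode('utf-8', 'surrogateescape')
        return 'b(%r)' % text


y = BStr(u'αβγ'.encode('utf-8'))
```

      With this, repr(y) gives b('αβγ'), while repr of the same data held in
      plain bytes stays fully escaped.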
      386844d3
    • strconv, golang_str: Switch quote, unquote and qq to always return bstr · 604a7765
      Kirill Smelkov authored
      bstr is becoming the default pygolang string type, and it mixes well
      with bytes, unicode and ustr. Previously e.g. strconv.quote checked the
      type of its input and tried to return a result of the same type. This
      is no longer necessary, since bstr is intended to be used universally
      and to interoperate with all other string types.
      604a7765
    • golang_str: bstr/ustr support for + and * · bbbb58f0
      Kirill Smelkov authored
      Add support for +, *, += and *= operators to bstr and ustr.
      
      For * rhs should be an integer and the result, similarly to std strings,
      is lhs repeated rhs times.
      
      For + the other argument could be any supported string - bstr/ustr /
      unicode/bytes/bytearray. And the result is always bstr or ustr:
      
          u()   +     *     ->  u()
          b()   +     *     ->  b()
          u''   +  u()/b()  ->  u()
          u''   +  u''      ->  u''
          b''   +  u()/b()  ->  b()
          b''   +      b''  ->  b''
          barr  +  u()/b()  ->  barr
      
      in particular if lhs is bstr or ustr, the result will remain exactly of
      original lhs type. This should be handy when one has e.g. bstr at hand
      and wants to incrementally append something to it.
      
      And if lhs is bytes/unicode, but we append bstr/ustr to it, we "upgrade"
      the result to bstr/ustr correspondingly. Only if lhs is bytearray does
      the result stay bytearray, because it is logical for an object that was
      mutable in the beginning to remain mutable after appending.
      
      As before, bytearray.__add__ and friends need to be patched a bit for
      bytearray not to reject ustr.
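
      The result-type rules for the bytes side can be sketched on Python 3 with
      a toy bytes subclass. BStr and _as_bytes are hypothetical names for this
      example; the real bstr is implemented in Cython and also covers ustr,
      which this sketch leaves out.

```python
def _as_bytes(x):
    """Coerce a supported string operand to bytes, or return None."""
    if isinstance(x, str):
        return x.encode('utf-8')
    if isinstance(x, (bytes, bytearray)):
        return bytes(x)
    return None


class BStr(bytes):
    """Toy stand-in for bstr, showing only the result types of + and *."""

    def __add__(self, other):               # BStr + any string -> BStr
        b = _as_bytes(other)
        if b is None:
            return NotImplemented
        return BStr(bytes(self) + b)

    def __radd__(self, other):              # any string + BStr
        b = _as_bytes(other)
        if b is None:
            return NotImplemented
        if isinstance(other, bytearray):    # mutable lhs stays mutable
            return bytearray(b + bytes(self))
        return BStr(b + bytes(self))        # bytes/str lhs is "upgraded"

    def __mul__(self, n):                   # BStr * int -> BStr
        return BStr(bytes(self) * n)

    __rmul__ = __mul__


x = BStr(b'a')
x += u'b'       # no __iadd__ defined, so += goes through __add__
```

      After the last statement x is still a BStr and holds b'ab'.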
      bbbb58f0
    • golang_str: bstr/ustr pickle support · ebd18f3f
      Kirill Smelkov authored
      Without explicitly overriding __reduce_ex__ pickling was failing for
      protocols < 2:
      
          _________________________ test_strings_pickle __________________________
      
              def test_strings_pickle():
                  bs = b("мир")
                  us = u("май")
      
                  #from pickletools import dis
                  for proto in range(0, pickle.HIGHEST_PROTOCOL):
          >           p_bs = pickle.dumps(bs, proto)
      
          golang/golang_str_test.py:282:
          _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
      
          self = b'\xd0\xbc\xd0\xb8\xd1\x80', proto = 0
      
              def _reduce_ex(self, proto):
          >       assert proto < 2
          E       RecursionError: maximum recursion depth exceeded in comparison
      
          /usr/lib/python3.9/copyreg.py:56: RecursionError
      
      See added comments for details.
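
      For illustration only, an explicit __reduce_ex__ override can look like
      the sketch below. BStr and roundtrip are hypothetical names, and the
      recipe shown (rebuild from plain bytes) is an assumption of this
      example, not necessarily what the patch does.

```python
import pickle


class BStr(bytes):
    """Toy stand-in for bstr with an explicit pickle recipe."""

    def __reduce_ex__(self, protocol):
        # Use the same recipe for every protocol: rebuild as BStr(<plain bytes>),
        # instead of relying on the default reduce machinery.
        return (BStr, (bytes(self),))


def roundtrip(obj, protocol):
    """Pickle and unpickle obj with the given protocol."""
    return pickle.loads(pickle.dumps(obj, protocol))
```

      A check over all protocols would then loop over
      range(pickle.HIGHEST_PROTOCOL + 1) and verify that roundtrip(bs, proto)
      has the same type and value as bs.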
      ebd18f3f
    • golang_str: bstr/ustr iteration · a72c1c1a
      Kirill Smelkov authored
      Even though bstr is semantically array of bytes, while ustr is array of
      unicode characters, iterating them _both_ yields unicode characters.
      This goes in line with Go approach described in "Strings, bytes, runes
      and characters in Go"[1] and allows for both ustr _and_ bstr to be used
      as strings in unicode world.
      
      Even though this diverges (just a bit) from str/py2 behaviour, and
      diverges more from bytes/py3 behaviour, I have not hit any problem in
      practice due to this divergence. In other words the semantics of
      bytestrings used in Go - to iterate them as unicode characters - is
      sound. For reference, it is the authors of Go who originally invented
      UTF-8 - see [2] for details.
      
      See also [3] for our discussion with Jérome on this topic.
      
      [1] https://blog.golang.org/strings
      [2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
      [3] nexedi/zodbtools!13 (comment 81646)
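
      The difference from py3 bytes iteration can be sketched as follows.
      iter_runes is a hypothetical helper for this example, and keeping
      undecodable bytes via surrogateescape is an assumption of the sketch.

```python
def iter_runes(data):
    """Yield the unicode characters of a UTF-8 bytestring, Go-style."""
    # Assumption: bytes that are not valid UTF-8 are kept as lone surrogates.
    for ch in data.decode('utf-8', 'surrogateescape'):
        yield ch


data = u'мир'.encode('utf-8')
as_ints = list(data)                # what plain py3 bytes iteration yields
as_runes = list(iter_runes(data))   # what string-like iteration yields
```

      Plain bytes iteration yields six integers here, while iterating by
      characters yields the three letters of the word.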
      a72c1c1a
    • golang_str: bstr/ustr index access · 04be919b
      Kirill Smelkov authored
      Implement access to bstr/ustr by [index] and by slice. The result of such
      [index] access - similarly to standard str - is the same bstr/ustr
      type with one character:
      
        - ustr[i] returns ustr with one unicode character taken from i'th character of original string, while
        - bstr[i] returns bstr with one byte taken from i'th byte of original bytestring.
      
      This follows str/unicode semantics on both py2/py3 and bytes semantics on
      py2, but diverges from bytes semantics on py3. I originally tried to
      follow bytes/py3 semantics - for bstr[i] to return an integer instead of
      a 1-byte character - but later found several compatibility breakages due
      to it. I contemplated this divergence for a long time and finally
      decided to follow string semantics for both ustr and bstr. This
      preserves backward compatibility with Python2 and also allows bstr
      to be a practically drop-in replacement for the str type.
      
      To get an ordinal corresponding to retrieved character, one can use
      standard `ord`, e.g. as in `ord(bstr[i])`. This will always return an
      integer for all bstr/ustr/str/unicode. Similarly to standard `chr` and
      `unichr`, we also provide two utility functions - `uchr` and `bbyte` to
      create 1-character and 1-byte ustr/bstr correspondingly.
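
      A small Python 3 sketch of the byte-side indexing semantics, using a
      toy bytes subclass (BStr is a hypothetical name; the real bstr is
      implemented differently):

```python
class BStr(bytes):
    """Toy stand-in for bstr: indexing returns a 1-byte BStr, not an int."""

    def __getitem__(self, idx):
        item = bytes.__getitem__(self, idx)
        if isinstance(idx, slice):
            return BStr(item)               # slice -> BStr
        return BStr(bytes((item,)))         # int -> 1-byte BStr


s = BStr(u'мир'.encode('utf-8'))
first = s[0]            # 1-byte string holding the first byte
code = ord(first)       # ord works on 1-byte strings
```

      Plain py3 bytes would return the integer directly from s[0]; here the
      integer is obtained explicitly via ord.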
      04be919b
    • golang_str: Add test for memoryview(bstr) · 105d03d4
      Kirill Smelkov authored
      Verify that it works as expected, and that memoryview(ustr) is rejected,
      because ustr is semantically array of unicode characters, not bytes.
      
      No change to the code - just add tests for current status which is
      already working as expected.
      105d03d4
    • golang_str: Teach b/u to accept objects with buffer interface · d7e55bb0
      Kirill Smelkov authored
      And to convert them to bstr/ustr, decoding buffer data as if it were
      bytes. This is needed if e.g. we have data in mmap or numpy.ndarray, and
      want to convert the data to string. The conversion is always explicit via
      a call to b/u. For bstr/ustr constructors, we preserve their
      behaviour to match the unicode constructor: they do not convert
      automatically, but instead stringify the object, e.g. as shown below:
      
          In [1]: bdata = b'hello 123'
      
          In [2]: mview = memoryview(bdata)
      
          In [3]: str(mview)
          Out[3]: '<memory at 0x7fb226b26700>'	# NOTE _not_ b'hello 123'
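
      The explicit conversion can be sketched with standard Python 3 only.
      bytes_from_buffer is a hypothetical helper for this example; it shows
      the idea of reading data through the buffer interface, not pygolang's
      actual b/u code.

```python
import array


def bytes_from_buffer(obj):
    """Return a copy of the raw bytes of any object exposing the buffer interface."""
    return memoryview(obj).tobytes()


mview = memoryview(b'hello 123')
stringified = str(mview)                        # '<memory at 0x...>'
converted = bytes_from_buffer(mview)            # b'hello 123'
from_array = bytes_from_buffer(array.array('B', [104, 105]))
text = bytes_from_buffer(bytearray(b'\xd0\xbc')).decode('utf-8')
```

      Stringifying the memoryview describes the object, while the explicit
      conversion returns the data it holds.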
      d7e55bb0
    • golang_str: Treat bytearray also as bytestring, just mutable · e4d5cb21
      Kirill Smelkov authored
      bytearray was introduced in Python as a mutable version of bytes. It has
      all string methods (e.g. .capitalize(), .islower(), etc), and it also
      supports % formatting. In other words it has all attributes of being a
      byte-string, with the only difference from bytes being that bytearray is
      mutable. This makes bytearray handy when a string is being
      constructed incrementally, step by step, without the overhead of
      creating and destroying many bytes objects.
      
      So, since bytearray is also a bytestring, similarly to bytes, let's add
      support to interoperate with bytearray to bstr and ustr:
      
      - b/u and bstr/ustr now accept bytearray as argument and treat it as bytestring.
      - bytearray() constructor, similarly to bytes() and unicode()
        constructors, now also accepts bstr/ustr and creates a bytearray object
        corresponding to the byte-stream of the input.
      
      For the latter point to work we need to patch bytearray.__init__() a bit,
      since, contrary to bytes.__init__(), it does not pay attention to
      whether provided argument has __bytes__ method or not.
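
      The difference between the two constructors can be observed on stock
      CPython 3 with a small experiment. The class name Greeting is made up
      for this example.

```python
class Greeting(object):
    """Object that is convertible to bytes only via __bytes__."""

    def __bytes__(self):
        return b'hello'


via_bytes = bytes(Greeting())           # bytes() honours __bytes__

try:
    bytearray(Greeting())               # bytearray() does not look at __bytes__
    bytearray_accepts = True
except TypeError:
    bytearray_accepts = False

# Without patching, the conversion has to go through bytes explicitly.
via_bytearray = bytearray(bytes(Greeting()))
```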
      e4d5cb21
    • golang_str: Implement bstr/ustr constructors · 781802d4
      Kirill Smelkov authored
      Both bstr and ustr constructors mimic constructor of unicode(= str on py3) -
      an object is either stringified, or decoded if it provides buffer
      interface, or the constructor is invoked with optional encoding and
      errors argument:
      
          # py2
          class unicode(basestring)
           |  unicode(object='') -> unicode object
           |  unicode(string[, encoding[, errors]]) -> unicode object
      
          # py3
          class str(object)
           |  str(object='') -> str
           |  str(bytes_or_buffer[, encoding[, errors]]) -> str
      
      Stringification of all bstr/ustr / unicode/bytes is handled
      automatically by converting to the created type via b or u.
      
      We follow unicode semantics for both ustr _and_ bstr, because bstr/ustr
      are intended to be used as strings.
      781802d4
    • golang_str: Teach bstr/ustr to compare wrt any string with automatic coercion · 54c2a3cf
      Kirill Smelkov authored
      So that e.g. `bstr == <any string type>` works. We want `bstr == ustr`
      to work because we intend those types to be interoperable. We also want
      e.g. `bstr == "a_string"` to work because we want bstr to be
      interoperable with standard strings. In general we want to have full
      automatic interoperability with all string types, so that e.g. `bstr == X`
      works for X being all bstr, ustr, unicode, bytes (and later bytearray).
      
      For now we add support only for comparison operators. But later, we
      will be adding support for e.g. +, string methods, etc - and in all
      those operations we will be following the same approach: to have
      automatic interoperability with all string types out of the box.
      
      The text added to README reflects this.
      
      The patch to unicode.tp_richcompare on py2 illustrates our approach to
      adjust builtin types when absolutely needed. In this particular case
      original builtin unicode.__eq__(unicode, bstr) is always returning False
      for non-ASCII bstr even despite bstr having .__unicode__() method. Our
      adjustment is non-intrusive - we adjust unicode behaviour only wrt bstr
      and it stays exactly the same as before wrt all other types.
      
      We anyway do that with care and add a test that verifies that behaviour
      of what we patched stays unaffected when used outside of bstr/ustr
      context.
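
      On Python 3, where no builtin needs patching for this case, comparison
      with automatic coercion can be sketched with a toy bytes subclass. BStr
      is a hypothetical name, and encoding str operands as UTF-8 is an
      assumption of the sketch.

```python
class BStr(bytes):
    """Toy stand-in for bstr: == works against both bytes and unicode."""

    def __eq__(self, other):
        if isinstance(other, str):
            other = other.encode('utf-8')   # coerce unicode to its UTF-8 bytes
        return bytes.__eq__(self, other)

    def __ne__(self, other):
        eq = self.__eq__(other)
        return eq if eq is NotImplemented else not eq

    # Defining __eq__ drops the inherited __hash__, so restore it.
    __hash__ = bytes.__hash__


bs = BStr(u'мир'.encode('utf-8'))
same_as_unicode = (bs == u'мир')
reflected = (u'мир' == bs)      # str.__eq__ gives up, then BStr.__eq__ is tried
```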
      54c2a3cf
    • golang_str: Infrastructure to patch builtin types · 34667355
      Kirill Smelkov authored
      _patch_slot(typ, slotname, func) installs func into typ's
      dict[slotname]. For example in the next patch we will need to adjust
      unicode.__eq__ on py2 so that it does not reject bstr by always assuming
      that `unicode == bstr` is False. We will do it by patching unicode.__eq__
      to first check whether rhs is bstr and handle that case with our code,
      while falling back to the original unicode.__eq__ for all other types.
      34667355
    • golang_str: Refresh b/u and bstr/ustr docstrings · 88b21b40
      Kirill Smelkov authored
      Document explicitly which types b/u accept and how they are handled.
      Change bstr/ustr docstrings to also be more explicit.
      
      Documentation changes only.
      88b21b40
    • golang_str: Make bytes(bstr) -> bstr, unicode(ustr) -> ustr · b7cda092
      Kirill Smelkov authored
      In other words, casting a pygolang string to bytes/unicode keeps it a
      pygolang string.
      
      Without the changes to bstr/ustr the added test fails, e.g.:
      
          >       assert bytes  (bs) is bs
          E       AssertionError: assert b'\xd0\xbc\xd0\xb8\xd1\x80' is b'\xd0\xbc\xd0\xb8\xd1\x80'
          E        +  where b'\xd0\xbc\xd0\xb8\xd1\x80' = bytes(b'\xd0\xbc\xd0\xb8\xd1\x80')
      
      in other words bytes(bstr) was creating a copy and changing type to bytes.
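
      On Python 3 the effect can be reproduced with toy bytes subclasses.
      BStr and PlainSub are hypothetical names, and returning self from
      __bytes__ is one way to get this behaviour, not necessarily how the
      patch implements it.

```python
class PlainSub(bytes):
    """bytes subclass without any special handling."""


class BStr(bytes):
    """Toy stand-in for bstr: bytes(x) returns x itself."""

    def __bytes__(self):
        return self


ps = PlainSub(b'abc')
bs = BStr(b'abc')

plain_copy = bytes(ps)      # a copy whose type is plain bytes
preserved = bytes(bs)       # the very same BStr object
```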
      b7cda092
    • golang_str: Extend tests a bit · 85c4615d
      Kirill Smelkov authored
      Extend current coverage for b/u tests by more explicitly verifying the
      resulting type (`type(·) is ...` instead of `isinstance(·, ...)`),
      verifying unicode(bstr)->ustr and bytes(ustr)->bstr, and str() of both
      bstr and ustr.
      
      Move the check for "no custom attributes" from test_qq to generic
      test_strings_basic, because now verified string types are publicly
      accessible, not only via qq.
      
      Small cosmetics in benchmarks - by reusing hereby introduced xbytes()
      utility.
      
      No change for the code itself - the tests just add verification to
      current status.
      85c4615d
  2. 08 Oct, 2022 1 commit
    • golang_str: Start exposing Pygolang string types publicly · 1f99393d
      Kirill Smelkov authored
      In 2020 in edc7aaab (golang: Teach qq to be usable with both bytes and
      str format whatever type qq argument is) I added custom bytes- and
      unicode- like types for qq to return instead of str with the idea for
      qq's result to be interoperable with both bytes and unicode. Citing that patch:
      
          qq is used to quote strings or byte-strings. The following example
          illustrates the problem we are currently hitting in zodbtools with
          Python3:
      
              >>> "hello %s" % qq("мир")
              'hello "мир"'
      
              >>> b"hello %s" % qq("мир")
              Traceback (most recent call last):
                File "<stdin>", line 1, in <module>
              TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
      
              >>> "hello %s" % qq(b("мир"))
              'hello "мир"'
      
              >>> b"hello %s" % qq(b("мир"))
              Traceback (most recent call last):
                File "<stdin>", line 1, in <module>
              TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
      
          i.e. one way or another if type of format string and what qq returns do not
          match it creates a TypeError.
      
          We want qq(obj) to be useable with both string and bytestring format.
      
          For that let's teach qq to return special str- and bytes- derived types that
          know how to automatically convert to str->bytes and bytes->str via b/u
          correspondingly. This way formatting works whatever types combination it was
          for format and for qq, and the whole result has the same type as format.
      
          For now we teach only qq to use new types and don't generally expose
          _str and _unicode to be returned by b and u yet. However we might do so
          in the future after incrementally gaining a bit more experience.
      
      So two years later I gained that experience and found that having a string
      type that can interoperate with both bytes and unicode is generally
      useful. It is useful for practical backward compatibility with Python2
      and for simplicity of programming, avoiding a constant stream of
      encode/decode noise. Thus the day to expose Pygolang string types for
      general use has come.
      
      This patch does the first small step: it exposes bytes- and unicode-
      like types (now named bstr and ustr) publicly. It switches b and u to
      return bstr and ustr correspondingly instead of bytes and unicode. This
      is a change in behaviour, but hopefully it should not break anything as
      there are not many b/u users currently and bstr and ustr are intended to
      be drop-in replacements for standard string types.
      
      Next patches will enhance bstr/ustr step by step until they really are
      drop-in replacements for standard string types.
      
      See nexedi/zodbtools!13 (comment 81646)
      for preliminary discussion from 2019.
      
      See also "Python 3 Losses: Nexedi Perspective"[1] and associated "cost
      overview"[2] for related presentation by Jean-Paul from 2018.
      
      [1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
      [2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
      1f99393d
  3. 05 Oct, 2022 2 commits
    • py.bench: Automatically discover benchmarks in test files · ffb40903
      Kirill Smelkov authored
      Since the beginning (9bf03d9c "py.bench: New command to benchmark python
      code similarly to `go test -bench`") py.bench was automatically
      discovering benchmarks in bench_*.py files only. This was inherited from
      wendelin.core which keeps its benchmarks in those files.
      
      However in pygolang, following Go convention(*), we already have several
      benchmarks that reside together with tests in the same *_test.py files.
      Currently just running py.bench does not discover them.
      
      -> Let's fix this and teach py.bench to automatically discover
      benchmarks in the test files by default as well.
      
      Pytest's default is to look for tests in test_*.py and *_test.py (+).
      Add those patterns and also keep bench_*.py for backward compatibility.
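
      The resulting file selection can be sketched with fnmatch. The names
      BENCH_FILE_PATTERNS and may_contain_benchmarks are made up for this
      example; it only illustrates the three glob patterns, not how py.bench
      passes them to pytest.

```python
from fnmatch import fnmatch

# pytest's default test-file patterns plus the original py.bench pattern
BENCH_FILE_PATTERNS = ('test_*.py', '*_test.py', 'bench_*.py')


def may_contain_benchmarks(filename):
    """Tell whether a file name matches any of the discovery patterns."""
    return any(fnmatch(filename, pat) for pat in BENCH_FILE_PATTERNS)


names = ['golang_str_test.py', 'test_sync.py', 'bench_chan.py', 'strconv.py']
selected = [n for n in names if may_contain_benchmarks(n)]
```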
      
      Before this patch running py.bench inside pygolang repository does not
      run any benchmark at all. After the patch py.bench runs all the
      benchmarks by default:
      
          (z-dev) kirr@deca:~/src/tools/go/pygolang$ py.bench
          ========================= test session starts ==========================
          platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
          rootdir: /home/kirr/src/tools/go/pygolang
          plugins: timeout-1.4.2, profiling-1.7.0, mock-2.0.0
          collected 18 items
      
          pymod: golang/golang_str_test.py
          Benchmarkstddecode              2000000 0.756 µs/op
          Benchmarkudecode                20000   74.359 µs/op
          Benchmarkstdencode              3000000 0.327 µs/op
          Benchmarkbencode                40000   32.613 µs/op
      
          pymod: golang/golang_test.py
          Benchmarkpyx_select_nogil       500000  2.051 µs/op
          Benchmarkpyx_go_nogil           90000   12.177 µs/op
          Benchmarkpyx_chan_nogil         600000  1.826 µs/op
          Benchmarkgo                     80000   13.267 µs/op
          Benchmarkchan                   500000  2.076 µs/op
          Benchmarkselect                 300000  3.835 µs/op
          Benchmarkdef                    30000000 0.035 µs/op
          Benchmarkfunc_def               40000   29.387 µs/op
          Benchmarkcall                   30000000 0.043 µs/op
          Benchmarkfunc_call              2000000 0.819 µs/op
          Benchmarktry_finally            20000000 0.096 µs/op
          Benchmarkdefer                  600000  1.755 µs/op
      
          pymod: golang/sync_test.py
          Benchmarkworkgroup_empty        40000   25.807 µs/op
          Benchmarkworkgroup_raise        40000   31.637 µs/op                     [100%]
      
          =========================== warnings summary ===========================
      
      (*) see https://pkg.go.dev/cmd/go#hdr-Test_packages
      (+) see https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_files
      
      /reviewed-by @jerome
      /reviewed-on !20
      ffb40903
    • golang_str: Speedup utf-8 decoding a bit on py2 · 9cb7b210
      Kirill Smelkov authored
      We recently moved our custom UTF-8 encoding/decoding routines to Cython.
      Now we can start taking advantage of C-level speedups to make our own
      UTF-8 decoder a bit less horribly slow on py2:
      
          name       old time/op  new time/op  delta
          stddecode   752ns ± 0%   743ns ± 0%   -1.19%  (p=0.000 n=9+10)
          udecode     216µs ± 0%    75µs ± 0%  -65.19%  (p=0.000 n=9+10)
          stdencode   328ns ± 2%   327ns ± 1%     ~     (p=0.252 n=10+9)
          bencode    34.1µs ± 1%  32.1µs ± 1%   -5.92%  (p=0.000 n=10+10)
      
      So it is ~ 3x speedup for u(), but still significantly slower compared
      to std unicode.decode('utf-8').
      
      Only low-hanging fruit is picked here: make _utf8_decode_rune a bit
      faster, since it sits in the innermost loop. In the future
      _utf8_decode_surrogateescape might be reworked as well to avoid
      constructing the resulting unicode via a py-level list of py-unicode
      character objects. And similarly for _utf8_encode_surrogateescape.
      
      On py3 the performance of std and u/b decode/encode is approximately the same.
      
      /trusted-by @jerome
      /reviewed-on nexedi/pygolang!19
      9cb7b210
  4. 04 Oct, 2022 4 commits
    • golang_str,strconv: Fix decoding of rune-error · 598eb479
      Kirill Smelkov authored
      Error rune (U+FFFD) is returned by _utf8_decode_rune to indicate an
      error in decoding. But the error rune itself is a valid unicode codepoint:
      
         >>> x = u"�"
         >>> x
         u'\ufffd'
         >>> x.encode('utf-8')
         '\xef\xbf\xbd'
      
      This way only (r=_rune_error, size=1) should be treated by the caller as
      a utf8 decoding error.
      
      But e.g. strconv.quote was not careful to also inspect the size, and this way
      was quoting � into just "\xef" instead of "\xef\xbf\xbd".
      _utf8_decode_surrogateescape was also subject to similar error.
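
      The distinction can be sketched in Python 3. decode_rune and quote below
      are simplified, hypothetical stand-ins for _utf8_decode_rune and
      strconv.quote: they return str instead of bstr, and quote escapes
      nothing but invalid bytes.

```python
RUNE_ERROR = u'\ufffd'


def decode_rune(data):
    """Decode the first UTF-8 character of data -> (rune, size).

    Invalid input gives (RUNE_ERROR, 1); empty input gives (RUNE_ERROR, 0).
    """
    if not data:
        return RUNE_ERROR, 0
    for size in (1, 2, 3, 4):
        try:
            return data[:size].decode('utf-8'), size
        except UnicodeDecodeError:
            continue
    return RUNE_ERROR, 1


def quote(data):
    """Quote a bytestring, hex-escaping only the bytes that are invalid UTF-8."""
    out = []
    while data:
        r, size = decode_rune(data)
        if r == RUNE_ERROR and size == 1:   # decoding error: check size too
            out.append('\\x%02x' % data[0])
        else:                               # valid character, U+FFFD included
            out.append(r)
        data = data[size:]
    return '"' + ''.join(out) + '"'
```

      A properly encoded U+FFFD decodes with size=3 and is kept as-is, while a
      lone \xef byte decodes with size=1 and gets escaped.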
      
      -> Fix it.
      
      Without the fix e.g. added test for strconv.quote fails as
      
          >           assert quote(tin) == tquoted
          E           assert '"\xef"' == '"�"'
          E             - "\xef"
          E             + "�"
      
      /reviewed-by @jerome
      /reviewed-at nexedi/pygolang!18
      598eb479
    • golang_str: Move py3/py2 conditioning into _utf8_{encode,decode}_surrogateescape · ea5abe71
      Kirill Smelkov authored
      So that those routines can simply be called and do what is expected
      without the caller caring whether it is py2 or py3. We will soon need to
      use those routines from several callsites, and having that py2/py3
      conditioning spread over all usage places would be inconvenient.
      
      /reviewed-by @jerome
      /reviewed-at !18
      ea5abe71
    • strconv: Move functionality related to UTF8 encode/decode into _golang_str · 50b8cb7e
      Kirill Smelkov authored
      - Move _utf8_decode_rune, _utf8_decode_surrogateescape, _utf8_encode_surrogateescape out from strconv into _golang_str
      - Factor _bstr/_ustr code into pyb/pyu. _bstr/_ustr become plain wrappers over pyb/pyu.
      - work-around emerged golang  strconv dependency with at-runtime import.
      
      Moved routines belong to the main part of golang strings processing
      -> their home should be in _golang_str.pyx
      
      /reviewed-by @jerome
      /reviewed-at nexedi/pygolang!18
      50b8cb7e
    • golang: Move strings-related code to _golang_str "submodule" · e72a459f
      Kirill Smelkov authored
      We are going to significantly extend py-strings related functionality soon
      - to the point where amount of strings related code will be
      approximately the same compared to the amount of all other
      python-related code inside golang module.
      
      -> First move everything related to py strings to dedicated
      _golang_str.pyx as a preparatory step.
      
      Keep that new file included from _golang.pyx instead of being real new
      module, because we want strings functionality to be provided by golang
      main namespace itself, and to ease internal code interdependencies.
      
      Plain code movement.
      
      /reviewed-by @jerome
      /reviewed-at nexedi/pygolang!18
      e72a459f
  5. 26 Jan, 2022 15 commits
  6. 08 Dec, 2021 1 commit