1. 16 Nov, 2022 2 commits
  2. 25 Oct, 2022 1 commit
    • golang_str: Fix bstr/ustr slice access on py2 · 300d7dfa
      Kirill Smelkov authored
      In the patch "golang_str: bstr/ustr index access" we added __getitem__
      implementation for bstr/ustr and thorough corresponding tests to cover
      all access cases: [i], [i:j] and [i:j:k].
      
      The tests, however, are run via pytest, which does AST rewriting and, as
      it turned out, always invokes __getitem__ for the [i:j] case, even on py2.
      This differs from plain python2 behaviour, which invokes __getslice__ for
      the [i:j] case if the __getslice__ slot is present.
      
      Since on py2 both str and unicode provide __getslice__ implementation,
      and bstr/ustr inherit from those types, they also inherit __getslice__.
      And oops, then on py2 e.g. bstr[i:j] was returning str instead of bstr:
      
          In [1]: bs = b('αβγ')
      
          In [2]: bs
          Out[2]: b('αβγ')
      
          In [3]: bs[0]
          Out[3]: b(b'\xce')
      
          In [4]: bs[0:1]
          Out[4]: '\xce'              <-- NOTE not b(...)
      
          In [5]: type(_)
          Out[5]: str                 <-- NOTE not bstr
      
      -> Fix it by explicitly whiting out __getslice__ slot for bstr and ustr.
      300d7dfa
  3. 09 Oct, 2022 24 commits
    • golang_str: Cosmetics · 859a55eb
      Kirill Smelkov authored
      859a55eb
    • golang_str: TODO UTF-8bk · c0a53847
      Kirill Smelkov authored
      bstr and ustr currently claim that:
      
        - bstr → ustr → bstr
          is always identity even if bytes data is not valid UTF-8,  and
      
        - ustr → bstr → ustr
          is always identity even if bytes data is not valid UTF-8.
      
      This is indeed true for any bytes data.
      
      But for some (incorrect) unicode, the conversion from ustr → bstr might
      currently fail as the following example demonstrates:
      
          # py3
          In [1]: x = u'\udc00'
      
          In [2]: x.encode('utf-8')
          UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
      
          In [3]: x.encode('utf-8', 'surrogateescape')
          UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
      
      I know how to fix this by adjusting UTF-8b(*) encoding process a bit,
      but I currently lack time to do it.
      
      -> Let's place corresponding todo entry.
      
      Please note, once again, that for arbitrary bytes input the conversion
      from bstr → ustr → bstr always succeeds and works ok already. And it is
      this particular conversion that is most relevant in practice.
      
      (*) aka surrogateescape in python speak. See
      http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html
      for original explanation from 2000.
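
      The asymmetry described above can be reproduced with Python 3's own
      surrogateescape handler, without pygolang. This is only a minimal
      sketch; roundtrip_bytes and encodes_ok are helper names introduced here
      for illustration and are not part of pygolang:

```python
def roundtrip_bytes(data):
    """Decode bytes with UTF-8b (surrogateescape) and encode them back."""
    text = data.decode('utf-8', 'surrogateescape')
    return text.encode('utf-8', 'surrogateescape')


def encodes_ok(text):
    """Report whether text can be encoded with utf-8 + surrogateescape."""
    try:
        text.encode('utf-8', 'surrogateescape')
        return True
    except UnicodeEncodeError:
        return False


# Arbitrary bytes, including invalid UTF-8, survive bytes -> str -> bytes.
for data in (b'\xce\xb1', b'\xff\xfe', b'abc\x80'):
    assert roundtrip_bytes(data) == data

# Surrogates U+DC80..U+DCFF stand for escaped bytes and encode fine,
# while a lone surrogate outside that range is rejected.
assert encodes_ok('\udcff')
assert not encodes_ok('\udc00')
```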
      c0a53847
    • golang_str: bstr/ustr encode/decode · 023907ee
      Kirill Smelkov authored
      So far we've overridden almost all string methods that bstr/ustr
      inherited from bytes and unicode. However, two of the methods remained
      intact until now: unicode.encode() and bytes.decode(). Let's override
      them too for completeness:
      
      - we want ustr.encode() to follow the signature of unicode.encode and for ustr.encode('utf-8') to return bstr.
      - for consistency we also want ustr.encode() to return the same type
        regardless of which encoding/errors pair is used in the arguments.
      - => ustr.encode() always returns bstr.
      - we want bstr.decode() to follow the signature of bytes.decode and for bstr.decode('utf-8') to return ustr.
      - for consistency we also want bstr.decode() to return the same type
        regardless of which encoding/errors pair is used in the arguments.
      - => bstr.decode() always returns ustr.
      
      So  ustr.encode() -> bstr  and  bstr.decode() -> ustr.
      
      Let's implement this carrying out encoding/decoding process internally
      similarly to regular bytes and unicode and wrapping the result into
      corresponding pygolang type at the end.
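
      The "encode/decode internally, then wrap" idea can be sketched on
      Python 3 as follows. BStr and UStr are hypothetical stand-ins for
      bstr/ustr, and the default arguments simply copy the standard
      signatures; the real pygolang types may choose different defaults:

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr."""

    def decode(self, encoding='utf-8', errors='strict'):
        # Decode exactly as bytes would, then wrap the result.
        return UStr(bytes.decode(self, encoding, errors))


class UStr(str):
    """Hypothetical stand-in for ustr."""

    def encode(self, encoding='utf-8', errors='strict'):
        # Encode exactly as str would, then wrap the result.
        return BStr(str.encode(self, encoding, errors))


us = UStr('мир')
bs = us.encode('cp1251')            # non-UTF-8 encoding still yields BStr
assert type(bs) is BStr
assert type(bs.decode('cp1251')) is UStr
assert bs.decode('cp1251') == us
```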
      023907ee
    • golang_str: bstr/ustr .format() support · 0985c583
      Kirill Smelkov authored
      Similarly to %-formatting, let's add support for .format(). This is
      easier to do because we can leverage string.Formatter and hook into the
      process by proper subclassing. We do not need to implement parsing and
      only need to customize handling of 's' and 'r' specifiers.
      
      For testing we mostly reuse existing tests for %-formatting by amending
      them a bit to exercise both %-formatting and format-formatting at the
      same time: by converting %-format specification into corresponding
      {}-format specification and verifying formatting result for that to be
      as expected.
      
      Some explicit tests for {}-style .format() are also added.
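
      To illustrate hooking into {}-formatting by subclassing, here is a
      minimal Python 3 sketch built on the standard string.Formatter class.
      BytesAsTextFormatter is a hypothetical name, and the sketch only
      handles top-level bytes arguments; it is not the pygolang
      implementation:

```python
import string


class BytesAsTextFormatter(string.Formatter):
    """Render bytes arguments as UTF-8 text for '', '!s' and '!r'."""

    def convert_field(self, value, conversion):
        if isinstance(value, bytes) and conversion in (None, 's', 'r'):
            text = value.decode('utf-8', 'surrogateescape')
            if conversion == 'r':
                return 'b' + repr(text)     # e.g. b'β' instead of b'\xce\xb2'
            return text
        # Everything else keeps the standard behaviour.
        return super().convert_field(value, conversion)


f = BytesAsTextFormatter()
assert f.format('{}', b'\xce\xb2') == 'β'
assert f.format('{!r}', b'\xce\xb2') == "b'β'"
assert f.format('{:>3}', 7) == '  7'        # non-bytes arguments are untouched
```

      Parsing of the format string is left entirely to string.Formatter;
      only the conversion step is customized.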
      0985c583
    • golang_str: bstr/ustr %-formatting · 390fd810
      Kirill Smelkov authored
      Teach bstr/ustr to do % formatting similarly to how unicode does, but
      with treating bytes as UTF8-encoded strings - all in line with
      general idea for bstr/ustr to treat bytes as strings.
      
      The following approach is used to implement this:
      
      1. both bstr and ustr format via bytes-based _bprintf.
      2. we parse the format string and handle every formatting specifier separately:
      3. for formats besides %s/%r we use bytes.__mod__ directly.
      
      4. for %s we stringify corresponding argument specially with all, potentially
         internal, bytes instances treated as UTF8-encoded strings:
      
            '%s' % b'\xce\xb2'      ->  "β"
            '%s' % [b'\xce\xb2']    ->  "['β']"
      
      5. for %r, similarly to %s, we prepare repr of corresponding argument
         specially with all, potentially internal, bytes instances also treated as
         UTF8-encoded strings:
      
            '%r' % b'\xce\xb2'      ->  "b'β'"
            '%r' % [b'\xce\xb2']    ->  "[b'β']"
      
      For "2" we implement %-format parsing ourselves. test_strings_mod
      has good coverage for this phase to make sure we get it right and behaving
      exactly the same way as standard Python does.
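
      As a rough illustration of what parsing a %-format string involves,
      the sketch below splits a format into literal text and conversion
      specifiers with a regular expression. It is an assumption-laden
      simplification (for instance it does not handle nested parentheses in
      mapping keys) and is not the parser added by this patch; parse_format
      is a name made up for the example:

```python
import re

_SPEC = re.compile(r"""
    %
    (?:\((?P<key>[^)]*)\))?         # optional mapping key: %(name)s
    (?P<flags>[-+ #0]*)             # conversion flags
    (?P<width>\*|\d+)?              # minimum field width
    (?:\.(?P<precision>\*|\d*))?    # precision
    [hlL]?                          # length modifier, ignored by Python
    (?P<conv>[diouxXeEfFgGcrsab%])  # conversion type
""", re.VERBOSE)


def parse_format(fmt):
    """Return a list of literal strings and (key, conv, spec_text) tuples."""
    parts = []
    pos = 0
    while pos < len(fmt):
        i = fmt.find('%', pos)
        if i < 0:
            parts.append(fmt[pos:])         # trailing literal text
            break
        if i > pos:
            parts.append(fmt[pos:i])        # literal text before the specifier
        m = _SPEC.match(fmt, i)
        if m is None:
            raise ValueError('invalid format specifier at index %d' % i)
        parts.append((m.group('key'), m.group('conv'), m.group(0)))
        pos = m.end()
    return parts


assert parse_format('%s = %(x)r') == [
    (None, 's', '%s'), ' = ', ('x', 'r', '%(x)r')]
```

      With the specifiers isolated this way, %s and %r can be handled
      specially while every other specifier is passed on unchanged.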
      
      For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
      from under bstr.__mod__(). See _bstringify for details.
      
      For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
      See _bstringify_repr for details.
      
      I initially tried to avoid parsing format specification myself and
      wanted to reuse original bytes.__mod__ and just adjust its behaviour
      a bit somehow. This did not work quite right, as the following comment
      explains:
      
          # Rejected alternative: try to format; if we get "TypeError: %b requires a
          # bytes-like object ..." retry with that argument converted to bstr.
          #
          # Rejected because e.g. for  `%(x)s %(x)r` % {'x': obj}`  we need to use
          # access number instead of key 'x' to determine which accesses to
          # bstringify. We could do that, but unfortunately on Python2 the access
          # number is not easily predictable because string could be upgraded to
          # unicode in the midst of being formatted and so some access keys will be
          # accessed more than once.
          #
          # Another reason for rejection: b'%r' and u'%r' handle arguments
          # differently - on b %r is aliased to %a.
      
      That's why full %-format parsing and handling is implemented in this
      patch. Once again to make sure its behaviour is really the same compared
      to Python's builtin %-formatting, we have good test coverage for both
      %-format parsing itself, and for actual formatting of many various cases.
      
      See test_strings_mod for details.
      390fd810
    • golang_str: Teach bstr/ustr to stringify bytes as UTF-8 bytestrings even inside containers · ddf6958b
      Kirill Smelkov authored
      bstr/ustr constructors either convert or stringify their argument. For
      example bstr(u'α') gives b('α') while bstr(1) gives b('1'). And if the
      argument is bytes, bstr treats it as UTF-8 encoded bytestring:
      
          >>> x = u'β'.encode()
          >>> x
          b'\xce\xb2'
          >>> bstr(x)
          b('β')
      
      however if that same bytes argument is placed inside container - e.g. inside
      list - currently it is not stringified as bytestring:
      
          >>> bstr([x])
          b("[b'\\xce\\xb2']")	<-- NOTE not b("['β']")
      
      which is not consistent with our intended approach that bstr/ustr treat
      bytes in their arguments as UTF-8 encoded strings.
      
      This happens because when a list is stringified, list.__str__
      implementation goes through its items and invokes __repr__ of each
      item. And in general a container might be arbitrarily deep, e.g. dict
      -> list -> list -> bytes, and even when stringifying that deep dict, we
      want to handle that leaf bytes as UTF-8 encoded string.
      
      There are many containers in Python - lists, tuples, dicts,
      collections.OrderedDict, collections.UserDict, collections.namedtuple,
      collections.defaultdict, etc, and also there are many user-defined
      containers - including implemented at C level - which we can not even
      know all in advance.
      
      It means that we cannot do some kind of deep/recursive typechecking
      inside bstringify and implement a parallel stringification of
      arbitrarily complex structures with adjusted stringification of bytes.
      Nor can we create an object clone - for stringification - with bytes
      instances replaced with str (e.g. via DeepReplacer - see recent previous
      patch), and then stringify the clone. That would generally be incorrect,
      because in this approach we cannot know whether an object is being
      stringified as it is, or whether it is being used internally for data
      storage and is not stringified directly. In the latter case if we
      replace bytes with unicode, it might break internal invariant of custom
      container class and break its logic.
      
      What we can do, however, is to hook into bytes.__repr__ implementation
      and detect whether it is called from under bstringify. If it is, we
      know we should adjust it and treat this bytes as bytestring. Otherwise
      we use original bytes.__repr__ implementation. This way we can handle
      arbitrarily complex data structures.
      
      This patch implements that approach for bytes, unicode on py2, and for
      bytearray. See added comments that start with
      
          # patch bytes.{__repr__,__str__} and ...
      
      for details.
      
      After this patch stringification of bytes inside containers treats them
      as UTF-8 bytestrings:
      
          >>> bstr([x])
          b("['β']")
      ddf6958b
    • golang_str: bstr/ustr string methods · ff24be3d
      Kirill Smelkov authored
      Take all str/unicode methods, such as .capitalize(), .split(), .join(),
      etc, and implement them for bstr/ustr. For example bstr.split() behaves
      like unicode.split(), but returns list of bstr instead of list of
      unicode. And similarly for all other methods.
      
      Organize testing of this by verifying every method's behaviour on
      unicode and on bstr/ustr. If the results match modulo deep replacement
      of unicode with bstr/ustr - everything is ok.
      ff24be3d
    • golang_str: tests: Deep replacer · 2c20c055
      Kirill Smelkov authored
      deepReplace returns an object's clone with all internal objects
      selected by a predicate replaced via the provided replacement function.
      We will use this functionality in the following patches to organize
      testing of bstr/ustr methods: a method would be first invoked on regular
      str, and then on bstr/ustr, and the results will be compared against
      each other.
      The results are usually different, because e.g. u'a b c'.split() returns
      [u'a', u'b', u'c'] while b('a b c').split() should return
      [b('a'), b('b'), b('c')]. We want to make sure that the second result is
      exactly the first result with all instances of unicode replaced by bstr.
      That's where deep replacer will be used.
      
      The deep replacement itself is implemented via pickle reduce/rebuild
      protocol: we unassemble and reconstruct objects. And while an object is
      unassembled, we try to apply the replacement recursively. Since this is
      not so trivial functionality, it itself also comes with a test.
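
      For intuition, a much simplified deep replacement can be written for
      plain builtin containers only. The sketch below does not use the
      pickle reduce/rebuild protocol described above and therefore does not
      cover arbitrary objects; deep_replace and Tagged are hypothetical
      names:

```python
def deep_replace(obj, predicate, replace):
    """Return a copy of obj with every matching inner object replaced.

    Simplification: only exact list/tuple/set/frozenset/dict are traversed.
    """
    if predicate(obj):
        return replace(obj)

    def rec(x):
        return deep_replace(x, predicate, replace)

    if type(obj) in (list, tuple, set, frozenset):
        return type(obj)(rec(x) for x in obj)
    if type(obj) is dict:
        return {rec(k): rec(v) for k, v in obj.items()}
    return obj      # leaf that is neither selected nor a known container


class Tagged(str):
    """Stand-in for a custom string type such as bstr."""


result = deep_replace(['a', ('b', {'c': ['d']}), 1],
                      lambda x: type(x) is str, Tagged)
assert type(result[0]) is Tagged
assert type(result[1][0]) is Tagged
assert result == ['a', ('b', {'c': ['d']}), 1]
```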
      2c20c055
    • golang_str: Fix bstr.tp_print(flags=print_repr) · 510cf8d1
      Kirill Smelkov authored
      On py2 objects are printed via their .tp_repr slot with flags=0
      (contrary to Py_PRINT_RAW which requests to print str -
      https://docs.python.org/2.7/c-api/object.html#c.PyObject_Print)
      
      We were not handling repr'ing inside our tp_print implementation, and
      as a result e.g. b('мир') was printed on interactive console as
      '\xd0\xbc\xd0\xb8\xd1\x80' instead of b('мир').
      
      Fix it.
      510cf8d1
    • golang_str: bstr/ustr repr · 386844d3
      Kirill Smelkov authored
      Teach bstr/ustr to provide repr of themselves: it goes as b(...) and
      u(...) where ... stands for human-readable repr of contained data.
      Human-readable means that printable non-ASCII unicode characters are
      shown as-is instead of being escaped, for example:
      
          >>> x = u'αβγ'
          >>> x
          'αβγ'
          >>> y = b(x)
          >>> y
          b('αβγ')				<-- NOTE not b(b'\xce\xb1\xce\xb2\xce\xb3')
          >>> x.encode('utf-8')
          b'\xce\xb1\xce\xb2\xce\xb3'
      386844d3
    • strconv, golang_str: Switch quote, unquote and qq to always return bstr · 604a7765
      Kirill Smelkov authored
      bstr is becoming the default pygolang string type. And it can be mixed
      ok with all bytes/unicode and ustr. Previously e.g. strconv.quote was
      checking which kind of type its input was and was trying to return the
      result of the same type. Now this becomes unnecessary since bstr is
      intended to be used universally and interoperable with all other string
      types.
      604a7765
    • golang_str: bstr/ustr support for + and * · bbbb58f0
      Kirill Smelkov authored
      Add support for +, *, += and *= operators to bstr and ustr.
      
      For * rhs should be integer and the result, similarly to std strings, is
      repetition of lhs rhs times.
      
      For + the other argument could be any supported string - bstr/ustr /
      unicode/bytes/bytearray. And the result is always bstr or ustr:
      
          u()   +     *     ->  u()
          b()   +     *     ->  b()
          u''   +  u()/b()  ->  u()
          u''   +  u''      ->  u''
          b''   +  u()/b()  ->  b()
          b''   +      b''  ->  b''
          barr  +  u()/b()  ->  barr
      
      in particular if lhs is bstr or ustr, the result will remain exactly of
      original lhs type. This should be handy when one has e.g. bstr at hand
      and wants to incrementally append something to it.
      
      And if lhs is bytes/unicode, but we append bstr/ustr to it, we "upgrade"
      the result to bstr/ustr correspondingly. Only if lhs is bytearray does
      the result stay bytearray, because it is logical for the object being
      appended to to remain mutable if it was mutable in the beginning.
      
      As before bytearray.__add__ and friends need to be patched a bit for
      bytearray not to reject ustr.
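
      The bytes-side rows of the table can be approximated on Python 3 with
      a bytes subclass that defines __add__, __radd__ and __mul__. BStr and
      _to_bytes are hypothetical names, the ustr half of the table is left
      out, and no builtin type is patched here, so this is only a partial
      sketch of the behaviour:

```python
def _to_bytes(x):
    """Coerce str/bytes/bytearray to plain bytes (str is encoded as UTF-8)."""
    if isinstance(x, str):
        return x.encode('utf-8', 'surrogateescape')
    return bytes(x)


class BStr(bytes):
    """Hypothetical stand-in for bstr that keeps its type under + and *."""

    def __add__(self, other):
        # b() + *  ->  b()
        return BStr(bytes(self) + _to_bytes(other))

    def __radd__(self, other):
        if isinstance(other, bytearray):
            # barr + b()  ->  barr: a mutable lhs stays mutable
            return bytearray(_to_bytes(other) + bytes(self))
        if isinstance(other, bytes):
            # b'' + b()  ->  b()
            return BStr(_to_bytes(other) + bytes(self))
        return NotImplemented       # the ustr-related rows are not modelled

    def __mul__(self, n):
        return BStr(bytes(self) * n)

    __rmul__ = __mul__


bs = BStr(b'a')
bs += 'β'                           # incremental append keeps the lhs type
assert type(bs) is BStr and bs == 'aβ'.encode('utf-8')
assert type(b'x' + BStr(b'y')) is BStr
assert type(bytearray(b'x') + BStr(b'y')) is bytearray
assert type(BStr(b'ab') * 2) is BStr
```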
      bbbb58f0
    • golang_str: bstr/ustr pickle support · ebd18f3f
      Kirill Smelkov authored
      Without explicitly overriding __reduce_ex__ pickling was failing for
      protocols < 2:
      
          _________________________ test_strings_pickle __________________________
      
              def test_strings_pickle():
                  bs = b("мир")
                  us = u("май")
      
                  #from pickletools import dis
                  for proto in range(0, pickle.HIGHEST_PROTOCOL):
          >           p_bs = pickle.dumps(bs, proto)
      
          golang/golang_str_test.py:282:
          _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
      
          self = b'\xd0\xbc\xd0\xb8\xd1\x80', proto = 0
      
              def _reduce_ex(self, proto):
          >       assert proto < 2
          E       RecursionError: maximum recursion depth exceeded in comparison
      
          /usr/lib/python3.9/copyreg.py:56: RecursionError
      
      See added comments for details.
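
      What an explicit __reduce_ex__ override can look like is sketched below
      for Python 3. BStr is a hypothetical stand-in whose __bytes__ returns
      self, mimicking bytes(bstr) -> bstr. The assumption here is that the
      generic reduction for protocols < 2 ends up calling bytes(obj) for the
      state and so never reaches plain bytes; the actual cause and fix are
      explained in the comments added by the patch:

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr."""

    def __bytes__(self):
        # Mimic "bytes(bstr) -> bstr": casting keeps the custom type.
        return self

    def __reduce_ex__(self, protocol):
        # Go through memoryview to obtain exact bytes; bytes(self) would
        # just return self again because of __bytes__ above.
        raw = bytes(memoryview(self))
        return (BStr, (raw,))


bs = BStr('мир'.encode('utf-8'))
assert bytes(bs) is bs

rebuild, args = bs.__reduce_ex__(0)
assert type(args[0]) is bytes           # the state is plain bytes
clone = rebuild(*args)
assert type(clone) is BStr and clone == bs
```

      pickle consults __reduce_ex__ for every protocol, so returning a
      constructor plus plain-bytes arguments gives it something it can
      serialize without recursing back into the custom type.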
      ebd18f3f
    • golang_str: bstr/ustr iteration · a72c1c1a
      Kirill Smelkov authored
      Even though bstr is semantically array of bytes, while ustr is array of
      unicode characters, iterating them _both_ yields unicode characters.
      This goes in line with Go approach described in "Strings, bytes, runes
      and characters in Go"[1] and allows for both ustr _and_ bstr to be used
      as strings in unicode world.
      
      Even though this diverges (just a bit) from str/py2 str behaviour, and
      diverges more from bytes/py3 behaviour, I have not hit any problem in
      practice due to this divergence. In other words the semantics of
      bytestring used in Go - to iterate them as unicode characters - is
      sound. For reference, it is the authors of Go who originally invented
      UTF-8 - see [2] for details.
      
      See also [3] for our discussion with Jérome on this topic.
      
      [1] https://blog.golang.org/strings
      [2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
      [3] nexedi/zodbtools!13 (comment 81646)
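
      The difference between iterating bytes as integers and iterating them
      as unicode characters can be seen with plain Python 3. iter_chars is a
      hypothetical helper that approximates the described behaviour with the
      standard surrogateescape handler; pygolang's own decoder may treat
      invalid input differently:

```python
def iter_chars(data):
    """Iterate a bytestring as unicode characters rather than integers."""
    return iter(data.decode('utf-8', 'surrogateescape'))


data = 'мир'.encode('utf-8')

# Plain bytes on py3 iterate as integers, one per byte.
assert list(data) == [0xd0, 0xbc, 0xd0, 0xb8, 0xd1, 0x80]

# Go-like iteration yields one item per character.
assert list(iter_chars(data)) == ['м', 'и', 'р']
```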
      a72c1c1a
    • golang_str: bstr/ustr index access · 04be919b
      Kirill Smelkov authored
      Implement access to bstr/ustr by [index] and by slice. Such [index]
      access - similarly to standard str - returns the same bstr/ustr
      type with one character:
      
        - ustr[i] returns ustr with one unicode character taken from i'th character of original string, while
        - bstr[i] returns bstr with one byte taken from i'th byte of original bytestring.
      
      This follows str/unicode semantics on both py2/py3, bytes semantics on
      py2, but diverges from bytes semantics on py3. I originally tried to
      follow bytes/py3 semantics - for bstr to return an integer instead of
      1-byte character, but later found several compatibility breakages due to
      it. I contemplated this divergence for a long time and finally
      took the decision to follow string semantics for both ustr and bstr. This
      preserves backward compatibility with Python2 and also allows for bstr
      to be practically drop-in replacement for str type.
      
      To get an ordinal corresponding to retrieved character, one can use
      standard `ord`, e.g. as in `ord(bstr[i])`. This will always return an
      integer for all bstr/ustr/str/unicode. Similarly to standard `chr` and
      `unichr`, we also provide two utility functions - `uchr` and `bbyte` to
      create 1-character and 1-byte ustr/bstr correspondingly.
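
      The string-like indexing semantics for a bytestring can be sketched on
      Python 3 with a bytes subclass. BStr is a hypothetical stand-in; the
      real bstr is implemented differently and also covers py2:

```python
class BStr(bytes):
    """Hypothetical stand-in for bstr with str-like item access."""

    def __getitem__(self, idx):
        x = bytes.__getitem__(self, idx)
        if isinstance(idx, slice):
            return BStr(x)              # [i:j] and [i:j:k] -> BStr
        return BStr(bytes((x,)))        # [i] -> 1-byte BStr, not an int


bs = BStr('αβγ'.encode('utf-8'))

assert bytes(bs)[0] == 0xce             # plain py3 bytes give an integer
assert bs[0] == b'\xce' and type(bs[0]) is BStr
assert type(bs[0:2]) is BStr
assert ord(bs[0]) == 0xce               # ord() recovers the ordinal
```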
      04be919b
    • golang_str: Add test for memoryview(bstr) · 105d03d4
      Kirill Smelkov authored
      Verify that it works as expected, and that memoryview(ustr) is rejected,
      because ustr is semantically array of unicode characters, not bytes.
      
      No change to the code - just add tests for current status which is
      already working as expected.
      105d03d4
    • golang_str: Teach b/u to accept objects with buffer interface · d7e55bb0
      Kirill Smelkov authored
      And to convert them to bstr/ustr decoding buffer data as if it was
      bytes. This is needed if e.g. we have data in mmap or numpy.ndarray, and
      want to convert the data to string. The conversion is always explicit via
      explicit call to b/u. And for bstr/ustr constructors, we preserve their
      behaviour to match unicode constructor not to convert automatically, but
      instead to stringify the object, e.g. as shown below:
      
          In [1]: bdata = b'hello 123'
      
          In [2]: mview = memoryview(bdata)
      
          In [3]: str(mview)
          Out[3]: '<memory at 0x7fb226b26700>'	# NOTE _not_ b'hello 123'
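
      The distinction between stringifying a buffer object and explicitly
      decoding its data can be sketched with the Python 3 standard library
      alone. text_from_buffer is a hypothetical helper playing the role of
      an explicit u(...) call; it is not pygolang code:

```python
import array


def text_from_buffer(obj):
    """Explicitly decode the data of any buffer-protocol object as UTF-8."""
    return bytes(memoryview(obj)).decode('utf-8', 'surrogateescape')


mview = memoryview(b'hello 123')

# Stringification describes the object, it does not decode its data.
assert str(mview).startswith('<memory at ')

# Explicit conversion decodes the underlying bytes.
assert text_from_buffer(mview) == 'hello 123'
assert text_from_buffer(bytearray('мир'.encode('utf-8'))) == 'мир'
assert text_from_buffer(array.array('B', b'abc')) == 'abc'
```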
      d7e55bb0
    • golang_str: Treat bytearray also as bytestring, just mutable · e4d5cb21
      Kirill Smelkov authored
      bytearray was introduced in Python as a mutable version of bytes. It has
      all string methods (e.g. .capitalize(), .islower(), etc), and it also
      supports % formatting. In other words it has all attributes of being a
      byte-string, with the only difference from bytes in that bytearray is
      mutable. This makes bytearray handy to have when a string is
      being incrementally constructed step by step without hitting overhead of
      many bytes objects creation/destruction.
      
      So, since bytearray is also a bytestring, similarly to bytes, let's add
      support to interoperate with bytearray to bstr and ustr:
      
      - b/u and bstr/ustr now accept bytearray as argument and treat it as bytestring.
      - bytearray() constructor, similarly to bytes() and unicode()
        constructors, now also accepts bstr/ustr and creates bytearray object
        corresponding to byte-stream of input.
      
      For the latter point to work we need to patch bytearray.__init__() a bit,
      since, contrary to bytes.__init__(), it does not pay attention to
      whether provided argument has __bytes__ method or not.
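
      The asymmetry between the two constructors can be observed on CPython 3
      with a small class that only provides __bytes__. HasBytes and
      bytearray_accepts are hypothetical names used for this demonstration:

```python
class HasBytes:
    """Object convertible to bytes only through the __bytes__ protocol."""

    def __bytes__(self):
        return b'abc'


def bytearray_accepts(obj):
    """Report whether bytearray(obj) succeeds."""
    try:
        bytearray(obj)
        return True
    except TypeError:
        return False


obj = HasBytes()
assert bytes(obj) == b'abc'             # bytes() honours __bytes__
assert not bytearray_accepts(obj)       # bytearray() does not look at it
assert bytearray(bytes(obj)) == bytearray(b'abc')   # explicit detour works
```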
      e4d5cb21
    • golang_str: Implement bstr/ustr constructors · 781802d4
      Kirill Smelkov authored
      Both bstr and ustr constructors mimic constructor of unicode(= str on py3) -
      an object is either stringified, or decoded if it provides buffer
      interface, or the constructor is invoked with optional encoding and
      errors argument:
      
          # py2
          class unicode(basestring)
           |  unicode(object='') -> unicode object
           |  unicode(string[, encoding[, errors]]) -> unicode object
      
          # py3
          class str(object)
           |  str(object='') -> str
           |  str(bytes_or_buffer[, encoding[, errors]]) -> str
      
      Stringification of all bstr/ustr / unicode/bytes is handled
      automatically with the meaning to convert to created type via b or u.
      
      We follow unicode semantic for both ustr _and_ bstr, because bstr/ustr
      are intended to be used as strings.
      781802d4
    • golang_str: Teach bstr/ustr to compare wrt any string with automatic coercion · 54c2a3cf
      Kirill Smelkov authored
      So that e.g. `bstr == <any string type>` works. We want `bstr == ustr`
      to work because we intend those types to be interoperable. We also want
      e.g. `bstr == "a_string"` to work because we want bstr to be
      interoperable with standard strings. In general we want to have full
      automatic interoperability with all string types, so that e.g. `bstr == X`
      works for X being all bstr, ustr, unicode, bytes (and later bytearray).
      
      For now we add support only for comparison operators. But later, we
      will be adding support for e.g. +, string methods, etc - and in all
      those operations we will be following the same approach: to have
      automatic interoperability with all string types out of the box.
      
      The text added to README reflects this.
      
      The patch to unicode.tp_richcompare on py2 illustrates our approach to
      adjust builtin types when absolutely needed. In this particular case
      original builtin unicode.__eq__(unicode, bstr) is always returning False
      for non-ASCII bstr even despite bstr having .__unicode__() method. Our
      adjustment is non-intrusive - we adjust unicode behaviour only wrt bstr
      and it stays exactly the same as before wrt all other types.
      
      Nevertheless, we do that with care and add a test that verifies that
      behaviour of what we patched stays unaffected when used outside of
      bstr/ustr context.
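
      Comparison with automatic coercion can be sketched on Python 3 without
      patching any builtin type, because there str.__eq__ returns
      NotImplemented for non-str operands and Python then tries the
      reflected comparison. BStr and _coerce are hypothetical names; the py2
      unicode patch discussed above is not modelled:

```python
def _coerce(x):
    """Return x as plain bytes, or None if x is not a string-like object."""
    if isinstance(x, str):
        return x.encode('utf-8', 'surrogateescape')
    if isinstance(x, (bytes, bytearray)):
        return bytes(x)
    return None


class BStr(bytes):
    """Hypothetical stand-in for bstr comparable with any string type."""

    def __eq__(self, other):
        other = _coerce(other)
        if other is None:
            return NotImplemented
        return bytes.__eq__(self, other)

    def __ne__(self, other):
        eq = self.__eq__(other)
        return eq if eq is NotImplemented else not eq

    __hash__ = bytes.__hash__       # defining __eq__ would otherwise drop it


bs = BStr('мир'.encode('utf-8'))
assert bs == 'мир' and 'мир' == bs
assert bs == bytearray('мир'.encode('utf-8'))
assert bs != 'май'
```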
      54c2a3cf
    • golang_str: Infrastructure to patch builtin types · 34667355
      Kirill Smelkov authored
      _patch_slot(typ, slotname, func) installs func into typ's
      dict[slotname]. For example in the next patch we will need to adjust
      unicode.__eq__ on py2 so that it stops rejecting bstr by always assuming
      that `unicode == bstr` is False. We will do it via patching
      unicode.__eq__ to first check whether rhs is bstr and handle that with
      our code, while falling back to original unicode.__eq__ for all other
      types.
      34667355
    • golang_str: Refresh b/u and bstr/ustr docstrings · 88b21b40
      Kirill Smelkov authored
      Document explicitly which types b/u accept and how they are handled.
      Change bstr/ustr docstrings to also be more explicit.
      
      Documentation changes only.
      88b21b40
    • golang_str: Make bytes(bstr) -> bstr, unicode(ustr) -> ustr · b7cda092
      Kirill Smelkov authored
      In other words, casting a pygolang string to bytes/unicode keeps it a
      pygolang string.
      
      Without the changes to bstr/ustr the added test fails, e.g.
      
          >       assert bytes  (bs) is bs
          E       AssertionError: assert b'\xd0\xbc\xd0\xb8\xd1\x80' is b'\xd0\xbc\xd0\xb8\xd1\x80'
          E        +  where b'\xd0\xbc\xd0\xb8\xd1\x80' = bytes(b'\xd0\xbc\xd0\xb8\xd1\x80')
      
      in other words bytes(bstr) was creating a copy and changing type to bytes.
      b7cda092
    • golang_str: Extend tests a bit · 85c4615d
      Kirill Smelkov authored
      Extend current coverage for b/u tests more explicitly verifying
      resulting type (`type(·) is ...` instead of `isinstance(·, ...)`),
      verifying unicode(bstr)->ustr and bytes(ustr)->bstr, and str() of both
      bstr and ustr.
      
      Move the check for "no custom attributes" from test_qq to generic
      test_strings_basic, because now verified string types are publicly
      accessible, not only via qq.
      
      Small cosmetics in benchmarks - by reusing hereby introduced xbytes()
      utility.
      
      No change for the code itself - the tests just add verification to
      current status.
      85c4615d
  4. 08 Oct, 2022 1 commit
    • golang_str: Start exposing Pygolang string types publicly · 1f99393d
      Kirill Smelkov authored
      In 2020 in edc7aaab (golang: Teach qq to be usable with both bytes and
      str format whatever type qq argument is) I added custom bytes- and
      unicode- like types for qq to return instead of str with the idea for
      qq's result to be interoperable with both bytes and unicode. Citing that patch:
      
          qq is used to quote strings or byte-strings. The following example
          illustrates the problem we are currently hitting in zodbtools with
          Python3:
      
              >>> "hello %s" % qq("мир")
              'hello "мир"'
      
              >>> b"hello %s" % qq("мир")
              Traceback (most recent call last):
                File "<stdin>", line 1, in <module>
              TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
      
              >>> "hello %s" % qq(b("мир"))
              'hello "мир"'
      
              >>> b"hello %s" % qq(b("мир"))
              Traceback (most recent call last):
                File "<stdin>", line 1, in <module>
              TypeError: %b requires a bytes-like object, or an object that implements __bytes__, not 'str'
      
          i.e. one way or another if type of format string and what qq returns do not
          match it creates a TypeError.
      
          We want qq(obj) to be useable with both string and bytestring format.
      
          For that let's teach qq to return special str- and bytes- derived types that
          know how to automatically convert to str->bytes and bytes->str via b/u
          correspondingly. This way formatting works whatever types combination it was
          for format and for qq, and the whole result has the same type as format.
      
          For now we teach only qq to use new types and don't generally expose
          _str and _unicode to be returned by b and u yet. However we might do so
          in the future after incrementally gaining a bit more experience.
      
      So two years later I gained that experience and found that having string
      type, that can interoperate with both bytes and unicode, is generally
      useful. It is useful for practical backward compatibility with Python2
      and for simplicity of programming avoiding constant stream of
      encode/decode noise. Thus the day to expose Pygolang string types for
      general use has come.
      
      This patch does the first small step: it exposes bytes- and unicode-
      like types (now named as bstr and ustr) publicly. It switches b and u to
      return bstr and ustr correspondingly instead of bytes and unicode. This
      is a change in behaviour, but hopefully it should not break anything as
      there are not many b/u users currently and bstr and ustr are intended to
      be drop-in replacements for standard string types.
      
      Next patches will enhance bstr/ustr step by step to be actually drop-in
      replacements for standard string types for real.
      
      See nexedi/zodbtools!13 (comment 81646)
      for preliminary discussion from 2019.
      
      See also "Python 3 Losses: Nexedi Perspective"[1] and associated "cost
      overview"[2] for related presentation by Jean-Paul from 2018.
      
      [1] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
      [2] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
      1f99393d
  5. 05 Oct, 2022 2 commits
    • py.bench: Automatically discover benchmarks in test files · ffb40903
      Kirill Smelkov authored
      Since the beginning (9bf03d9c "py.bench: New command to benchmark python
      code similarly to `go test -bench`") py.bench was automatically
      discovering benchmarks in bench_*.py files only. This was inherited from
      wendelin.core which keeps its benchmarks in those files.
      
      However in pygolang, following Go convention(*), we already have several
      benchmarks that reside together with tests in same *_test.py files. And
      currently just running py.bench does not discover them.
      
      -> Let's fix this and teach py.bench to automatically discover
      benchmarks in the test files by default as well.
      
      Pytest's default is to look for tests in test_*.py and *_test.py (+).
      Add those patterns and also keep bench_*.py for backward compatibility.
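
      The effect of the three filename patterns can be illustrated with the
      standard fnmatch module. This is only a sketch of the matching rule;
      PATTERNS and is_candidate are names made up for the example, and
      py.bench's actual discovery is delegated to pytest's own machinery:

```python
from fnmatch import fnmatch

# Patterns named in the text: pytest's defaults plus the legacy bench_*.py.
PATTERNS = ('test_*.py', '*_test.py', 'bench_*.py')


def is_candidate(filename):
    """Report whether filename would be searched for benchmarks."""
    return any(fnmatch(filename, pattern) for pattern in PATTERNS)


assert is_candidate('golang_str_test.py')   # Go-style *_test.py
assert is_candidate('bench_sync.py')        # legacy wendelin.core style
assert not is_candidate('golang_str.py')    # regular module is skipped
```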
      
      Before this patch running py.bench inside pygolang repository does not
      run any benchmark at all. After the patch py.bench runs all the
      benchmarks by default:
      
          (z-dev) kirr@deca:~/src/tools/go/pygolang$ py.bench
          ========================= test session starts ==========================
          platform linux2 -- Python 2.7.18, pytest-4.6.11, py-1.10.0, pluggy-0.13.1
          rootdir: /home/kirr/src/tools/go/pygolang
          plugins: timeout-1.4.2, profiling-1.7.0, mock-2.0.0
          collected 18 items
      
          pymod: golang/golang_str_test.py
          Benchmarkstddecode              2000000 0.756 µs/op
          Benchmarkudecode                20000   74.359 µs/op
          Benchmarkstdencode              3000000 0.327 µs/op
          Benchmarkbencode                40000   32.613 µs/op
      
          pymod: golang/golang_test.py
          Benchmarkpyx_select_nogil       500000  2.051 µs/op
          Benchmarkpyx_go_nogil           90000   12.177 µs/op
          Benchmarkpyx_chan_nogil         600000  1.826 µs/op
          Benchmarkgo                     80000   13.267 µs/op
          Benchmarkchan                   500000  2.076 µs/op
          Benchmarkselect                 300000  3.835 µs/op
          Benchmarkdef                    30000000        0.035 µs/op
          Benchmarkfunc_def               40000   29.387 µs/op
          Benchmarkcall                   30000000        0.043 µs/op
          Benchmarkfunc_call              2000000 0.819 µs/op
          Benchmarktry_finally            20000000        0.096 µs/op
          Benchmarkdefer                  600000  1.755 µs/op
      
          pymod: golang/sync_test.py
          Benchmarkworkgroup_empty        40000   25.807 µs/op
          Benchmarkworkgroup_raise        40000   31.637 µs/op                     [100%]
      
          =========================== warnings summary ===========================
      
      (*) see https://pkg.go.dev/cmd/go#hdr-Test_packages
      (+) see https://docs.pytest.org/en/7.1.x/reference/reference.html#confval-python_files
      
      /reviewed-by @jerome
      /reviewed-on nexedi/pygolang!20
      ffb40903
    • Kirill Smelkov's avatar
      golang_str: Speedup utf-8 decoding a bit on py2 · 9cb7b210
      Kirill Smelkov authored
      We recently moved our custom UTF-8 encoding/decoding routines to Cython.
      Now we can start taking advantage of C-level speedups to make our own
      UTF-8 decoder a bit less horribly slow on py2:
      
          name       old time/op  new time/op  delta
          stddecode   752ns ± 0%   743ns ± 0%   -1.19%  (p=0.000 n=9+10)
          udecode     216µs ± 0%    75µs ± 0%  -65.19%  (p=0.000 n=9+10)
          stdencode   328ns ± 2%   327ns ± 1%     ~     (p=0.252 n=10+9)
          bencode    34.1µs ± 1%  32.1µs ± 1%   -5.92%  (p=0.000 n=10+10)
      
      So it is a ~3x speedup for u(), but it is still significantly slower
      than std unicode.decode('utf-8').
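
      The ~3x figure can be rechecked from the rounded udecode timings in the
      table. This is a small arithmetic sketch only. Because it starts from the
      rounded 216µs and 75µs values, the delta it computes differs slightly
      from the -65.19% in the table, which was presumably derived from
      unrounded measurements.

```python
# Rounded udecode timings from the table above, in microseconds per op.
old_us = 216.0
new_us = 75.0

# Speedup factor: how many times faster the new code is.
speedup = old_us / new_us

# Relative change in time per operation, in percent.
delta_pct = (new_us - old_us) / old_us * 100

print("speedup: %.2fx" % speedup)
print("delta:   %.2f%%" % delta_pct)
```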
      
      This is only low-hanging fruit to make _utf8_decode_rune a bit faster,
      since it sits in the innermost loop. In the future
      _utf8_decode_surrogateescape might be reworked as well to avoid
      constructing the resulting unicode via a py-level list of py-unicode character
      objects. And similarly for _utf8_encode_surrogateescape.
      
      On py3 the performance of std and u/b decode/encode is approximately the same.
      
      /trusted-by @jerome
      /reviewed-on !19
      9cb7b210
  6. 04 Oct, 2022 4 commits
    • Kirill Smelkov's avatar
      golang_str,strconv: Fix decoding of rune-error · 598eb479
      Kirill Smelkov authored
      The error rune (U+FFFD) is returned by _utf8_decode_rune to indicate an
      error in decoding. But the error rune itself is a valid unicode codepoint:
      
         >>> x = u"�"
         >>> x
         u'\ufffd'
         >>> x.encode('utf-8')
         '\xef\xbf\xbd'
      
      This way only (r=_rune_error, size=1) should be treated by the caller as
      a UTF-8 decoding error.
      
      But e.g. strconv.quote was not careful to also inspect the size, and so
      was quoting � into just "\xef" instead of "\xef\xbf\xbd".
      _utf8_decode_surrogateescape was also subject to a similar error.
      
      -> Fix it.
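
      The distinction can be sketched in plain Python 3 as below. This is a
      simplified illustration, not the pygolang implementation: decode_rune
      and quote_bytes are hypothetical stand-ins for _utf8_decode_rune and
      strconv.quote, and quote_bytes only escapes invalid bytes (it does not
      escape quotes, control characters and so on).

```python
RUNE_ERROR = 0xFFFD  # U+FFFD, also a perfectly valid codepoint on its own


def decode_rune(data):
    """Decode the first UTF-8 sequence of data and return (rune, size).

    Returns (RUNE_ERROR, 1) for an invalid leading byte and
    (RUNE_ERROR, 0) for empty input. Hypothetical helper.
    """
    if not data:
        return RUNE_ERROR, 0
    for size in range(1, 5):  # a UTF-8 sequence is 1 to 4 bytes long
        try:
            ch = data[:size].decode("utf-8")
        except UnicodeDecodeError:
            continue
        return ord(ch), size
    return RUNE_ERROR, 1


def quote_bytes(data):
    """Quote data, escaping only bytes that are not valid UTF-8."""
    out = []
    i = 0
    while i < len(data):
        r, size = decode_rune(data[i:])
        if r == RUNE_ERROR and size == 1:
            # Real decoding error: emit the offending byte as \xNN.
            out.append("\\x%02x" % data[i])
        else:
            # Valid rune, including a genuine U+FFFD of size 3.
            out.append(chr(r))
        i += size
    return '"' + "".join(out) + '"'
```

      Checking only r == RUNE_ERROR, without the size, would treat a genuine
      U+FFFD as an error and consume just its first byte, which is the
      behaviour this commit fixes.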
      
      Without the fix, e.g. the added test for strconv.quote fails as
      
          >           assert quote(tin) == tquoted
          E           assert '"\xef"' == '"�"'
          E             - "\xef"
          E             + "�"
      
      /reviewed-by @jerome
      /reviewed-at nexedi/pygolang!18
      598eb479
    • Kirill Smelkov's avatar
      golang_str: Move py3/py2 conditioning into _utf8_{encode,decode}_surrogateescape · ea5abe71
      Kirill Smelkov authored
      So that those routines can simply be called and do what is expected
      without the caller caring whether it is py2 or py3. We will soon need to
      use those routines from several call sites, and having that py2/py3
      conditioning spread over all usage places would be inconvenient.
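
      The idea of keeping the version check inside the routine can be sketched
      as below. The function names mirror the commit title, but the bodies are
      assumptions for illustration only. In particular the py2 branch is left
      unimplemented here, since py2 has no built-in "surrogateescape" error
      handler and pygolang uses its own decoder for that case.

```python
import sys

PY3 = sys.version_info[0] >= 3


def utf8_decode_surrogateescape(data):
    """Decode bytes to unicode, keeping invalid bytes as lone surrogates."""
    if PY3:
        return data.decode("utf-8", "surrogateescape")
    # Assumption: on py2 a custom decoder would be called here.
    raise NotImplementedError("py2 branch is not part of this sketch")


def utf8_encode_surrogateescape(text):
    """Encode unicode to bytes, turning escaped surrogates back into bytes."""
    if PY3:
        return text.encode("utf-8", "surrogateescape")
    # Assumption: on py2 a custom encoder would be called here.
    raise NotImplementedError("py2 branch is not part of this sketch")
```

      Callers then use the two functions without any version check of their
      own, so the py2/py3 branching lives in exactly one place.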
      
      /reviewed-by @jerome
      /reviewed-at !18
      ea5abe71
    • Kirill Smelkov's avatar
      strconv: Move functionality related to UTF8 encode/decode into _golang_str · 50b8cb7e
      Kirill Smelkov authored
      - Move _utf8_decode_rune, _utf8_decode_surrogateescape, _utf8_encode_surrogateescape out from strconv into _golang_str
      - Factor _bstr/_ustr code into pyb/pyu. _bstr/_ustr become plain wrappers over pyb/pyu.
      - work-around emerged golang  strconv dependency with at-runtime import.
      
      The moved routines belong to the main part of golang strings processing
      -> their home should be in _golang_str.pyx
      
      /reviewed-by @jerome
      /reviewed-at nexedi/pygolang!18
      50b8cb7e
    • Kirill Smelkov's avatar
      golang: Move strings-related code to _golang_str "submodule" · e72a459f
      Kirill Smelkov authored
      We are going to significantly extend py-strings related functionality soon
      - to the point where the amount of strings-related code will be
      approximately the same as the amount of all other
      python-related code inside the golang module.
      
      -> First move everything related to py strings to a dedicated
      _golang_str.pyx as a preparatory step.
      
      Keep that new file included from _golang.pyx instead of making it a real new
      module, because we want strings functionality to be provided by the golang
      main namespace itself, and to ease internal code interdependencies.
      
      Plain code movement.
      
      /reviewed-by @jerome
      /reviewed-at !18
      e72a459f
  7. 26 Jan, 2022 6 commits
    • Kirill Smelkov's avatar
      pygolang v0.1 · 7b72d418
      Kirill Smelkov authored
      7b72d418
    • Kirill Smelkov's avatar
      golang: Fix print(_pystr) · 08dc5d10
      Kirill Smelkov authored
      On Python2, without .tp_print, printing _pystr crashes as:
      
          pygolang$ ./golang/testprog/golang_test_str.py
          Traceback (most recent call last):
            File "./golang/testprog/golang_test_str.py", line 39, in <module>
              main()
            File "./golang/testprog/golang_test_str.py", line 34, in main
              print("print(qq(b)):", qq(sb))
          RuntimeError: print recursion
      
      See added comments for details.
      08dc5d10
    • Kirill Smelkov's avatar
      os += ReadFile · 2a35ef5b
      Kirill Smelkov authored
      Add a convenient utility to read a whole file and return its content,
      similarly to Go. The code is taken from wendelin.core:
      
      https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.cpp#L246-281
      2a35ef5b
    • Kirill Smelkov's avatar
      Nogil signals · e18adbab
      Kirill Smelkov authored
      Provide an os/signal package that can be used to set up signal delivery to nogil
      channels. This way, for user code, signal handling becomes regular handling of a
      signalling channel instead of being something special or limited to the main
      python thread only. The rationale for why we need it is explained below:
      
      There are several problems with regular python's stdlib signal module:
      
      1. Python2 does not call signal handler from under blocked lock.acquire.
         This means that if the main thread is blocked waiting on a semaphore,
         signal delivery will be delayed indefinitely, similarly to e.g. problem
         described in nxdtest!14 (comment 147527)
         where raising KeyboardInterrupt is delayed after SIGINT for many,
         potentially unbounded, seconds until ~semaphore wait finishes.
      
         Note that Python3 does not have this problem wrt stdlib locks and
         semaphores, but read below for the next point.
      
      2. all pygolang communication operations (channels send/recv, sync.Mutex,
         sync.RWMutex, sync.Sema, sync.WaitGroup, sync.WorkGroup, ...) run with
         GIL released, but if blocked do not handle EINTR and do not schedule
         python signal handler to run (on main thread).
      
         Even if we could theoretically adjust this behaviour of pygolang at python
         level to match Python3, there are also C++ and pyx/nogil worlds. And we want gil
         and nogil worlds to interoperate (see https://pypi.org/project/pygolang/#cython-nogil-api),
         so that e.g. if completely nogil code happens to run on the main thread,
         signal handling is still possible, even if that signal handling was set up at
         python level.
      
      With signals delivered to nogil channels, both the nogil world and the python
      world can set up signal handlers and be notified of signals regardless
      of whether the main python thread is currently blocked in a nogil wait or not.
      
      /reviewed-on !17
      e18adbab
    • Kirill Smelkov's avatar
      golang: Provide __pystr internally · ce507f4e
      Kirill Smelkov authored
      To convert an object to str of current python.
      It will be handy to use __pystr when implementing __str__ methods.
      
      /reviewed-on !17
      ce507f4e
    • Kirill Smelkov's avatar
      Nogil IO · 4690460b
      Kirill Smelkov authored
      Provide a C++ package "os" with File, Pipe, etc., similar to what is
      provided on the Go side. The package works through IO methods provided by
      runtimes.
      
      We need the IO facility because the os/signal package will need to use
      a pipe in cooperative IO mode in its receiving-loop goroutine.
      
      os.h and os.cpp are based on drafts from wendelin.core:
      
      https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.h
      https://lab.nexedi.com/nexedi/wendelin.core/blob/wendelin.core-2.0.alpha1-18-g38dde766/wcfs/client/wcfs_misc.cpp
      
      /reviewed-on nexedi/pygolang!17
      4690460b