X golang_str: Fix iter(bstr) to yield byte instead of unicode character

Things were initially implemented to follow Go semantics exactly, with bytestring iteration yielding unicode characters as explained in https://blog.golang.org/strings. However, this makes bstr not a 100% drop-in compatible replacement for std str under py2, and even though my initial testing suggested that this change does not affect programs in practice, that turned out not to be the case.

For example, with bstr.__iter__ yielding unicode characters, running gpython on py2 will sometimes break when importing uuid. There uuid reads 16 bytes from /dev/random, then iterates those 16 bytes as single bytes, and expects the length of the resulting sequence to be exactly 16:

    int = long(('%02x'*16) % tuple(map(ord, bytes)), 16)

( https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Lib/uuid.py#L147 )

which breaks if some of the read bytes are higher than 0x7f.

Even though this particular problem could be worked around by patching uuid, there is no evidence that similar problems will not show up later, and there could be many of them.

-> So adjust bstr semantics instead to follow the semantics of str under py2, and introduce the uiter() primitive to still be able to iterate bytestrings as unicode characters. This hopefully makes bstr fully compatible with str on py2, while still providing a reasonably good approach for processing strings the Go way when needed.

Add biter as well for symmetry.
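The uuid failure described above can be reproduced in miniature with plain Python 3 types. This is only an illustrative sketch: it does not import pygolang, `raw` is made-up input standing in for the 16 random bytes, and `bytes.decode("utf-8")` stands in for the old character-wise `bstr.__iter__`. The count drops below 16 as soon as some of the bytes happen to form a multi-byte UTF-8 sequence.

```python
# Illustration with plain Python 3 types only; pygolang is not imported here.
# `raw` is hypothetical input: 16 bytes whose first two bytes are the UTF-8
# encoding of the single character 'п'.
raw = b"\xd0\xbf" + b"\x00" * 14

# Byte-wise iteration (what py2 str does and what uuid expects): 16 items.
as_bytes = [raw[i:i + 1] for i in range(len(raw))]

# Character-wise iteration (stand-in for the old bstr.__iter__): 15 items,
# because the first two bytes collapse into one character.
as_chars = list(raw.decode("utf-8"))

fmt = "%02x" * 16

# With 16 items the uuid-style formatting works.
hex_ok = fmt % tuple(ord(b) for b in as_bytes)

# With 15 items there are not enough arguments for the format string.
try:
    fmt % tuple(ord(c) for c in as_chars)
    failed = False
except TypeError:
    failed = True
```

Here `hex_ok` is built successfully from the byte-wise sequence, while the character-wise sequence makes the same formatting expression raise `TypeError`.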

Commit cb0e6055 authored May 07, 2024 by Kirill Smelkov
5 changed files
--- a/README.rst
+++ b/README.rst
@@ -241,12 +241,16 @@ The conversion, in both encoding and decoding, never fails and never looses
 information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
 even if bytes data is not valid UTF-8.
+Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.
 Semantically `bstr` is array of bytes, while `ustr` is array of
-unicode-characters. Accessing their elements by `[index]` yields byte and
+unicode-characters. Accessing their elements by `[index]` and iterating them yield byte and
-unicode character correspondingly [*]_. Iterating them, however, yields unicode
+unicode character correspondingly [*]_. However it is possible to yield unicode
-characters for both `bstr` and `ustr`. In practice `bstr` is enough 99% of the
+character when iterating `bstr` via `uiter`, and to yield byte character when
-time, and `ustr` only needs to be used for random access to string characters.
+iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of
-See `Strings, bytes, runes and characters in Go`__ for overview of this approach.
+the time, and `ustr` only needs to be used for random access to string
+characters.  See `Strings, bytes, runes and characters in Go`__ for overview of
+this approach.
 __ https://blog.golang.org/strings
@@ -267,7 +271,7 @@ Usage example::
   s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
   s += ' мир'          # s is b('привет мир')
-   for c in s:          # c will iterate through
+   for c in uiter(s):   # c will iterate through
        ...             #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
   # the following gives b('привет мир труд май')

--- a/golang/__init__.py
+++ b/golang/__init__.py
 # -*- coding: utf-8 -*-
-# Copyright (C) 2018-2023  Nexedi SA and Contributors.
+# Copyright (C) 2018-2024  Nexedi SA and Contributors.
 #                          Kirill Smelkov <kirr@nexedi.com>
 #
 # This program is free software: you can Use, Study, Modify and Redistribute
@@ -24,7 +24,7 @@
 - `func` allows to define methods separate from class.
 - `defer` allows to schedule a cleanup from the main control flow.
 - `error` and package `errors` provide error chaining.
-- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
+- `b`, `u`, `bstr`/`ustr` and `biter`/`uiter` provide uniform UTF8-based approach to strings.
 - `gimport` allows to import python modules by full path in a Go workspace.
 See README for thorough overview.
@@ -36,7 +36,8 @@ from __future__ import print_function, absolute_import
 __version__ = "0.1"
 __all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
-           'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'bbyte', 'uchr', 'gimport']
+           'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'biter', 'uiter', 'bbyte', 'uchr',
+           'gimport']
 import setuptools_dso
 setuptools_dso.dylink_prepare_dso('golang.runtime.libgolang')
@@ -323,4 +324,6 @@ from ._golang import    \
    pybbyte     as bbyte,   \
    pyu         as u,       \
    pyustr      as ustr,    \
-    pyuchr      as uchr
+    pyuchr      as uchr,    \
+    pybiter     as biter,   \
+    pyuiter     as uiter
--- a/golang/_golang_str.pyx
+++ b/golang/_golang_str.pyx
@@ -141,7 +141,7 @@ cpdef pyb(s): # -> bstr
          b(u(bytes_input))  is bstr with the same data as bytes_input.
-       See also: u, bstr/ustr.
+       See also: u, bstr/ustr, biter/uiter.
    """
    bs = _pyb(pybstr, s)
    if bs is None:
@@ -164,7 +164,7 @@ cpdef pyu(s): # -> ustr
          u(b(unicode_input))  is ustr with the same data as unicode_input.
-       See also: b, bstr/ustr.
+       See also: b, bstr/ustr, biter/uiter.
    """
    us = _pyu(pyustr, s)
    if us is None:
@@ -280,8 +280,6 @@ cdef __pystr(object obj): # -> ~str
        return pyb(obj)
-# XXX -> bchr ?  (not good as "character" means "unicode character")
-#     -> bstr.chr ?
 def pybbyte(int i): # -> 1-byte bstr
    """bbyte(i) returns 1-byte bstr with ordinal i."""
    return pyb(bytearray([i]))
@@ -318,11 +316,11 @@ cdef class _pybstr(bytes):   # https://github.com/cython/cython/issues/711
    is always identity even if bytes data is not valid UTF-8.
-    Semantically bstr is array of bytes. Accessing its elements by [index]
+    Semantically bstr is array of bytes. Accessing its elements by [index] and
-    yields byte character. Iterating through bstr, however, yields unicode
+    iterating it yield byte character. However it is possible to yield unicode
-    characters. In practice bstr is enough 99% of the time, and ustr only
+    character when iterating bstr via uiter. In practice bstr + uiter is enough
-    needs to be used for random access to string characters. See
+    99% of the time, and ustr only needs to be used for random access to string
-    https://blog.golang.org/strings for overview of this approach.
+    characters. See https://blog.golang.org/strings for overview of this approach.
    Operations in between bstr and ustr/unicode / bytes/bytearray coerce to bstr.
    When the coercion happens, bytes and bytearray, similarly to bstr, are also
@@ -337,7 +335,7 @@ cdef class _pybstr(bytes):   # https://github.com/cython/cython/issues/711
      to bstr. See b for details.
    - otherwise bstr will have string representation of the object.
-    See also: b, ustr/u.
+    See also: b, ustr/u, biter/uiter.
    """
    # XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
@@ -414,10 +412,13 @@ cdef class _pybstr(bytes):   # https://github.com/cython/cython/issues/711
            else:
                return pyb(x)
-    # __iter__  - yields unicode characters
+    # __iter__
    def __iter__(self):
-        # TODO iterate without converting self to u
+        if PY_MAJOR_VERSION >= 3:
-        return pyu(self).__iter__()
+            return _pybstrIter(zbytes.__iter__(self))
+        else:
+            # on python 2 str does not have .__iter__
+            return PySeqIter_New(self)
    # __contains__
@@ -668,8 +669,8 @@ cdef class _pyustr(unicode):
    elements by [index] yields unicode characters.
    ustr complements bstr and is meant to be used only in situations when
-    random access to string characters is needed. Otherwise bstr is more
+    random access to string characters is needed. Otherwise bstr + uiter is
-    preferable and should be enough 99% of the time.
+    more preferable and should be enough 99% of the time.
    Operations in between ustr and bstr/bytes/bytearray / unicode coerce to ustr.
    When the coercion happens, bytes and bytearray, similarly to bstr, are also
@@ -678,7 +679,7 @@ cdef class _pyustr(unicode):
    ustr constructor, similarly to the one in bstr, accepts arbitrary objects
    and stringify them. Please refer to bstr and u documentation for details.
-    See also: u, bstr/b.
+    See also: u, bstr/b, biter/uiter.
    """
    # XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
@@ -983,17 +984,43 @@ cdef PyObject* _pyustr_tp_new(PyTypeObject* _cls, PyObject* _argv, PyObject* _kw
 assert sizeof(_pyustr) == sizeof(PyUnicodeObject)
-# _pyustrIter wraps unicode iterator to return pyustr for each yielded character.
+# _pybstrIter wraps bytes   iterator to return pybstr for each yielded byte.
+cdef class _pybstrIter:
+    cdef object zbiter
+    def __init__(self, zbiter):
+        self.zbiter = zbiter
+    def __iter__(self):
+        return self
+    def __next__(self):
+        x = next(self.zbiter)
+        if PY_MAJOR_VERSION >= 3:
+            return pybbyte(x)
+        else:
+            return pyb(x)
+# _pyustrIter wraps zunicode iterator to return pyustr for each yielded character.
 cdef class _pyustrIter:
-    cdef object uiter
+    cdef object zuiter
-    def __init__(self, uiter):
+    def __init__(self, zuiter):
-        self.uiter = uiter
+        self.zuiter = zuiter
    def __iter__(self):
        return self
    def __next__(self):
-        x = next(self.uiter)
+        x = next(self.zuiter)
        return pyu(x)
+def pybiter(obj):
+    """biter(obj) is like iter(b(obj)) but  TODO: iterates object incrementally
+    without doing full conversion to bstr."""
+    return iter(pyb(obj))   # TODO iterate obj directly
+def pyuiter(obj):
+    """uiter(obj) is like iter(u(obj)) but  TODO: iterates object incrementally
+    without doing full conversion to ustr."""
+    return iter(pyu(obj))   # TODO iterate obj directly
 # _pyustrTranslateTab wraps table for .translate to return bstr as unicode
 # because unicode.translate does not accept bstr values.
 cdef class _pyustrTranslateTab:
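The iterator-wrapping approach used by `_pybstrIter` in the hunk above can be sketched in pure Python 3. The names `ByteIter`, `biter_sketch` and `uiter_sketch` are hypothetical stand-ins, not part of pygolang, and the sketch uses strict UTF-8 with builtin `bytes`/`str` instead of the library's `b`/`u` conversion and `bstr`/`ustr` types.

```python
class ByteIter:
    """Hypothetical stand-in for _pybstrIter: wraps a bytes iterator and
    returns every element as a 1-byte bytes object instead of an int."""

    def __init__(self, zbiter):
        self.zbiter = zbiter

    def __iter__(self):
        return self

    def __next__(self):
        x = next(self.zbiter)   # an int on Python 3; StopIteration propagates
        return bytes([x])


def biter_sketch(obj):
    """Iterate obj as bytes: text is encoded to UTF-8 first."""
    data = obj.encode("utf-8") if isinstance(obj, str) else bytes(obj)
    return ByteIter(iter(data))


def uiter_sketch(obj):
    """Iterate obj as unicode characters: bytes are decoded from UTF-8 first."""
    text = obj if isinstance(obj, str) else bytes(obj).decode("utf-8")
    return iter(text)


s = "мир"
byte_items = list(biter_sketch(s))                   # six 1-byte items
char_items = list(uiter_sketch(s.encode("utf-8")))   # three characters
```

Like `pybiter` and `pyuiter` in the diff, both helpers convert the whole object up front and only then iterate; iterating the object incrementally is left as a TODO there as well.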

--- a/golang/golang_str_test.py
+++ b/golang/golang_str_test.py
@@ -21,7 +21,7 @@
 from __future__ import print_function, absolute_import
 import golang
-from golang import b, u, bstr, ustr, bbyte, uchr, func, defer, panic
+from golang import b, u, bstr, ustr, biter, uiter, bbyte, uchr, func, defer, panic
 from golang._golang import _udata, _bdata
 from golang.gcompat import qq
 from golang.strconv_test import byterange
@@ -617,35 +617,38 @@ def test_strings_index2():
 # verify strings iteration.
 def test_strings_iter():
+    # iter(u/unicode) + uiter(*) -> iterate unicode characters
+    # iter(b/bytes)   + biter(*) -> iterate byte    characters
    us = u("миру мир"); u_ = u"миру мир"
-    bs = b("миру мир")
+    bs = b("миру мир"); b_ = xbytes("миру мир"); a_ = xbytearray(b_)
-    # iter( b/u/unicode ) -> iterate unicode characters
+    # XIter verifies that going through all given iterators produces the same type and results.
-    # NOTE that iter(b) too yields unicode characters - not integers or bytes
+    missing=object()
-    #bi  = iter(bs)         # XXX temp disabled
-    bi  = iter(us)
-    ui  = iter(us)
-    ui_ = iter(u_)
    class XIter:
+        def __init__(self, typok, *viter):
+            self.typok = typok
+            self.viter = viter
        def __iter__(self):
            return self
-        def __next__(self, missing=object):
+        def __next__(self):
-            x = next(bi, missing)
+            vnext = []
-            y = next(ui, missing)
+            for it in self.viter:
-            z = next(ui_, missing)
+                obj = next(it, missing)
-            assert type(x) is type(y)
+                vnext.append(obj)
-            if x is not missing:
+            if missing in vnext:
-                assert type(x) is ustr
+                assert vnext == [missing]*len(self.viter)
-            if z is not missing:
-                assert type(z) is unicode
-            assert x == y
-            assert y == z
-            if x is missing:
                raise StopIteration
-            return x
+            for obj in vnext:
+                assert type(obj) is self.typok
+                assert obj == vnext[0]
+            return vnext[0]
        next = __next__ # py2
-    assert list(XIter()) == ['м','и','р','у',' ','м','и','р']
+    assert list(XIter(ustr, iter(us), uiter(us), uiter(u_), uiter(bs), uiter(b_), uiter(a_))) == \
+                ['м','и','р','у',' ','м','и','р']
+    assert list(XIter(bstr, iter(bs), biter(us), biter(u_), biter(bs), biter(b_), biter(a_))) == \
+                [b'\xd0',b'\xbc',b'\xd0',b'\xb8',b'\xd1',b'\x80',b'\xd1',b'\x83',b' ',
+                 b'\xd0',b'\xbc',b'\xd0',b'\xb8',b'\xd1',b'\x80']
 # verify .encode/.decode .
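The lockstep check that the reworked `XIter` performs can also be written as a generator. The sketch below is a simplified variant under stated assumptions: `lockstep` is a hypothetical name, it compares the sentinel by identity rather than with `in`, and it is exercised with builtin `str` iterators rather than `bstr`/`ustr`.

```python
_missing = object()   # sentinel marking an exhausted iterator


def lockstep(typok, *iters):
    """Advance all iterators in step and yield the common item.

    Asserts that every item has type `typok`, that all iterators agree on
    each item, and that they are all exhausted at the same time.
    """
    while True:
        vnext = [next(it, _missing) for it in iters]
        if any(v is _missing for v in vnext):
            # if one iterator is done, all of them must be done
            assert all(v is _missing for v in vnext)
            return
        for v in vnext:
            assert type(v) is typok
            assert v == vnext[0]
        yield vnext[0]


text = "миру мир"
chars = list(lockstep(str, iter(text), iter(list(text))))

# Iterators of different length are reported as a failure.
try:
    list(lockstep(str, iter("ab"), iter("a")))
    mismatch_detected = False
except AssertionError:
    mismatch_detected = True
```

Passing the expected type and the iterators as arguments, as the new `XIter(typok, *viter)` does, lets one helper cover both the unicode and the byte expectations.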

--- a/gpython/gpython_test.py
+++ b/gpython/gpython_test.py
@@ -87,6 +87,8 @@ def test_golang_builtins():
    assert u      is golang.u
    assert bstr   is golang.bstr
    assert ustr   is golang.ustr
+    assert biter  is golang.biter
+    assert uiter  is golang.uiter
    assert bbyte  is golang.bbyte
    assert uchr   is golang.uchr