README.rst · a72c1c1a89b85a3d995904fd22f22645b660006d · Kirill Smelkov / pygolang

golang_str: bstr/ustr iteration · a72c1c1a

Kirill Smelkov authored Oct 07, 2022

Even though bstr is semantically an array of bytes, while ustr is an array
of unicode characters, iterating _both_ of them yields unicode characters.
This is in line with the Go approach described in "Strings, bytes, runes
and characters in Go"[1] and allows both ustr _and_ bstr to be used
as strings in the unicode world.
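
To make the difference concrete, here is a minimal sketch that uses only the
Python 3 standard library. It is not how pygolang implements bstr: the helper
name iter_runes is hypothetical and only mimics the "iterate a bytestring as
unicode characters" semantics for valid UTF-8 input. How invalid UTF-8 is
handled is outside the scope of this sketch.

```python
def iter_runes(data):
    """Yield unicode characters from UTF-8 encoded bytes.

    Hypothetical helper for illustration only; assumes `data` is valid UTF-8.
    """
    for ch in data.decode("utf-8"):
        yield ch


raw = u"мир".encode("utf-8")        # 3 characters stored as 6 bytes

as_bytes = list(raw)                # plain py3 bytes iteration: integers
as_runes = list(iter_runes(raw))    # Go-like iteration: unicode characters

print(len(as_bytes))                # 6
print(as_runes)                     # ['м', 'и', 'р']
```

Plain bytes iteration walks the six encoded bytes, while the Go-like iteration
walks the three characters those bytes encode.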

Even though this diverges (just a bit) from str/py2 behaviour, and
diverges more from bytes/py3 behaviour, I have not hit any problem in
practice due to this divergence. In other words, the semantics of
bytestrings used in Go - to iterate them as unicode characters - is
sound. For reference, it is the authors of Go who originally invented
UTF-8 - see [2] for details.

See also [3] for our discussion with Jérome on this topic.

[1] https://blog.golang.org/strings
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
[3] nexedi/zodbtools!13 (comment 81646)
