Kirill Smelkov authored
Even though bstr is semantically an array of bytes, while ustr is an array of unicode characters, iterating _both_ of them yields unicode characters. This is in line with the Go approach described in "Strings, bytes, runes and characters in Go"[1] and allows both ustr _and_ bstr to be used as strings in the unicode world.

Even though this diverges (just a bit) from str/py2 behaviour, and diverges more from bytes/py3 behaviour, I have not hit any problem in practice due to this divergence. In other words, the semantics of bytestrings used in Go - to iterate them as unicode characters - is sound. For reference, it is the authors of Go who originally invented UTF-8 - see [2] for details.

See also [3] for our discussion with Jérome on this topic.

[1] https://blog.golang.org/strings
[2] https://www.cl.cam.ac.uk/~mgk25/ucs/UTF-8-Plan9-paper.pdf
[3] nexedi/zodbtools!13 (comment 81646)
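To illustrate the difference in iteration semantics, here is a minimal sketch in plain Python 3 that uses only the standard library. The helper `iter_chars` is a hypothetical name introduced for illustration; it is not the bstr implementation from this change. Replacing invalid bytes with U+FFFD is likewise an assumption of the sketch.

```python
def iter_chars(data: bytes):
    """Yield unicode characters decoded from UTF-8 encoded bytes.

    Hypothetical helper for illustration only. Invalid UTF-8 is
    replaced with U+FFFD, which is an assumption of this sketch.
    """
    for ch in data.decode("utf-8", errors="replace"):
        yield ch


data = "мир".encode("utf-8")

# Plain py3 bytes iterate as integers, one per byte.
as_bytes = list(data)

# Iterating as unicode characters yields one item per character.
as_chars = list(iter_chars(data))

print(len(as_bytes), as_bytes)
print(len(as_chars), as_chars)
```

The byte-wise view has six items for this three-character string, because each of these Cyrillic letters takes two bytes in UTF-8, while the character-wise view has three.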
a72c1c1a