Kirill Smelkov authored
On macOS and Windows, Python 2 is built with --enable-unicode=ucs2, which
makes it use UTF-16 as the internal encoding of unicode strings. As a result,
a character at or above U+10000 is stored as a surrogate pair, i.e. as _2_
unicode code units. For example:

    >>> import sys
    >>> sys.maxunicode
    65535                   <-- NOTE indicates UCS2 build
    >>> s = u'\U00012345'
    >>> s
    u'\U00012345'
    >>> s.encode('utf-8')
    '\xf0\x92\x8d\x85'
    >>> len(s)
    2                       <-- NOTE _not_ 1
    >>> s[0]
    u'\ud808'
    >>> s[1]
    u'\udf45'

This leads to e.g. b tests failing for

    # tbytes                 tunicode
    (b"\xf0\x90\x8c\xbc",    u'\U0001033c'),   # Valid 4 Octet Sequence '𐌼'

    >       assert b(tunicode) == tbytes
    E       AssertionError: assert '\xed\xa0\x80\xed\xbc\xbc' == '\xf0\x90\x8c\xbc'
    E         - \xed\xa0\x80\xed\xbc\xbc
    E         + \xf0\x90\x8c\xbc

because on a UCS2 Python build u'\U0001033c' is represented as 2 code units,
and encoding them one at a time yields two 3-byte sequences instead of one
4-byte sequence:

    >>> s = u'\U0001033c'
    >>> len(s)
    2
    >>> s[0]
    u'\ud800'
    >>> s[1]
    u'\udf3c'
    >>> s[0].encode('utf-8')
    '\xed\xa0\x80'
    >>> s[1].encode('utf-8')
    '\xed\xbc\xbc'

-> Fix it by detecting a UCS2 build and working around it by manually
combining such surrogate pairs into a single code point before encoding.

A reference on the subject:
https://matthew-brett.github.io/pydagogue/python_unicode.html#utf-16-ucs2-builds-of-python-and-32-bit-unicode-code-points
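As an illustration of what "manually combining surrogate pairs" means, here is a minimal sketch that works on plain integers, so it runs unchanged on Python 2 and Python 3. It is not the code from this patch: the names `combine_surrogates`, `iter_code_points` and `utf8_encode_4byte` are hypothetical, and the sketch only covers the 4-byte UTF-8 case discussed above.

```python
def combine_surrogates(hi, lo):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a surrogate pair: %#x %#x" % (hi, lo))
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def iter_code_points(units):
    """Yield code points from a sequence of UTF-16 code units (ints).

    Well-formed surrogate pairs are merged; everything else, including a
    lone surrogate, is passed through unchanged.
    """
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            yield combine_surrogates(u, units[i + 1])
            i += 2
        else:
            yield u
            i += 1


def utf8_encode_4byte(cp):
    """UTF-8 encode a code point in the range U+10000..U+10FFFF."""
    if not (0x10000 <= cp <= 0x10FFFF):
        raise ValueError("code point outside 4-byte range: %#x" % cp)
    return bytes(bytearray([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ]))


# The two examples from the message above, written as code units.
cp1 = combine_surrogates(0xD808, 0xDF45)   # u'\ud808' u'\udf45'
cp2 = combine_surrogates(0xD800, 0xDF3C)   # u'\ud800' u'\udf3c'
encoded = utf8_encode_4byte(cp2)
```

With the pair merged first, the encoder produces the single 4-byte sequence the test expects rather than two 3-byte sequences. A UCS2 build itself can be detected with `sys.maxunicode == 0xFFFF`, as the first session above shows.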
0561926a