Commit 50b3808c authored by Kirill Smelkov

Uniform UTF8-based approach to strings

Context: together with Jérome we've been struggling with porting Zodbtools to
Python 3 for several years. Despite several incremental attempts[1,2,3]
we are not there yet, with the main difficulty being the backward-compatibility
breakage that Python 3 did for bytes and unicode. This spring, after trying
once again to finish this porting and failing to reach a satisfactory
result, I finally decided to do something about the root of the problem:
the level of strings - where backward compatibility was broken - with
the idea to fix everything once and for all.

In 2018 in "Python 3 Losses: Nexedi Perspective"[4] and associated "cost
overview"[5] Jean-Paul highlighted the problem of strings backward
compatibility breakage, that Python 3 did, as the major one.

In 2019 we had some conversations with Jérome about this topic as well[6,7].

In 2020 I started to approach it with `b` and `u`, which provide
always-working conversion between bytes and unicode[8], and via limited
usage of custom bytes- and unicode-like types that are interoperable with
both bytes and unicode simultaneously[9].

Today, with this work, I'm finally exposing those types for general usage, so
that the bytes/unicode problem can be handled automatically. An overview of
the functionality is provided below:

---- 8< ----

Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings
with the idea to make working with byte- and unicode-strings easy and
transparently interoperable:

- `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` (*).
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.

The conversion, in both encoding and decoding, never fails and never loses
information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
even if the bytes data is not valid UTF-8.
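
For example, this round-trip property can be checked directly with `b` and `u`
(a quick illustrative sketch):

```py
   from golang import b, u

   x = b'\xe2\x28\xa1'           # not valid UTF-8
   assert b(u(x)) == x           # decode to ustr and back is identity
   assert u(b(u'мир')) == u'мир' # encode to bstr and back is identity
```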

Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.

Semantically `bstr` is an array of bytes, while `ustr` is an array of
unicode characters. Accessing their elements by `[index]` and iterating them
yield a byte and a unicode character correspondingly (+). However it is
possible to yield unicode characters when iterating `bstr` via `uiter`, and
to yield byte characters when iterating `ustr` via `biter`. In practice
`bstr` + `uiter` is enough 99% of the time, and `ustr` only needs to be used
for random access to string characters.
See [Strings, bytes, runes and characters in Go](https://blog.golang.org/strings) for an overview of this approach.
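
For instance, with these semantics, length and indexing work on bytes for
`bstr` and on characters for `ustr` (a quick illustrative sketch):

```py
   s = b('мир')     # 3 Cyrillic characters, 6 bytes in UTF-8
   len(s)           # 6 - bstr is an array of bytes
   len(u(s))        # 3 - ustr is an array of unicode characters
   s[0]             # 1-byte bstr;  ord(s[0]) gives the byte value 0xd0
   u(s)[0]          # 1-character ustr u('м')
```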

Operations in between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
operations in between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
to `ustr`.  When the coercion happens, `bytes` and `bytearray`, similarly to
`bstr`, are also treated as UTF8-encoded strings.
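
The coercion rules can be illustrated as follows (a sketch; results are shown
in the b()/u() notation used above):

```py
   b('мир') + u'!'      # -> b('мир!')  bstr op unicode coerces to bstr
   u('мир') + b'!'      # -> u('мир!')  ustr op bytes   coerces to ustr
   b('мир') == u'мир'   # True - comparison coerces as well
```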

`bstr` and `ustr` are meant to be drop-in replacements for the standard
`str`/`unicode` classes. They support all methods of `str`/`unicode` and in
particular their constructors accept arbitrary objects and either convert or
stringify them. For cases when no stringification is desired, and one only
wants to convert `bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object
with `buffer` interface (%), to a Pygolang string, `b` and `u` provide a way
to make sure an object is either `bstr` or `ustr` correspondingly.

Usage example:

```py
   s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
   s += ' мир'          # s is b('привет мир')
   for c in uiter(s):   # c will iterate through
        ...             #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]

   # the following gives b('привет мир труд май')
   b('привет %s %s %s') % (u'мир',                  # raw unicode
                           u'труд'.encode('utf-8'), # raw bytes
                           u('май'))                # ustr

   def f(s):
      s = u(s)          # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
   ...               # (^) the decoding never fails nor loses information.
```
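
To illustrate the difference between the constructors (which stringify) and
`b`/`u` (which only convert), here is a small sketch:

```py
   bstr(123)      # b('123')  - the constructor stringifies arbitrary objects
   b(123)         # TypeError - b only converts string-like objects and buffers
```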

(*) `unicode` on Python2, `str` on Python3.
(+) the ordinal of such a byte or unicode character can be obtained via regular `ord`.
    For completeness `bbyte` and `uchr` are also provided for constructing 1-byte `bstr` and 1-character `ustr` from an ordinal.
(%) data in a buffer, similarly to `bytes` and `bytearray`, is treated as a UTF8-encoded string.
    Notice that only explicit conversion through `b` and `u` accepts objects with buffer interface. Automatic coercion does not.
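
For completeness, the ordinal helpers mentioned in (+) behave as follows (a
quick illustrative sketch):

```py
   from golang import b, bbyte, uchr

   bbyte(0x41)        # b('A') - 1-byte bstr with ordinal 0x41
   uchr(0x43f)        # u('п') - 1-character ustr with unicode ordinal 0x43f
   ord(b('A')[0])     # 65
```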

---- 8< ----

With this, zodbtools, for example, is finally ported to Python 3 easily[10].

One note is that we change `b` and `u` to return `bstr`/`ustr` instead of
`bytes`/`unicode`. This is a change in behaviour, but I hope it won't break
anything. The reason for this is that the now-returned `bstr` and `ustr` are
meant to be drop-in replacements for standard string types, and that there
are not many existing `b` and `u` users. We just need to make sure that the
places that already use `b` and `u` - Zodbtools, Nxdtest[11] and lonet[12] -
continue to work, and they should.
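
One quick way to see why those existing users are expected to keep working is
that `bstr`/`ustr` subclass the corresponding builtin types (a sketch; py3
shown for the unicode case):

```py
   from golang import b, u, bstr

   s = b('мир')
   isinstance(s, bytes)    # True - bstr is based on bytes
   isinstance(u(s), str)   # True on py3 - ustr is based on unicode/str
   type(s) is bstr         # True
```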

@klaus, you once said that you use `b` and `u` somewhere as well. Please do not
hesitate to let me know if this change causes any issues for you, and we will,
hopefully, try to find a solution.

Kirill

/cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya
/reviewed-and-discussed-on nexedi/pygolang!21

[1] nexedi/zodbtools!12
[2] nexedi/zodbtools!13
[3] nexedi/zodbtools!16
[4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
[6] nexedi/zodbtools!8 (comment 73726)
[7] nexedi/zodbtools!13 (comment 81646)
[8] nexedi/pygolang@bcb95cd5
[9] nexedi/pygolang@edc7aaab
[10] nexedi/zodbtools@9861c136
[11] https://lab.nexedi.com/nexedi/nxdtest
[12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py
parents f59a785d 5bf08f8b
...@@ -2,6 +2,9 @@ include COPYING README.rst CHANGELOG.rst tox.ini pyproject.toml trun .lsan-ignor ...@@ -2,6 +2,9 @@ include COPYING README.rst CHANGELOG.rst tox.ini pyproject.toml trun .lsan-ignor
include golang/libgolang.h include golang/libgolang.h
include golang/runtime/libgolang.cpp include golang/runtime/libgolang.cpp
include golang/runtime/libpyxruntime.cpp include golang/runtime/libpyxruntime.cpp
include golang/runtime/platform.h
include golang/runtime.h
include golang/runtime.cpp
include golang/pyx/runtime.h include golang/pyx/runtime.h
include golang/pyx/testprog/golang_dso_user/dsouser/dso.h include golang/pyx/testprog/golang_dso_user/dsouser/dso.h
include golang/pyx/testprog/golang_dso_user/dsouser/dso.cpp include golang/pyx/testprog/golang_dso_user/dsouser/dso.cpp
......
...@@ -10,7 +10,7 @@ Package `golang` provides Go-like features for Python: ...@@ -10,7 +10,7 @@ Package `golang` provides Go-like features for Python:
- `func` allows to define methods separate from class. - `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow. - `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining. - `error` and package `errors` provide error chaining.
- `b` and `u` provide way to make sure an object is either bytes or unicode. - `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace. - `gimport` allows to import python modules by full path in a Go workspace.
Package `golang.pyx` provides__ similar features for Cython/nogil. Package `golang.pyx` provides__ similar features for Cython/nogil.
...@@ -229,19 +229,64 @@ __ https://www.python.org/dev/peps/pep-3134/ ...@@ -229,19 +229,64 @@ __ https://www.python.org/dev/peps/pep-3134/
Strings Strings
------- -------
`b` and `u` provide way to make sure an object is either bytes or unicode. Pygolang, similarly to Go, provides uniform UTF8-based approach to strings with
`b(obj)` converts str/unicode/bytes obj to UTF-8 encoded bytestring, while the idea to make working with byte- and unicode- strings easy and transparently
`u(obj)` converts str/unicode/bytes obj to unicode string. For example:: interoperable:
b("привет мир") # -> gives bytes corresponding to UTF-8 encoding of "привет мир". - `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` [*]_.
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.
def f(s): The conversion, in both encoding and decoding, never fails and never loses
s = u(s) # make sure s is unicode, decoding as UTF-8(*) if it was bytes. information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
... # (*) but see below about lack of decode errors. even if bytes data is not valid UTF-8.
Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.
Semantically `bstr` is array of bytes, while `ustr` is array of
unicode-characters. Accessing their elements by `[index]` and iterating them yield byte and
unicode character correspondingly [*]_. However it is possible to yield unicode
character when iterating `bstr` via `uiter`, and to yield byte character when
iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of
the time, and `ustr` only needs to be used for random access to string
characters. See `Strings, bytes, runes and characters in Go`__ for overview of
this approach.
__ https://blog.golang.org/strings
Operations in between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
operations in between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
to `ustr`. When the coercion happens, `bytes` and `bytearray`, similarly to
`bstr`, are also treated as UTF8-encoded strings.
The conversion in both encoding and decoding never fails and never loses `bstr` and `ustr` are meant to be drop-in replacements for standard
information: `b(u(·))` and `u(b(·))` are always identity for bytes and unicode `str`/`unicode` classes. They support all methods of `str`/`unicode` and in
correspondingly, even if bytes input is not valid UTF-8. particular their constructors accept arbitrary objects and either convert or stringify them. For
cases when no stringification is desired, and one only wants to convert
`bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object with `buffer`
interface [*]_, to Pygolang string, `b` and `u` provide way to make sure an
object is either `bstr` or `ustr` correspondingly.
Usage example::
s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.
s += ' мир' # s is b('привет мир')
for c in uiter(s): # c will iterate through
... # [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
# the following gives b('привет мир труд май')
b('привет %s %s %s') % (u'мир', # raw unicode
u'труд'.encode('utf-8'), # raw bytes
u('май')) # ustr
def f(s):
s = u(s) # make sure s is ustr, decoding as UTF-8(*) if it was bstr, bytes, bytearray or buffer.
... # (*) the decoding never fails nor loses information.
.. [*] `unicode` on Python2, `str` on Python3.
.. [*] | ordinal of such byte and unicode character can be obtained via regular `ord`.
| For completeness `bbyte` and `uchr` are also provided for constructing 1-byte `bstr` and 1-character `ustr` from ordinal.
.. [*] | data in buffer, similarly to `bytes` and `bytearray`, is treated as UTF8-encoded string.
| Notice that only explicit conversion through `b` and `u` accept objects with buffer interface. Automatic coercion does not.
Import Import
......
...@@ -9,6 +9,7 @@ ...@@ -9,6 +9,7 @@
/_io.cpp /_io.cpp
/_os.cpp /_os.cpp
/_os_test.cpp /_os_test.cpp
/_strconv.cpp
/_strings_test.cpp /_strings_test.cpp
/_sync.cpp /_sync.cpp
/_sync_test.cpp /_sync_test.cpp
......
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright (C) 2018-2024 Nexedi SA and Contributors. # Copyright (C) 2018-2025 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com> # Kirill Smelkov <kirr@nexedi.com>
# #
# This program is free software: you can Use, Study, Modify and Redistribute # This program is free software: you can Use, Study, Modify and Redistribute
...@@ -24,7 +24,7 @@ ...@@ -24,7 +24,7 @@
- `func` allows to define methods separate from class. - `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow. - `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining. - `error` and package `errors` provide error chaining.
- `b` and `u` provide way to make sure an object is either bytes or unicode. - `b`, `u`, `bstr`/`ustr` and `biter`/`uiter` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace. - `gimport` allows to import python modules by full path in a Go workspace.
See README for thorough overview. See README for thorough overview.
...@@ -36,7 +36,8 @@ from __future__ import print_function, absolute_import ...@@ -36,7 +36,8 @@ from __future__ import print_function, absolute_import
__version__ = "0.1" __version__ = "0.1"
__all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic', __all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
'recover', 'func', 'error', 'b', 'u', 'gimport'] 'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'biter', 'uiter', 'bbyte', 'uchr',
'gimport']
import setuptools_dso import setuptools_dso
setuptools_dso.dylink_prepare_dso('golang.runtime.libgolang') setuptools_dso.dylink_prepare_dso('golang.runtime.libgolang')
...@@ -369,12 +370,11 @@ from ._golang import \ ...@@ -369,12 +370,11 @@ from ._golang import \
pypanic as panic, \ pypanic as panic, \
pyerror as error, \ pyerror as error, \
pyb as b, \ pyb as b, \
pyu as u pybstr as bstr, \
pybbyte as bbyte, \
# import golang.strconv into _golang from here to workaround cyclic golang ↔ strconv dependency pyu as u, \
def _(): pyustr as ustr, \
from . import _golang pyuchr as uchr, \
from . import strconv pybiter as biter, \
_golang.pystrconv = strconv pyuiter as uiter, \
_() _butf8b
del _
# cython: language_level=2 # cython: language_level=2
# Copyright (C) 2019-2022 Nexedi SA and Contributors. # Copyright (C) 2019-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com> # Kirill Smelkov <kirr@nexedi.com>
# #
# This program is free software: you can Use, Study, Modify and Redistribute # This program is free software: you can Use, Study, Modify and Redistribute
...@@ -43,6 +43,7 @@ In addition to Cython/nogil API, golang.pyx provides runtime for golang.py: ...@@ -43,6 +43,7 @@ In addition to Cython/nogil API, golang.pyx provides runtime for golang.py:
- Python-level channels are represented by pychan + pyselect. - Python-level channels are represented by pychan + pyselect.
- Python-level error is represented by pyerror. - Python-level error is represented by pyerror.
- Python-level panic is represented by pypanic. - Python-level panic is represented by pypanic.
- Python-level strings are represented by pybstr/pyustr and pyb/pyu.
""" """
...@@ -64,6 +65,9 @@ cdef extern from *: ...@@ -64,6 +65,9 @@ cdef extern from *:
# on the edge of Python/nogil world. # on the edge of Python/nogil world.
from libcpp.string cimport string # golang::string = std::string from libcpp.string cimport string # golang::string = std::string
cdef extern from "golang/libgolang.h" namespace "golang" nogil: cdef extern from "golang/libgolang.h" namespace "golang" nogil:
ctypedef unsigned char byte
ctypedef signed int rune # = int32
void panic(const char *) void panic(const char *)
const char *recover() const char *recover()
...@@ -265,4 +269,11 @@ cdef class pyerror(Exception): ...@@ -265,4 +269,11 @@ cdef class pyerror(Exception):
cdef object from_error (error err) # -> pyerror | None cdef object from_error (error err) # -> pyerror | None
# strings
cpdef pyb(s) # -> bstr
cpdef pyu(s) # -> ustr
cdef __pystr(object obj) cdef __pystr(object obj)
cdef (rune, int) _utf8_decode_rune(const byte[::1] s)
cdef unicode _xunichr(rune i)
...@@ -3,7 +3,7 @@ ...@@ -3,7 +3,7 @@
# cython: binding=False # cython: binding=False
# cython: c_string_type=str, c_string_encoding=utf8 # cython: c_string_type=str, c_string_encoding=utf8
# distutils: language = c++ # distutils: language = c++
# distutils: depends = libgolang.h os/signal.h _golang_str.pyx # distutils: depends = libgolang.h os/signal.h unicode/utf8.h _golang_str.pyx _golang_str_pickle.pyx
# #
# Copyright (C) 2018-2024 Nexedi SA and Contributors. # Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com> # Kirill Smelkov <kirr@nexedi.com>
......
# -*- coding: utf-8 -*- # -*- coding: utf-8 -*-
# Copyright (C) 2018-2023 Nexedi SA and Contributors. # Copyright (C) 2018-2025 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com> # Kirill Smelkov <kirr@nexedi.com>
# #
# This program is free software: you can Use, Study, Modify and Redistribute # This program is free software: you can Use, Study, Modify and Redistribute
...@@ -22,143 +22,1189 @@ ...@@ -22,143 +22,1189 @@
It is included from _golang.pyx . It is included from _golang.pyx .
""" """
from golang.unicode cimport utf8
from cpython cimport PyUnicode_AsUnicode, PyUnicode_GetSize, PyUnicode_FromUnicode
from cpython cimport PyUnicode_DecodeUTF8 from cpython cimport PyUnicode_DecodeUTF8
from cpython cimport PyTypeObject, Py_TYPE, reprfunc, richcmpfunc, binaryfunc
from cpython cimport Py_EQ, Py_NE, Py_LT, Py_GT, Py_LE, Py_GE
from cpython.iterobject cimport PySeqIter_New
from cpython cimport PyThreadState_GetDict, PyDict_SetItem
from cpython cimport PyObject_CheckBuffer
cdef extern from "Python.h":
PyTypeObject PyBytes_Type
ctypedef struct PyBytesObject:
pass
from libc.stdint cimport uint8_t cdef extern from "Python.h":
PyTypeObject PyUnicode_Type
ctypedef struct PyUnicodeObject:
pass
pystrconv = None # = golang.strconv imported at runtime (see __init__.py) cdef extern from "Python.h":
"""
#if PY_MAJOR_VERSION < 3
// on py2, PyDict_GetItemWithError is called _PyDict_GetItemWithError
// NOTE Cython3 provides PyDict_GetItemWithError out of the box
# define PyDict_GetItemWithError _PyDict_GetItemWithError
#endif
"""
PyObject* PyDict_GetItemWithError(object, object) except? NULL # borrowed ref
def pyb(s): # -> bytes Py_ssize_t PY_SSIZE_T_MAX
"""b converts str/unicode/bytes s to UTF-8 encoded bytestring. void PyType_Modified(PyTypeObject *)
Bytes input is preserved as-is: cdef extern from "Python.h":
ctypedef int (*initproc)(object, PyObject *, PyObject *) except -1
ctypedef struct _XPyTypeObject "PyTypeObject":
PyObject* tp_new(PyTypeObject*, PyObject*, PyObject*) except NULL
initproc tp_init
PySequenceMethods *tp_as_sequence
b(bytes_input) == bytes_input ctypedef struct PySequenceMethods:
binaryfunc sq_concat
binaryfunc sq_inplace_concat
object (*sq_slice) (object, Py_ssize_t, Py_ssize_t) # present only on py2
Unicode input is UTF-8 encoded. The encoding always succeeds.
b is reverse operation to u - the following invariant is always true:
b(u(bytes_input)) == bytes_input from cython cimport no_gc
TypeError is raised if type(s) is not one of the above. from libc.stdio cimport FILE
See also: u. from golang cimport strconv
""" import codecs as pycodecs
if isinstance(s, bytes): # py2: str py3: bytes import string as pystring
pass import types as pytypes
elif isinstance(s, unicode): # py2: unicode py3: str import functools as pyfunctools
s = _utf8_encode_surrogateescape(s) import re as pyre
else:
raise TypeError("b: invalid type %s" % type(s))
return s
def pyu(s): # -> unicode # zbytes/zunicode point to original std bytes/unicode types even if they will be patched.
"""u converts str/unicode/bytes s to unicode string. # we use them to invoke original bytes/unicode methods.
cdef object zbytes = <object>(&PyBytes_Type)
cdef object zunicode = <object>(&PyUnicode_Type)
# pybstr/pyustr point to version of bstr/ustr types that is actually in use:
# - when bytes/unicode are not patched -> to _pybstr/_pyustr
# - when bytes/unicode will be patched -> to bytes/unicode to where original
# _pybstr/_pyustr were copied during bytes/unicode patching.
# at runtime the code should use pybstr/pyustr instead of _pybstr/_pyustr.
pybstr = _pybstr # initially point to -> _pybstr/_pyustr
pyustr = _pyustr # TODO -> cdef for speed
cpdef pyb(s): # -> bstr
"""b converts object to bstr.
- For bstr the same object is returned.
- For bytes, bytearray, or object with buffer interface, the data is
preserved as-is and only result type is changed to bstr.
- For ustr/unicode the data is UTF-8 encoded. The encoding always succeeds.
TypeError is raised if type(s) is not one of the above.
b is reverse operation to u - the following invariant is always true:
Unicode input is preserved as-is: b(u(bytes_input)) is bstr with the same data as bytes_input.
u(unicode_input) == unicode_input See also: u, bstr/ustr, biter/uiter.
"""
bs = _pyb(pybstr, s)
if bs is None:
raise TypeError("b: invalid type %s" % type(s))
return bs
Bytes input is UTF-8 decoded. The decoding always succeeds and input cpdef pyu(s): # -> ustr
"""u converts object to ustr.
- For ustr the same object is returned.
- For unicode the data is preserved as-is and only result type is changed to ustr.
- For bstr, bytes, bytearray, or object with buffer interface, the data is UTF-8 decoded.
The decoding always succeeds and input
information is not lost: non-valid UTF-8 bytes are decoded into information is not lost: non-valid UTF-8 bytes are decoded into
surrogate codes ranging from U+DC80 to U+DCFF. surrogate codes ranging from U+DC80 to U+DCFF.
u is reverse operation to b - the following invariant is always true:
u(b(unicode_input)) == unicode_input
TypeError is raised if type(s) is not one of the above. TypeError is raised if type(s) is not one of the above.
See also: b. u is reverse operation to b - the following invariant is always true:
u(b(unicode_input)) is ustr with the same data as unicode_input.
See also: b, bstr/ustr, biter/uiter.
""" """
if isinstance(s, unicode): # py2: unicode py3: str us = _pyu(pyustr, s)
pass if us is None:
elif isinstance(s, bytes): # py2: str py3: bytes
s = _utf8_decode_surrogateescape(s)
else:
raise TypeError("u: invalid type %s" % type(s)) raise TypeError("u: invalid type %s" % type(s))
return us
cdef _pyb(bcls, s): # -> ~bstr | None
if type(s) is bcls:
return s return s
if isinstance(s, bytes):
if type(s) is not bytes:
s = _bdata(s)
elif isinstance(s, unicode):
s = _utf8_encode_surrogateescape(s)
else:
s = _ifbuffer_data(s) # bytearray and buffer
if s is None:
return None
assert type(s) is bytes
# like zbytes.__new__(bcls, s) but call zbytes.tp_new directly
# else tp_new_wrapper complains because pybstr.tp_new != zbytes.tp_new
argv = (s,)
obj = <object>(<_XPyTypeObject*>zbytes).tp_new(<PyTypeObject*>bcls, <PyObject*>argv, NULL)
Py_DECREF(obj)
return obj
cdef _pyu(ucls, s): # -> ~ustr | None
if type(s) is ucls:
return s
# __pystr converts obj to str of current python: if isinstance(s, unicode):
if type(s) is not unicode:
s = _udata(s)
else:
_ = _ifbuffer_data(s) # bytearray and buffer
if _ is not None:
s = _
if isinstance(s, bytes):
s = _utf8_decode_surrogateescape(s)
else:
return None
assert type(s) is unicode
# like zunicode .__new__(bcls, s) but call zunicode.tp_new directly
# else tp_new_wrapper complains because pyustr.tp_new != zunicode.tp_new
argv = (s,)
obj = <object>(<_XPyTypeObject*>zunicode).tp_new(<PyTypeObject*>ucls, <PyObject*>argv, NULL)
Py_DECREF(obj)
return obj
# _ifbuffer_data returns contained data if obj provides buffer interface.
cdef _ifbuffer_data(obj): # -> bytes|None
if PyObject_CheckBuffer(obj):
if PY_MAJOR_VERSION >= 3:
return bytes(obj)
else:
# py2: bytes(memoryview) returns '<memory at ...>'
return bytes(bytearray(obj))
elif _XPyObject_CheckOldBuffer(obj): # old-style buffer, py2-only
return bytes(_buffer_py2(obj))
else:
return None
# _pyb_coerce coerces x from `b op x` to be used in operation with pyb.
cdef _pyb_coerce(x): # -> bstr|bytes
if isinstance(x, bytes):
return x
elif isinstance(x, (unicode, bytearray)):
return pyb(x)
else:
raise TypeError("b: coerce: invalid type %s" % type(x))
# _pyu_coerce coerces x from `u op x` to be used in operation with pyu.
cdef _pyu_coerce(x): # -> ustr|unicode
if isinstance(x, unicode):
return x
elif isinstance(x, (bytes, bytearray)):
return pyu(x)
else:
raise TypeError("u: coerce: invalid type %s" % type(x))
# _pybu_rcoerce coerces x from `x op b|u` to either bstr or ustr.
# NOTE bytearray is handled outside of this function.
cdef _pybu_rcoerce(x): # -> bstr|ustr
if isinstance(x, bytes):
return pyb(x)
elif isinstance(x, unicode):
return pyu(x)
else:
raise TypeError('b/u: coerce: invalid type %s' % type(x))
# __pystr converts obj to ~str of current python:
# #
# - to bytes, via b, if running on py2, or # - to ~bytes, via b, if running on py2, or
# - to unicode, via u, if running on py3. # - to ~unicode, via u, if running on py3.
# #
# It is handy to use __pystr when implementing __str__ methods. # It is handy to use __pystr when implementing __str__ methods.
# #
# NOTE __pystr is currently considered to be internal function and should not # NOTE __pystr is currently considered to be internal function and should not
# be used by code outside of pygolang. # be used by code outside of pygolang.
# #
# XXX we should be able to use _pystr, but py3's str verify that it must have # XXX we should be able to use pybstr, but py3's str verify that it must have
# Py_TPFLAGS_UNICODE_SUBCLASS in its type flags. # Py_TPFLAGS_UNICODE_SUBCLASS in its type flags.
cdef __pystr(object obj): cdef __pystr(object obj): # -> ~str
if PY_MAJOR_VERSION >= 3: if PY_MAJOR_VERSION >= 3:
return pyu(obj) return pyu(obj)
else: else:
return pyb(obj) return pyb(obj)
# XXX cannot `cdef class`: github.com/cython/cython/issues/711 def pybbyte(int i): # -> 1-byte bstr
class _pystr(bytes): """bbyte(i) returns 1-byte bstr with ordinal i."""
"""_str is like bytes but can be automatically converted to Python unicode return pyb(bytearray([i]))
string via UTF-8 decoding.
The decoding never fails nor loses information - see u for details. """
""" """uchr(i) returns 1-character ustr with unicode ordinal i."""
return pyu(unichr(i))
# don't allow to set arbitrary attributes.
# won't be needed after switch to -> `cdef class`
__slots__ = ()
@no_gc # note setup.py assist this to compile despite
cdef class _pybstr(bytes): # https://github.com/cython/cython/issues/711
"""bstr is byte-string.
# __bytes__ - no need It is based on bytes and can automatically convert to/from unicode.
def __unicode__(self): return pyu(self) The conversion never fails and never loses information:
bstr → ustr → bstr
is always identity even if bytes data is not valid UTF-8.
Semantically bstr is array of bytes. Accessing its elements by [index] and
iterating it yield byte character. However it is possible to yield unicode
character when iterating bstr via uiter. In practice bstr + uiter is enough
99% of the time, and ustr only needs to be used for random access to string
characters. See https://blog.golang.org/strings for overview of this approach.
Operations in between bstr and ustr/unicode / bytes/bytearray coerce to bstr.
When the coercion happens, bytes and bytearray, similarly to bstr, are also
treated as UTF8-encoded strings.
bstr constructor accepts arbitrary objects and stringify them:
- if encoding and/or errors is specified, the object must provide buffer
interface. The data in the buffer is decoded according to provided
encoding/errors and further encoded via UTF-8 into bstr.
- if the object is bstr/ustr / unicode/bytes/bytearray - it is converted
to bstr. See b for details.
- otherwise bstr will have string representation of the object.
See also: b, ustr/u, biter/uiter.
"""
# XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
# _pybstr.__new__ is hand-made in _pybstr_tp_new which invokes ↓ .____new__() .
@staticmethod
def ____new__(cls, object='', encoding=None, errors=None):
# encoding or errors -> object must expose buffer interface
if not (encoding is None and errors is None):
object = _buffer_decode(object, encoding, errors)
# _bstringify. Note: it handles bstr/ustr / unicode/bytes/bytearray as documented
object = _bstringify(object)
assert isinstance(object, (unicode, bytes)), object
bobj = _pyb(cls, object)
assert bobj is not None
return bobj
# __bytes__ converts string to bytes leaving string domain.
# NOTE __bytes__ and encode are the only operations that leave string domain.
# NOTE __bytes__ is used only by py3 and only for `bytes(obj)` and `b'%s/%b' % obj`.
def __bytes__(self): return _bdata(self) # -> bytes
def __unicode__(self): return pyu(self)
def __str__(self): def __str__(self):
if PY_MAJOR_VERSION >= 3: if PY_MAJOR_VERSION >= 3:
return pyu(self) return pyu(self)
else: else:
return self return pyb(self) # self or pybstr if it was subclass
def __repr__(self):
qself, nonascii_escape = _bpysmartquote_u3b2(self)
bs = _inbstringify_get()
if bs.inbstringify == 0 or bs.inrepr:
if nonascii_escape: # so that e.g. b(u'\x80') is represented as
qself = 'b' + qself # b(b'\xc2\x80'), not as b('\xc2\x80')
return "b(" + qself + ")"
else:
# [b('β')] goes as ['β'] when under _bstringify for %s
return qself
def __reduce_ex__(self, protocol):
return _bstr__reduce_ex__(self, protocol)
def __hash__(self):
# hash of the same unicode and UTF-8 encoded bytes is generally different
# -> we can't make hash(bstr) == both hash(bytes) and hash(unicode) at the same time.
# -> make hash(bstr) == hash(str type of current python) so that bstr
# could be used as keys in dictionary interchangeably with native str type.
if PY_MAJOR_VERSION >= 3:
return hash(pyu(self))
else:
return zbytes.__hash__(self)
# == != < > <= >=
# NOTE all operations must succeed against any type so that bstr could be
# used as dict key and arbitrary three-way comparisons, done by python,
# work correctly. This means that on py2 e.g. `bstr > int` will behave
# exactly as builtin str and won't raise TypeError. On py3 TypeError is
# raised for such operations by python itself when it receives
# NotImplemented from all tried methods.
def __eq__(a, b):
try:
b = _pyb_coerce(b)
except TypeError:
return NotImplemented
return zbytes.__eq__(a, b)
def __ne__(a, b):
try:
b = _pyb_coerce(b)
except TypeError:
return NotImplemented
return zbytes.__ne__(a, b)
def __lt__(a, b):
try:
b = _pyb_coerce(b)
except TypeError:
return NotImplemented
return zbytes.__lt__(a, _pyb_coerce(b))
def __gt__(a, b):
try:
b = _pyb_coerce(b)
except TypeError:
return NotImplemented
return zbytes.__gt__(a, _pyb_coerce(b))
def __le__(a, b):
try:
b = _pyb_coerce(b)
except TypeError:
return NotImplemented
return zbytes.__le__(a, _pyb_coerce(b))
def __ge__(a, b):
try:
b = _pyb_coerce(b)
except TypeError:
return NotImplemented
return zbytes.__ge__(a, _pyb_coerce(b))
# len - no need to override
# [], [:]
def __getitem__(self, idx):
x = zbytes.__getitem__(self, idx)
if type(idx) is slice:
return pyb(x)
else:
# bytes[i] returns 1-character bytestring(py2) or int(py3)
# we always return 1-character bytestring
if PY_MAJOR_VERSION >= 3:
return pybbyte(x)
else:
return pyb(x)
# __iter__
def __iter__(self):
if PY_MAJOR_VERSION >= 3:
return _pybstrIter(zbytes.__iter__(self))
else:
# on python 2 str does not have .__iter__
return PySeqIter_New(self)
# __contains__
def __contains__(self, key):
# NOTE on py3 bytes.__contains__ accepts numbers and buffers. We don't want to
# automatically coerce any of them to bytestrings
return zbytes.__contains__(self, _pyb_coerce(key))
# __add__, __radd__ (no need to override __iadd__)
def __add__(a, b):
# NOTE Cython < 3 does not automatically support __radd__ for cdef class
# https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#arithmetic-special-methods
# see also https://github.com/cython/cython/issues/4750
if type(a) is not pybstr:
assert type(b) is pybstr
return b.__radd__(a)
try:
b = _pyb_coerce(b)
except TypeError:
if not hasattr(b, '__radd__'):
raise # don't let python to handle e.g. bstr + memoryview automatically
return NotImplemented
return pyb(zbytes.__add__(a, b))
def __radd__(b, a):
# a.__add__(b) returned NotImplementedError, e.g. for unicode.__add__(bstr)
# u'' + b() -> u() ; same as u() + b() -> u()
# b'' + b() -> b() ; same as b() + b() -> b()
# barr + b() -> barr
if isinstance(a, bytearray):
# force `bytearray +=` to go via bytearray.sq_inplace_concat - see PyNumber_InPlaceAdd
return NotImplemented
a = _pybu_rcoerce(a)
return a.__add__(b)
# __mul__, __rmul__ (no need to override __imul__)
def __mul__(a, b):
if type(a) is not pybstr:
assert type(b) is pybstr
return b.__rmul__(a)
try:
_ = zbytes.__mul__(a, b)
except TypeError: # TypeError: `b` cannot be interpreted as an integer
return NotImplemented
return pyb(_)
def __rmul__(b, a):
return b.__mul__(a)
# %-formatting
def __mod__(a, b):
return _bprintf(a, b)
def __rmod__(b, a):
# ("..." % x) calls "x.__rmod__()" for string subtypes
# determine output type as in __radd__
if isinstance(a, bytearray):
# on py2 bytearray does not implement %
return NotImplemented # no need to check for py3 - there our __rmod__ is not invoked
a = _pybu_rcoerce(a)
return a.__mod__(b)
# format
def format(self, *args, **kwargs): return pyb(pyu(self).format(*args, **kwargs))
def format_map(self, mapping): return pyb(pyu(self).format_map(mapping))
def __format__(self, format_spec):
# NOTE don't convert to b due to "TypeError: __format__ must return a str, not pybstr"
# we are ok to return ustr even for format(bstr, ...) because in
# practice format builtin is never used and it is only s.format()
# that is used in programs. This way __format__ will be invoked
# only internally.
#
# NOTE we are ok to use ustr.__format__ because the only format code
# supported by bstr/ustr/unicode __format__ is 's', not e.g. 'r'.
return pyu(self).__format__(format_spec)
# encode/decode
#
# Encode encodes unicode representation of the string into bytes, leaving string domain.
# Decode decodes bytes representation of the string into ustr, staying inside string domain.
#
# Both bstr and ustr are accepted by encode and decode treating them as two
# different representations of the same entity.
#
# On encoding, for bstr, the string representation is first converted to
# unicode and encoded to bytes from there. For ustr unicode representation
# of the string is directly encoded.
#
# On decoding, for ustr, the string representation is first converted to
# bytes and decoded to unicode from there. For bstr bytes representation of
# the string is directly decoded.
#
# NOTE __bytes__ and encode are the only operations that leave string domain.
def encode(self, encoding=None, errors=None): # -> bytes
encoding, errors = _encoding_with_defaults(encoding, errors)
if encoding == 'utf-8' and errors == 'surrogateescape':
return _bdata(self)
# on py2 e.g. bytes.encode('string-escape') works on bytes directly
if PY_MAJOR_VERSION < 3:
codec = _pycodecs_lookup_binary(encoding)
if codec is not None:
return codec.encode(self, errors)[0]
return pyu(self).encode(encoding, errors)
def decode(self, encoding=None, errors=None): # -> ustr | bstr on py2 for encodings like string-escape
encoding, errors = _encoding_with_defaults(encoding, errors)
if encoding == 'utf-8' and errors == 'surrogateescape':
x = _utf8_decode_surrogateescape(self)
else:
x = zbytes.decode(self, encoding, errors)
# on py2 e.g. bytes.decode('string-escape') returns bytes
if PY_MAJOR_VERSION < 3 and isinstance(x, bytes):
return pyb(x)
return pyu(x)
# all other string methods
def capitalize(self): return pyb(pyu(self).capitalize())
def casefold(self): return pyb(pyu(self).casefold())
def center(self, width, fillchar=' '): return pyb(pyu(self).center(width, fillchar))
def count(self, sub, start=None, end=None): return zbytes.count(self, _pyb_coerce(sub), start, end)
def endswith(self, suffix, start=None, end=None):
if isinstance(suffix, tuple):
for _ in suffix:
if self.endswith(_pyb_coerce(_), start, end):
return True
return False
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zbytes.endswith(self, _pyb_coerce(suffix), start, end)
def expandtabs(self, tabsize=8): return pyb(pyu(self).expandtabs(tabsize))
# NOTE find/index & friends should return byte-position, not unicode-position
def find(self, sub, start=None, end=None): return zbytes.find(self, _pyb_coerce(sub), start, end)
def index(self, sub, start=None, end=None): return zbytes.index(self, _pyb_coerce(sub), start, end)
def isalnum(self): return pyu(self).isalnum()
def isalpha(self): return pyu(self).isalpha()
# isascii(self) no need to override
def isdecimal(self): return pyu(self).isdecimal()
def isdigit(self): return pyu(self).isdigit()
def isidentifier(self): return pyu(self).isidentifier()
def islower(self): return pyu(self).islower()
def isnumeric(self): return pyu(self).isnumeric()
def isprintable(self): return pyu(self).isprintable()
def isspace(self): return pyu(self).isspace()
def istitle(self): return pyu(self).istitle()
def join(self, iterable): return pyb(zbytes.join(self, (_pyb_coerce(_) for _ in iterable)))
def ljust(self, width, fillchar=' '): return pyb(pyu(self).ljust(width, fillchar))
def lower(self): return pyb(pyu(self).lower())
def lstrip(self, chars=None): return pyb(pyu(self).lstrip(chars))
def partition(self, sep): return tuple(pyb(_) for _ in zbytes.partition(self, _pyb_coerce(sep)))
def removeprefix(self, prefix): return pyb(pyu(self).removeprefix(prefix))
def removesuffix(self, suffix): return pyb(pyu(self).removesuffix(suffix))
def replace(self, old, new, count=-1): return pyb(zbytes.replace(self, _pyb_coerce(old), _pyb_coerce(new), count))
# NOTE rfind/rindex & friends should return byte-position, not unicode-position
def rfind(self, sub, start=None, end=None): return zbytes.rfind(self, _pyb_coerce(sub), start, end)
def rindex(self, sub, start=None, end=None): return zbytes.rindex(self, _pyb_coerce(sub), start, end)
def rjust(self, width, fillchar=' '): return pyb(pyu(self).rjust(width, fillchar))
def rpartition(self, sep): return tuple(pyb(_) for _ in zbytes.rpartition(self, _pyb_coerce(sep)))
def rsplit(self, sep=None, maxsplit=-1):
v = pyu(self).rsplit(sep, maxsplit)
return list([pyb(_) for _ in v])
def rstrip(self, chars=None): return pyb(pyu(self).rstrip(chars))
def split(self, sep=None, maxsplit=-1):
v = pyu(self).split(sep, maxsplit)
return list([pyb(_) for _ in v])
def splitlines(self, keepends=False): return list(pyb(_) for _ in pyu(self).splitlines(keepends))
def startswith(self, prefix, start=None, end=None):
if isinstance(prefix, tuple):
for _ in prefix:
if self.startswith(_pyb_coerce(_), start, end):
return True
return False
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zbytes.startswith(self, _pyb_coerce(prefix), start, end)
def strip(self, chars=None): return pyb(pyu(self).strip(chars))
def swapcase(self): return pyb(pyu(self).swapcase())
def title(self): return pyb(pyu(self).title())
def translate(self, table, delete=None):
# bytes mode (compatibility with str/py2)
if table is None or isinstance(table, zbytes) or delete is not None:
if delete is None: delete = b''
return pyb(zbytes.translate(self, table, delete))
# unicode mode
else:
return pyb(pyu(self).translate(table))
def upper(self): return pyb(pyu(self).upper())
def zfill(self, width): return pyb(pyu(self).zfill(width))
@staticmethod
def maketrans(x=None, y=None, z=None):
return pyustr.maketrans(x, y, z)
# hand-made _pybstr.__new__ (workaround for https://github.com/cython/cython/issues/799)
cdef PyObject* _pybstr_tp_new(PyTypeObject* _cls, PyObject* _argv, PyObject* _kw) except NULL:
argv = ()
if _argv != NULL:
argv = <object>_argv
kw = {}
if _kw != NULL:
kw = <object>_kw
cdef object x = _pybstr.____new__(<object>_cls, *argv, **kw)
Py_INCREF(x)
return <PyObject*>x
(<_XPyTypeObject*>_pybstr).tp_new = &_pybstr_tp_new
# bytes uses "optimized" and custom .tp_basicsize and .tp_itemsize:
# https://github.com/python/cpython/blob/v2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L26-L32
# https://github.com/python/cpython/blob/v2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L3816-L3820
(<PyTypeObject*>_pybstr) .tp_basicsize = (<PyTypeObject*>zbytes).tp_basicsize
(<PyTypeObject*>_pybstr) .tp_itemsize = (<PyTypeObject*>zbytes).tp_itemsize
# make sure _pybstr C layout corresponds to bytes C layout exactly
# we patched cython to allow from-bytes cdef class inheritance and we also set
# .tp_basicsize directly above. All this works ok only if C layouts for _pybstr
# and bytes are completely the same.
assert sizeof(_pybstr) == sizeof(PyBytesObject)
cdef class _pyunicode(unicode):
"""_unicode is like unicode(py2)|str(py3) but can be automatically converted
to bytes via UTF-8 encoding.
The encoding always succeeds - see b for details. @no_gc
cdef class _pyustr(unicode):
"""ustr is unicode-string.
It is based on unicode and can automatically convert to/from bytes.
The conversion never fails and never loses information:
ustr → bstr → ustr
is always identity even if bytes data is not valid UTF-8.
ustr is similar to standard unicode type - iterating and accessing its
elements by [index] yields unicode characters.
ustr complements bstr and is meant to be used only in situations when
random access to string characters is needed. Otherwise bstr + uiter is
more preferable and should be enough 99% of the time.
Operations in between ustr and bstr/bytes/bytearray / unicode coerce to ustr.
When the coercion happens, bytes and bytearray, similarly to bstr, are also
treated as UTF8-encoded strings.
ustr constructor, similarly to the one in bstr, accepts arbitrary objects
and stringify them. Please refer to bstr and u documentation for details.
See also: u, bstr/b, biter/uiter.
""" """
def __bytes__(self): return pyb(self) # XXX due to "cannot `cdef class` with __new__" (https://github.com/cython/cython/issues/799)
# __unicode__ - no need # _pyustr.__new__ is hand-made in _pyustr_tp_new which invokes ↓ .____new__() .
@staticmethod
def ____new__(cls, object='', encoding=None, errors=None):
# encoding or errors -> object must expose buffer interface
if not (encoding is None and errors is None):
object = _buffer_decode(object, encoding, errors)
# _bstringify. Note: it handles bstr/ustr / unicode/bytes/bytearray as documented
object = _bstringify(object)
assert isinstance(object, (unicode, bytes)), object
uobj = _pyu(cls, object)
assert uobj is not None
return uobj
# __bytes__ converts string to bytes leaving string domain.
# see bstr.__bytes__ for more details.
def __bytes__(self): return _bdata(pyb(self)) # -> bytes
def __unicode__(self): return pyu(self) # see __str__
def __str__(self): def __str__(self):
if PY_MAJOR_VERSION >= 3: if PY_MAJOR_VERSION >= 3:
return self return pyu(self) # self or pyustr if it was subclass
else: else:
return pyb(self) return pyb(self)
# initialize .tp_print for _pystr so that this type could be printed. def __repr__(self):
qself, nonascii_escape = _upysmartquote_u3b2(self)
bs = _inbstringify_get()
if bs.inbstringify == 0 or bs.inrepr:
if nonascii_escape:
qself = 'b'+qself # see bstr.__repr__
return "u(" + qself + ")"
else:
# [u('β')] goes as ['β'] when under _bstringify for %s
return qself
def __reduce_ex__(self, protocol):
return _ustr__reduce_ex__(self, protocol)
def __hash__(self):
# see _pybstr.__hash__ for why we stick to hash of current str
if PY_MAJOR_VERSION >= 3:
return zunicode.__hash__(self)
else:
return hash(pyb(self))
# == != < > <= >=
# NOTE all operations must succeed against any type.
# See bstr for details.
def __eq__(a, b):
try:
b = _pyu_coerce(b)
except TypeError:
return NotImplemented
return zunicode.__eq__(a, b)
def __ne__(a, b):
try:
b = _pyu_coerce(b)
except TypeError:
return NotImplemented
return zunicode.__ne__(a, b)
def __lt__(a, b):
try:
b = _pyu_coerce(b)
except TypeError:
return NotImplemented
return zunicode.__lt__(a, _pyu_coerce(b))
def __gt__(a, b):
try:
b = _pyu_coerce(b)
except TypeError:
return NotImplemented
return zunicode.__gt__(a, _pyu_coerce(b))
def __le__(a, b):
try:
b = _pyu_coerce(b)
except TypeError:
return NotImplemented
return zunicode.__le__(a, _pyu_coerce(b))
def __ge__(a, b):
try:
b = _pyu_coerce(b)
except TypeError:
return NotImplemented
return zunicode.__ge__(a, _pyu_coerce(b))
# len - no need to override
# [], [:]
def __getitem__(self, idx):
return pyu(zunicode.__getitem__(self, idx))
# __iter__
def __iter__(self):
if PY_MAJOR_VERSION >= 3:
return _pyustrIter(zunicode.__iter__(self))
else:
# on python 2 unicode does not have .__iter__
return PySeqIter_New(self)
# __contains__
def __contains__(self, key):
return zunicode.__contains__(self, _pyu_coerce(key))
# __add__, __radd__ (no need to override __iadd__)
def __add__(a, b):
# NOTE Cython < 3 does not automatically support __radd__ for cdef class
# https://cython.readthedocs.io/en/latest/src/userguide/migrating_to_cy30.html#arithmetic-special-methods
# see also https://github.com/cython/cython/issues/4750
if type(a) is not pyustr:
assert type(b) is pyustr
return b.__radd__(a)
try:
b = _pyu_coerce(b)
except TypeError:
if not hasattr(b, '__radd__'):
raise # don't let py2 to handle e.g. unicode + buffer automatically
return NotImplemented
return pyu(zunicode.__add__(a, b))
def __radd__(b, a):
# a.__add__(b) returned NotImplementedError, e.g. for unicode.__add__(bstr)
# u'' + u() -> u() ; same as u() + u() -> u()
# b'' + u() -> b() ; same as b() + u() -> b()
# barr + u() -> barr
if isinstance(a, bytearray):
# force `bytearray +=` to go via bytearray.sq_inplace_concat - see PyNumber_InPlaceAdd
# for pyustr this relies on patch to bytearray.sq_inplace_concat to accept ustr as bstr
return NotImplemented
a = _pybu_rcoerce(a)
return a.__add__(b)
# __mul__, __rmul__ (no need to override __imul__)
def __mul__(a, b):
if type(a) is not pyustr:
assert type(b) is pyustr
return b.__rmul__(a)
try:
_ = zunicode.__mul__(a, b)
except TypeError: # TypeError: `b` cannot be interpreted as an integer
return NotImplemented
return pyu(_)
def __rmul__(b, a):
return b.__mul__(a)
# %-formatting
def __mod__(a, b):
return pyu(pyb(a).__mod__(b))
def __rmod__(b, a):
# ("..." % x) calls "x.__rmod__()" for string subtypes
# determine output type as in __radd__
if isinstance(a, bytearray):
return NotImplemented # see bstr.__rmod__
a = _pybu_rcoerce(a)
return a.__mod__(b)
# format
def format(self, *args, **kwargs):
return pyu(_bvformat(self, args, kwargs))
def format_map(self, mapping):
return pyu(_bvformat(self, (), mapping))
def __format__(self, format_spec):
# NOTE not e.g. `_bvformat(_pyu_coerce(format_spec), (self,))` because
# the only format code that string.__format__ should support is
# 's', not e.g. 'r'.
return pyu(zunicode.__format__(self, format_spec))
# encode/decode (see bstr for details)
def encode(self, encoding=None, errors=None): # -> bytes
encoding, errors = _encoding_with_defaults(encoding, errors)
if encoding == 'utf-8' and errors == 'surrogateescape':
return _utf8_encode_surrogateescape(self)
# on py2 e.g. 'string-escape' works on bytes
if PY_MAJOR_VERSION < 3:
codec = _pycodecs_lookup_binary(encoding)
if codec is not None:
return codec.encode(pyb(self), errors)[0]
return zunicode.encode(self, encoding, errors)
def decode(self, encoding=None, errors=None): # -> ustr | bstr for encodings like string-escape
encoding, errors = _encoding_with_defaults(encoding, errors)
if encoding == 'utf-8' and errors == 'surrogateescape':
return pyu(self)
return pyb(self).decode(encoding, errors)
# all other string methods
def capitalize(self): return pyu(zunicode.capitalize(self))
def casefold(self): return pyu(zunicode.casefold(self))
def center(self, width, fillchar=' '): return pyu(zunicode.center(self, width, _pyu_coerce(fillchar)))
def count(self, sub, start=None, end=None):
# cython optimizes unicode.count to directly call PyUnicode_Count -
# - cannot use None for start/stop https://github.com/cython/cython/issues/4737
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.count(self, _pyu_coerce(sub), start, end)
def endswith(self, suffix, start=None, end=None):
if isinstance(suffix, tuple):
for _ in suffix:
if self.endswith(_pyu_coerce(_), start, end):
return True
return False
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.endswith(self, _pyu_coerce(suffix), start, end)
def expandtabs(self, tabsize=8): return pyu(zunicode.expandtabs(self, tabsize))
def find(self, sub, start=None, end=None):
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.find(self, _pyu_coerce(sub), start, end)
def index(self, sub, start=None, end=None):
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.index(self, _pyu_coerce(sub), start, end)
# isalnum(self) no need to override
# isalpha(self) no need to override
# isascii(self) no need to override
# isdecimal(self) no need to override
# isdigit(self) no need to override
# isidentifier(self) no need to override
# islower(self) no need to override
# isnumeric(self) no need to override
# isprintable(self) no need to override
# isspace(self) no need to override
# istitle(self) no need to override
def join(self, iterable): return pyu(zunicode.join(self, (_pyu_coerce(_) for _ in iterable)))
def ljust(self, width, fillchar=' '): return pyu(zunicode.ljust(self, width, _pyu_coerce(fillchar)))
def lower(self): return pyu(zunicode.lower(self))
def lstrip(self, chars=None): return pyu(zunicode.lstrip(self, _xpyu_coerce(chars)))
def partition(self, sep): return tuple(pyu(_) for _ in zunicode.partition(self, _pyu_coerce(sep)))
def removeprefix(self, prefix): return pyu(zunicode.removeprefix(self, _pyu_coerce(prefix)))
def removesuffix(self, suffix): return pyu(zunicode.removesuffix(self, _pyu_coerce(suffix)))
def replace(self, old, new, count=-1): return pyu(zunicode.replace(self, _pyu_coerce(old), _pyu_coerce(new), count))
def rfind(self, sub, start=None, end=None):
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.rfind(self, _pyu_coerce(sub), start, end)
def rindex(self, sub, start=None, end=None):
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.rindex(self, _pyu_coerce(sub), start, end)
def rjust(self, width, fillchar=' '): return pyu(zunicode.rjust(self, width, _pyu_coerce(fillchar)))
def rpartition(self, sep): return tuple(pyu(_) for _ in zunicode.rpartition(self, _pyu_coerce(sep)))
def rsplit(self, sep=None, maxsplit=-1):
v = zunicode.rsplit(self, _xpyu_coerce(sep), maxsplit)
return list([pyu(_) for _ in v])
def rstrip(self, chars=None): return pyu(zunicode.rstrip(self, _xpyu_coerce(chars)))
def split(self, sep=None, maxsplit=-1):
# cython optimizes unicode.split to directly call PyUnicode_Split - cannot use None for sep
# and cannot also use object=NULL https://github.com/cython/cython/issues/4737
if sep is None:
if PY_MAJOR_VERSION >= 3:
v = zunicode.split(self, maxsplit=maxsplit)
else:
# on py2 unicode.split does not accept keyword arguments
v = zunicode.split(self, None, maxsplit)
else:
v = zunicode.split(self, _pyu_coerce(sep), maxsplit)
return list([pyu(_) for _ in v])
def splitlines(self, keepends=False): return list(pyu(_) for _ in zunicode.splitlines(self, keepends))
def startswith(self, prefix, start=None, end=None):
if isinstance(prefix, tuple):
for _ in prefix:
if self.startswith(_pyu_coerce(_), start, end):
return True
return False
if start is None: start = 0
if end is None: end = PY_SSIZE_T_MAX
return zunicode.startswith(self, _pyu_coerce(prefix), start, end)
def strip(self, chars=None): return pyu(zunicode.strip(self, _xpyu_coerce(chars)))
def swapcase(self): return pyu(zunicode.swapcase(self))
def title(self): return pyu(zunicode.title(self))
def translate(self, table):
# unicode.translate does not accept bstr values
return pyu(zunicode.translate(self, _pyustrTranslateTab(table)))
def upper(self): return pyu(zunicode.upper(self))
def zfill(self, width): return pyu(zunicode.zfill(self, width))
@staticmethod
def maketrans(x=None, y=None, z=None):
if PY_MAJOR_VERSION >= 3:
if y is None:
# std maketrans(x) accepts only int|unicode keys
_ = {}
for k,v in x.items():
if not isinstance(k, int):
k = pyu(k)
_[k] = v
return zunicode.maketrans(_)
elif z is None:
return zunicode.maketrans(pyu(x), pyu(y)) # std maketrans does not accept b
else:
return zunicode.maketrans(pyu(x), pyu(y), pyu(z)) # ----//----
# hand-made on py2
t = {}
if y is not None:
x = pyu(x)
y = pyu(y)
if len(x) != len(y):
raise ValueError("len(x) must be == len(y))")
for (xi,yi) in zip(x,y):
t[ord(xi)] = ord(yi)
if z is not None:
z = pyu(z)
for _ in z:
t[ord(_)] = None
else:
if type(x) is not dict:
raise TypeError("sole x must be dict")
for k,v in x.iteritems():
if not isinstance(k, (int,long)):
k = ord(pyu(k))
t[k] = pyu(v)
return t
# hand-made _pyustr.__new__ (workaround for https://github.com/cython/cython/issues/799)
cdef PyObject* _pyustr_tp_new(PyTypeObject* _cls, PyObject* _argv, PyObject* _kw) except NULL:
argv = ()
if _argv != NULL:
argv = <object>_argv
kw = {}
if _kw != NULL:
kw = <object>_kw
cdef object x = _pyustr.____new__(<object>_cls, *argv, **kw)
Py_INCREF(x)
return <PyObject*>x
(<_XPyTypeObject*>_pyustr).tp_new = &_pyustr_tp_new
# similarly to bytes - want same C layout for _pyustr vs unicode
assert sizeof(_pyustr) == sizeof(PyUnicodeObject)
# _pybstrIter wraps bytes iterator to return pybstr for each yielded byte.
cdef class _pybstrIter:
cdef object zbiter
def __init__(self, zbiter):
self.zbiter = zbiter
def __iter__(self):
return self
def __next__(self):
x = next(self.zbiter)
if PY_MAJOR_VERSION >= 3:
return pybbyte(x)
else:
return pyb(x)
# _pyustrIter wraps zunicode iterator to return pyustr for each yielded character.
cdef class _pyustrIter:
cdef object zuiter
def __init__(self, zuiter):
self.zuiter = zuiter
def __iter__(self):
return self
def __next__(self):
x = next(self.zuiter)
return pyu(x)
def pybiter(obj):
"""biter(obj) is like iter(b(obj)) but TODO: iterates object incrementally
without doing full conversion to bstr."""
return iter(pyb(obj)) # TODO iterate obj directly
def pyuiter(obj):
"""uiter(obj) is like iter(u(obj)) but TODO: iterates object incrementally
without doing full conversion to ustr."""
return iter(pyu(obj)) # TODO iterate obj directly
# _pyustrTranslateTab wraps table for .translate to return bstr as unicode
# because unicode.translate does not accept bstr values.
cdef class _pyustrTranslateTab:
cdef object tab
def __init__(self, tab):
self.tab = tab
def __getitem__(self, k):
v = self.tab[k]
if not isinstance(v, int): # either unicode ordinal,
v = _xpyu_coerce(v) # character or None
return v
# _bdata/_udata retrieve raw data from bytes/unicode.
def _bdata(obj): # -> bytes
assert isinstance(obj, bytes)
_ = obj.__getnewargs__()[0] # (`bytes-data`,)
assert type(_) is bytes
return _
"""
bcopy = bytes(memoryview(obj))
assert type(bcopy) is bytes
return bcopy
"""
def _udata(obj): # -> unicode
assert isinstance(obj, unicode)
_ = obj.__getnewargs__()[0] # (`unicode-data`,)
assert type(_) is unicode
return _
"""
cdef Py_UNICODE* u = PyUnicode_AsUnicode(obj)
cdef Py_ssize_t size = PyUnicode_GetSize(obj)
cdef unicode ucopy = PyUnicode_FromUnicode(u, size)
assert type(ucopy) is unicode
return ucopy
"""
# initialize .tp_print for pybstr so that this type could be printed.
# If we don't - printing it will result in `RuntimeError: print recursion`
# because str of this type never reaches real bytes or unicode.
# Do it only on python2, because python3 does not use tp_print at all.
# NOTE pyustr does not need this because on py2 str(pyustr) returns pybstr.
IF PY2:
# Cython does not define tp_print for PyTypeObject - do it ourselves
from libc.stdio cimport FILE
cdef extern from "Python.h":
ctypedef int (*printfunc)(PyObject *, FILE *, int) except -1
ctypedef struct _PyTypeObject_Print "PyTypeObject":
printfunc tp_print
int Py_PRINT_RAW
cdef int _pybstr_tp_print(PyObject *obj, FILE *f, int flags) except -1:
o = <object>obj
if flags & Py_PRINT_RAW:
# emit str of the object instead of repr
# https://docs.python.org/2.7/c-api/object.html#c.PyObject_Print
pass
else:
# emit repr
o = repr(o)
assert isinstance(o, bytes)
o = <bytes>o
o = bytes(buffer(o)) # change tp_type to bytes instead of pybstr
return (<_PyTypeObject_Print*>zbytes) .tp_print(<PyObject*>o, f, Py_PRINT_RAW)
(<_PyTypeObject_Print*>Py_TYPE(_pybstr())) .tp_print = _pybstr_tp_print
# whiteout .sq_slice for pybstr/pyustr inherited from str/unicode.
# This way slice access always goes through our __getitem__ implementation.
# If we don't do this e.g. bstr[:] will be handled by str.__getslice__ instead
# of bstr.__getitem__, and will return str instead of bstr.
if PY2:
(<_XPyTypeObject*>_pybstr) .tp_as_sequence.sq_slice = NULL
(<_XPyTypeObject*>_pyustr) .tp_as_sequence.sq_slice = NULL
# ---- adjust bstr/ustr classes after what cython generated ----
# change names of bstr/ustr to be e.g. "golang.bstr" instead of "golang._golang._bstr"
# this makes sure that unpickling saved bstr does not load via unpatched origin
# class, and is also generally good for saving pickle size and for reducing _golang exposure.
(<PyTypeObject*>pybstr).tp_name = "golang.bstr"
(<PyTypeObject*>pyustr).tp_name = "golang.ustr"
assert pybstr.__module__ == "golang"; assert pybstr.__name__ == "bstr"
assert pyustr.__module__ == "golang"; assert pyustr.__name__ == "ustr"
# for pybstr/pyustr cython generates .tp_dealloc that refer to bytes/unicode types directly.
# override that to refer to zbytes/zunicode to avoid infinite recursion on free
# when builtin bytes and unicode are replaced with bstr/ustr.
(<PyTypeObject*>pybstr).tp_dealloc = (<PyTypeObject*>zbytes) .tp_dealloc
(<PyTypeObject*>pyustr).tp_dealloc = (<PyTypeObject*>zunicode) .tp_dealloc
# remove unsupported bstr/ustr methods. do it outside of `cdef class` to
# workaround https://github.com/cython/cython/issues/4556 (`if ...` during
# `cdef class` is silently handled wrongly)
cdef _bstrustr_remove_unsupported_slots():
vslot = (
'casefold', # py3.3 TODO provide py2 implementation
'isidentifier', # py3 TODO provide fallback implementation
'isprintable', # py3 TODO provide fallback implementation
'removeprefix', # py3.9 TODO provide fallback implementation
'removesuffix', # py3.9 TODO provide fallback implementation
)
for slot in vslot:
if not hasattr(unicode, slot):
_patch_slot(<PyTypeObject*>pybstr, slot, DEL)
try:
_patch_slot(<PyTypeObject*>pyustr, slot, DEL)
except KeyError: # e.g. we do not define ustr.isprintable ourselves
pass
_bstrustr_remove_unsupported_slots()
# ---- quoting ----
# _bpysmartquote_u3b2 quotes bytes/bytearray s the same way python would do for string.
#
# nonascii_escape indicates whether \xNN with NN >= 0x80 is present in the output.
#
# NOTE the return type is str type of current python, so that quoted result
# could be directly used in __repr__ or __str__ implementation.
cdef _bpysmartquote_u3b2(const byte[::1] s): # -> (unicode(py3)|bytes(py2), nonascii_escape)
# smartquotes: choose ' or " as quoting character exactly the same way python does
# https://github.com/python/cpython/blob/v2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L905-L909
cdef byte quote = ord("'")
if (quote in s) and (ord('"') not in s):
quote = ord('"')
cdef bint nonascii_escape
x = strconv._quote(s, quote, &nonascii_escape) # raw bytes
if PY_MAJOR_VERSION < 3:
return x, nonascii_escape
else:
return _utf8_decode_surrogateescape(x), nonascii_escape # raw unicode
# _upysmartquote_u3b2 is similar to _bpysmartquote_u3b2 but accepts unicode argument.
#
# NOTE the return type is str type of current python - see _bpysmartquote_u3b2 for details.
cdef _upysmartquote_u3b2(s): # -> (unicode(py3)|bytes(py2), nonascii_escape)
assert isinstance(s, unicode), s
return _bpysmartquote_u3b2(_utf8_encode_surrogateescape(s))
# qq is substitute for %q, which is missing in python.
@@ -171,40 +1217,824 @@ def pyqq(obj):
# py2: unicode | str
# py3: str | bytes
if not isinstance(obj, (unicode, bytes)):
obj = _bstringify(obj)
return strconv.pyquote(obj)
# ---- _bstringify ----
# _bstringify returns string representation of obj.
# it is similar to unicode(obj), but handles bytes as UTF-8 encoded strings.
cdef _bstringify(object obj): # -> unicode|bytes
if type(obj) in (pybstr, pyustr):
return obj
# indicate to e.g. patched bytes.__repr__ that it is being called from under _bstringify
_bstringify_enter()
try:
if PY_MAJOR_VERSION >= 3:
# NOTE this depends on patches to bytes.{__repr__,__str__} below
return unicode(obj)
else:
# on py2 mimic manually what unicode(·) does on py3
# the reason we do it manually is because if we try just
# unicode(obj), and obj's __str__ returns UTF-8 bytestring, it will
# fail with UnicodeDecodeError. Similarly if we unconditionally do
# str(obj), it will fail if obj's __str__ returns unicode.
#
# NOTE this depends on patches to bytes.{__repr__,__str__} and
# unicode.{__repr__,__str__} below.
if hasattr(obj, '__unicode__'):
return obj.__unicode__()
elif hasattr(obj, '__str__'):
return obj.__str__()
else:
return repr(obj)
finally:
_bstringify_leave()
# _bstringify_repr returns repr of obj.
# it is similar to repr(obj), but handles bytes as UTF-8 encoded strings.
cdef _bstringify_repr(object obj): # -> unicode|bytes
_bstringify_enter_repr()
try:
return repr(obj)
finally:
_bstringify_leave_repr()
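# Example (illustrative): with the patches explained below in effect this gives e.g.
#
#   _bstringify(b'\xce\xb2')        # -> 'β' text (unicode on py3, UTF-8 bytes on py2)
#   _bstringify_repr(b'\xce\xb2')   # -> "b'β'" on py3
#
# which is what bstr.__mod__ and bstr.format rely on for %s/%r and {}/{!r}.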
# patch bytes.{__repr__,__str__} and (py2) unicode.{__repr__,__str__}, so that both
# bytes and unicode are treated as normal strings when under _bstringify.
#
# Why:
#
# py2: str([ 'β']) -> ['\\xce\\xb2'] (1) x
# py2: str([u'β']) -> [u'\\u03b2'] (2) x
# py3: str([ 'β']) -> ['β'] (3)
# py3: str(['β'.encode()]) -> [b'\\xce\\xb2'] (4) x
#
# for us 3 is ok, while 1,2 and 4 are not. For all 1,2,3,4 we want e.g.
# `bstr(·)` or `b('%s') % ·` to give ['β']. This is fixed by patching __repr__.
#
# regarding patching __str__ - 6 and 8 in the following examples illustrate the
# need to do it:
#
# py2: str( 'β') -> 'β' (5)
# py2: str(u'β') -> UnicodeEncodeError (6) x
# py3: str( 'β') -> 'β' (7)
# py3: str('β'.encode()) -> b'\\xce\\xb2' (8) x
#
# See also overview of %-formatting.
cdef reprfunc _bytes_tp_repr = Py_TYPE(b'').tp_repr
cdef reprfunc _bytes_tp_str = Py_TYPE(b'').tp_str
cdef reprfunc _unicode_tp_repr = Py_TYPE(u'').tp_repr
cdef reprfunc _unicode_tp_str = Py_TYPE(u'').tp_str
cdef object _bytes_tp_xrepr(object s):
bs = _inbstringify_get()
if bs.inbstringify == 0:
return _bytes_tp_repr(s)
s, _ = _bpysmartquote_u3b2(s)
if PY_MAJOR_VERSION >= 3 and bs.inrepr != 0:
s = 'b'+s
return s
cdef object _bytes_tp_xstr(object s):
bs = _inbstringify_get()
if bs.inbstringify == 0:
return _bytes_tp_str(s)
else:
if PY_MAJOR_VERSION >= 3:
return _utf8_decode_surrogateescape(s)
else:
return s
cdef object _unicode2_tp_xrepr(object s):
bs = _inbstringify_get()
if bs.inbstringify == 0:
return _unicode_tp_repr(s)
s, _ = _upysmartquote_u3b2(s)
if PY_MAJOR_VERSION < 3 and bs.inrepr != 0:
s = 'u'+s
return s
cdef object _unicode2_tp_xstr(object s):
bs = _inbstringify_get()
if bs.inbstringify == 0:
return _unicode_tp_str(s)
else:
return s
def _bytes_x__repr__(s): return _bytes_tp_xrepr(s)
def _bytes_x__str__(s): return _bytes_tp_xstr(s)
def _unicode2_x__repr__(s): return _unicode2_tp_xrepr(s)
def _unicode2_x__str__(s): return _unicode2_tp_xstr(s)
def _():
cdef PyTypeObject* t
# NOTE patching bytes and its already-created subclasses that did not override .tp_repr/.tp_str
# NOTE if we don't also patch __dict__ - e.g. x.__repr__() won't go through patched .tp_repr
for pyt in [bytes] + bytes.__subclasses__():
assert isinstance(pyt, type)
t = <PyTypeObject*>pyt
if t.tp_repr == _bytes_tp_repr:
t.tp_repr = _bytes_tp_xrepr
_patch_slot(t, '__repr__', _bytes_x__repr__)
if t.tp_str == _bytes_tp_str:
t.tp_str = _bytes_tp_xstr
_patch_slot(t, '__str__', _bytes_x__str__)
_()
if PY_MAJOR_VERSION < 3:
def _():
cdef PyTypeObject* t
for pyt in [unicode] + unicode.__subclasses__():
assert isinstance(pyt, type)
t = <PyTypeObject*>pyt
if t.tp_repr == _unicode_tp_repr:
t.tp_repr = _unicode2_tp_xrepr
_patch_slot(t, '__repr__', _unicode2_x__repr__)
if t.tp_str == _unicode_tp_str:
t.tp_str = _unicode2_tp_xstr
_patch_slot(t, '__str__', _unicode2_x__str__)
_()
# py2: adjust unicode.tp_richcompare(a,b) to return NotImplemented if b is bstr.
# This way we avoid `UnicodeWarning: Unicode equal comparison failed to convert
# both arguments to Unicode - interpreting them as being unequal`, and that
# further `a == b` returns False even if `b == a` gives True.
#
# NOTE there is no need to do the same for ustr, because ustr inherits from
# unicode and can be always natively converted to unicode by python itself.
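# Example on py2 (illustrative): without this adjustment
#
#   u'β' == b('β')     # -> False + UnicodeWarning
#
# while with it unicode defers to bstr.__eq__ and the comparison gives True,
# matching b('β') == u'β'.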
cdef richcmpfunc _unicode_tp_richcompare = Py_TYPE(u'').tp_richcompare
cdef object _unicode_tp_xrichcompare(object a, object b, int op):
if isinstance(b, pybstr):
return NotImplemented
return _unicode_tp_richcompare(a, b, op)
cdef object _unicode_x__eq__(object a, object b): return _unicode_tp_richcompare(a, b, Py_EQ)
cdef object _unicode_x__ne__(object a, object b): return _unicode_tp_richcompare(a, b, Py_NE)
cdef object _unicode_x__lt__(object a, object b): return _unicode_tp_richcompare(a, b, Py_LT)
cdef object _unicode_x__gt__(object a, object b): return _unicode_tp_richcompare(a, b, Py_GT)
cdef object _unicode_x__le__(object a, object b): return _unicode_tp_richcompare(a, b, Py_LE)
cdef object _unicode_x__ge__(object a, object b): return _unicode_tp_richcompare(a, b, Py_GE)
if PY_MAJOR_VERSION < 3:
def _():
cdef PyTypeObject* t
for pyt in [unicode] + unicode.__subclasses__():
assert isinstance(pyt, type)
t = <PyTypeObject*>pyt
if t.tp_richcompare == _unicode_tp_richcompare:
t.tp_richcompare = _unicode_tp_xrichcompare
_patch_slot(t, "__eq__", _unicode_x__eq__)
_patch_slot(t, "__ne__", _unicode_x__ne__)
_patch_slot(t, "__lt__", _unicode_x__lt__)
_patch_slot(t, "__gt__", _unicode_x__gt__)
_patch_slot(t, "__le__", _unicode_x__le__)
_patch_slot(t, "__ge__", _unicode_x__ge__)
_()
# patch bytearray.{__repr__,__str__} similarly to bytes, so that e.g.
# '%s' % bytearray('β') turns into β instead of bytearray(b'\xce\xb2'), and
# '%s' % [bytearray('β')] turns into ['β'] instead of [bytearray(b'\xce\xb2')].
#
# also patch:
#
# - bytearray.__init__ to accept ustr instead of raising 'TypeError:
# string argument without an encoding' (pybug: bytearray() should respect
# __bytes__ similarly to bytes)
#
# - bytearray.{sq_concat,sq_inplace_concat} to accept ustr instead of raising
# TypeError. (pybug: bytearray + and += should respect __bytes__)
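# Example of the effect (illustrative sketch):
#
#   bytearray(u('мир'))        # -> bytearray(b'\xd0\xbc\xd0\xb8\xd1\x80')
#   bytearray(b'x') + u('y')   # -> bytearray(b'xy')
#
# instead of TypeError raised by stock bytearray.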
cdef reprfunc _bytearray_tp_repr = (<PyTypeObject*>bytearray) .tp_repr
cdef reprfunc _bytearray_tp_str = (<PyTypeObject*>bytearray) .tp_str
cdef initproc _bytearray_tp_init = (<_XPyTypeObject*>bytearray) .tp_init
cdef binaryfunc _bytearray_sq_concat = (<_XPyTypeObject*>bytearray) .tp_as_sequence.sq_concat
cdef binaryfunc _bytearray_sq_iconcat = (<_XPyTypeObject*>bytearray) .tp_as_sequence.sq_inplace_concat
cdef object _bytearray_tp_xrepr(object a):
bs = _inbstringify_get()
if bs.inbstringify == 0:
return _bytearray_tp_repr(a)
s, _ = _bpysmartquote_u3b2(a)
if bs.inrepr != 0:
s = 'bytearray(b' + s + ')'
return s
cdef object _bytearray_tp_xstr(object a):
bs = _inbstringify_get()
if bs.inbstringify == 0:
return _bytearray_tp_str(a)
else:
if PY_MAJOR_VERSION >= 3:
return _utf8_decode_surrogateescape(a)
else:
return _bytearray_data(a)
cdef int _bytearray_tp_xinit(object self, PyObject* args, PyObject* kw) except -1:
if args != NULL and (kw == NULL or (not <object>kw)):
argv = <object>args
if isinstance(argv, tuple) and len(argv) == 1:
arg = argv[0]
if isinstance(arg, pyustr):
argv = (pyb(arg),) # NOTE argv is kept alive till end of function
args = <PyObject*>argv # no need to incref it
return _bytearray_tp_init(self, args, kw)
cdef object _bytearray_sq_xconcat(object a, object b):
if isinstance(b, pyustr):
b = pyb(b)
return _bytearray_sq_concat(a, b)
cdef object _bytearray_sq_xiconcat(object a, object b):
if isinstance(b, pyustr):
b = pyb(b)
return _bytearray_sq_iconcat(a, b)
def _bytearray_x__repr__(a): return _bytearray_tp_xrepr(a)
def _bytearray_x__str__ (a): return _bytearray_tp_xstr(a)
def _bytearray_x__init__(self, *argv, **kw):
# NOTE don't return - just call: __init__ should return None
_bytearray_tp_xinit(self, <PyObject*>argv, <PyObject*>kw)
def _bytearray_x__add__ (a, b): return _bytearray_sq_xconcat(a, b)
def _bytearray_x__iadd__(a, b): return _bytearray_sq_xiconcat(a, b)
def _():
cdef PyTypeObject* t
for pyt in [bytearray] + bytearray.__subclasses__():
assert isinstance(pyt, type)
t = <PyTypeObject*>pyt
if t.tp_repr == _bytearray_tp_repr:
t.tp_repr = _bytearray_tp_xrepr
_patch_slot(t, '__repr__', _bytearray_x__repr__)
if t.tp_str == _bytearray_tp_str:
t.tp_str = _bytearray_tp_xstr
_patch_slot(t, '__str__', _bytearray_x__str__)
t_ = <_XPyTypeObject*>t
if t_.tp_init == _bytearray_tp_init:
t_.tp_init = _bytearray_tp_xinit
_patch_slot(t, '__init__', _bytearray_x__init__)
t_sq = t_.tp_as_sequence
if t_sq.sq_concat == _bytearray_sq_concat:
t_sq.sq_concat = _bytearray_sq_xconcat
_patch_slot(t, '__add__', _bytearray_x__add__)
if t_sq.sq_inplace_concat == _bytearray_sq_iconcat:
t_sq.sq_inplace_concat = _bytearray_sq_xiconcat
_patch_slot(t, '__iadd__', _bytearray_x__iadd__)
_()
# _bytearray_data return raw data in bytearray as bytes.
# XXX `bytearray s` leads to `TypeError: Expected bytearray, got hbytearray`
cdef bytes _bytearray_data(object s):
if PY_MAJOR_VERSION >= 3:
return bytes(s)
else:
# on py2 bytes(s) is str(s) which invokes patched bytearray.__str__
# we want to get raw bytearray data, which is provided by unpatched bytearray.__str__
return _bytearray_tp_str(s)
# _bstringify_enter*/_bstringify_leave*/_inbstringify_get allow _bstringify* to
# indicate to further invoked code whether it has been invoked from under
# _bstringify* or not.
cdef object _inbstringify_key = "golang._inbstringify"
@final
cdef class _InBStringify:
cdef int inbstringify # >0 if we are running under _bstringify/_bstringify_repr
cdef int inrepr # >0 if we are running under _bstringify_repr
def __cinit__(self):
self.inbstringify = 0
self.inrepr = 0
cdef void _bstringify_enter() except*:
bs = _inbstringify_get()
bs.inbstringify += 1
cdef void _bstringify_leave() except*:
bs = _inbstringify_get()
bs.inbstringify -= 1
cdef void _bstringify_enter_repr() except*:
bs = _inbstringify_get()
bs.inbstringify += 1
bs.inrepr += 1
cdef void _bstringify_leave_repr() except*:
bs = _inbstringify_get()
bs.inbstringify -= 1
bs.inrepr -= 1
cdef _InBStringify _inbstringify_get():
cdef PyObject* _ts_dict = PyThreadState_GetDict() # borrowed
if _ts_dict == NULL:
raise RuntimeError("no thread state")
cdef _InBStringify ts_inbstringify
cdef PyObject* _ts_inbstrinfigy = PyDict_GetItemWithError(<object>_ts_dict, _inbstringify_key) # raises on error
if _ts_inbstrinfigy == NULL:
# key not present
ts_inbstringify = _InBStringify()
PyDict_SetItem(<object>_ts_dict, _inbstringify_key, ts_inbstringify)
else:
ts_inbstringify = <_InBStringify>_ts_inbstrinfigy
return ts_inbstringify
# _patch_slot installs func_or_descr into typ's __dict__ as name.
#
# if func_or_descr is descriptor (has __get__), it is installed as is.
# otherwise it is wrapped with "unbound method" descriptor.
#
# if func_or_descr is DEL the slot is removed from typ's __dict__.
cdef DEL = object()
cdef _patch_slot(PyTypeObject* typ, str name, object func_or_descr):
typdict = <dict>(typ.tp_dict)
#print("\npatching %s.%s with %r" % (typ.tp_name, name, func_or_descr))
#print("old: %r" % typdict.get(name))
if hasattr(func_or_descr, '__get__') or func_or_descr is DEL:
descr = func_or_descr
else:
func = func_or_descr
if PY_MAJOR_VERSION < 3:
descr = pytypes.MethodType(func, None, <object>typ)
else:
descr = _UnboundMethod(func)
if descr is DEL:
del typdict[name]
else:
typdict[name] = descr
#print("new: %r" % typdict.get(name))
PyType_Modified(typ)
cdef class _UnboundMethod(object): # they removed unbound methods on py3
cdef object func
def __init__(self, func):
self.func = func
def __get__(self, obj, objtype):
return pyfunctools.partial(self.func, obj)
# ---- % formatting ----
# When formatting string is bstr/ustr we treat bytes in all arguments as
# UTF8-encoded bytestrings. The following approach is used to implement this:
#
# 1. both bstr and ustr format via bytes-based _bprintf.
# 2. we parse the format string and handle every formatting specifier separately:
# 3. for formats besides %s/%r we use bytes.__mod__ directly.
#
# 4. for %s we stringify corresponding argument specially with all, potentially
# internal, bytes instances treated as UTF8-encoded strings:
#
# '%s' % b'\xce\xb2' -> "β"
# '%s' % [b'\xce\xb2'] -> "['β']"
#
# 5. for %r, similarly to %s, we prepare repr of corresponding argument
# specially with all, potentially internal, bytes instances also treated as
# UTF8-encoded strings:
#
# '%r' % b'\xce\xb2' -> "b'β'"
# '%r' % [b'\xce\xb2'] -> "[b'β']"
#
#
# For "2" we implement %-format parsing ourselves. test_strings_mod_and_format
# has good coverage for this phase to make sure we get it right and behaving
# exactly the same way as standard Python does.
#
# For "4" we monkey-patch bytes.__repr__ to repr bytes as strings when called
# from under bstr.__mod__(). See _bstringify for details.
#
# For "5", similarly to "4", we rely on adjustments to bytes.__repr__ .
# See _bstringify_repr for details.
#
# See also overview of patching bytes.{__repr__,__str__} near _bstringify.
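# Example of the resulting behaviour (illustrative sketch; assumes the golang
# module is importable as usual):
#
#   from golang import b
#   b('%s %r') % (b'\xce\xb2', b'\xce\xb2')   # -> b("β b'β'")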
cdef object _missing = object()
cdef object _atidx_re = pyre.compile('.* at index ([0-9]+)$')
cdef _bprintf(const byte[::1] fmt, xarg): # -> pybstr
cdef bytearray out = bytearray()
cdef object argv = None # if xarg is tuple or subclass
cdef object argm = None # if xarg is mapping
# https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Objects/stringobject.c#L4298-L4300
# https://github.com/python/cpython/blob/v3.11.0b1-171-g70aa1b9b912/Objects/unicodeobject.c#L14319-L14320
if _XPyMapping_Check(xarg) and \
(not isinstance(xarg, tuple)) and \
(not isinstance(xarg, (bytes,unicode))):
argm = xarg
if isinstance(xarg, tuple):
argv = xarg
xarg = _missing
#print()
#print('argv:', argv)
#print('argm:', argm)
#print('xarg:', xarg)
cdef int argv_idx = 0
def nextarg():
nonlocal argv_idx, xarg
# NOTE for `'%s %(x)s' % {'x':1}` python gives "{'x': 1} 1"
# -> so we avoid argm check completely here
#if argm is not None:
if 0:
raise ValueError('mixing dict/tuple')
elif argv is not None:
# tuple xarg
if argv_idx < len(argv):
arg = argv[argv_idx]
argv_idx += 1
return arg
elif xarg is not _missing:
# sole xarg
arg = xarg
xarg = _missing
return arg
raise TypeError('not enough arguments for format string')
def badf():
raise ValueError('incomplete format')
# parse format string locating formatting specifiers
# if we see %s/%r - use _bstringify
# else use builtin %-formatting
#
# %[(name)][flags][width|*][.[prec|*]][len](type)
#
# https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
# https://github.com/python/cpython/blob/2.7-0-g8d21aa21f2c/Objects/stringobject.c#L4266-L4765
#
# Rejected alternative: try to format; if we get "TypeError: %b requires a
# bytes-like object ..." retry with that argument converted to bstr.
#
# Rejected because e.g. for `'%(x)s %(x)r' % {'x': obj}` we need to use
# the access number instead of key 'x' to determine which accesses to
# bstringify. We could do that, but unfortunately on Python2 the access
# number is not easily predictable because the string could be upgraded to
# unicode in the midst of being formatted, and so some keys would be
# accessed more than once.
#
# Another reason for rejection: b'%r' and u'%r' handle arguments
# differently - on b %r is aliased to %a.
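# For example the specifier '%(key)-10.3f' is parsed per the grammar above as
# (illustrative):
#
#   name='key'  flags='-'  width=10  prec=3  len=''  type='f'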
cdef int i = 0
cdef int l = len(fmt)
cdef byte c
while i < l:
c = fmt[i]
i += 1
if c != ord('%'):
out.append(c)
continue
fmt_istart = i-1
nameb = _missing
width = _missing
prec = _missing
value = _missing
# `c = fmt_nextchar()` avoiding https://github.com/cython/cython/issues/4798
if i >= l: badf()
c = fmt[i]; i += 1
# (name)
if c == ord('('):
#print('(name)')
if argm is None:
raise TypeError('format requires a mapping')
nparen = 1
nameb = b''
while 1:
if i >= l:
raise ValueError('incomplete format key')
c = fmt[i]; i += 1
if c == ord('('):
nparen += 1
elif c == ord(')'):
nparen -= 1
if i >= l: badf()
c = fmt[i]; i += 1
break
else:
nameb += bchr(c)
# flags
while chr(c) in '#0- +':
#print('flags')
if i >= l: badf()
c = fmt[i]; i += 1
# [width|*]
if c == ord('*'):
#print('*width')
width = nextarg()
if i >= l: badf()
c = fmt[i]; i += 1
else:
while chr(c).isdigit():
#print('width')
if i >= l: badf()
c = fmt[i]; i += 1
# [.prec|*]
if c == ord('.'):
#print('dot')
if i >= l: badf()
c = fmt[i]; i += 1
if c == ord('*'):
#print('.*')
prec = nextarg()
if i >= l: badf()
c = fmt[i]; i += 1
else:
while chr(c).isdigit():
#print('.prec')
if i >= l: badf()
c = fmt[i]; i += 1
# [len]
while chr(c) in 'hlL':
#print('len')
if i >= l: badf()
c = fmt[i]; i += 1
fmt_type = c
#print('fmt_type:', repr(chr(fmt_type)))
if fmt_type == ord('%'):
if i-2 == fmt_istart: # %%
out.append(b'%')
continue
if nameb is not _missing:
xarg = _missing # `'%(x)s %s' % {'x':1}` raises "not enough arguments"
nameu = _utf8_decode_surrogateescape(nameb)
try:
value = argm[nameb]
except KeyError:
# retry with changing key via bytes <-> unicode
# e.g. for `b('%(x)s') % {'x': ...}` builtin bytes.__mod__ will
# extract b'x' as key and raise KeyError: b'x'. We avoid that via
# retrying with second string type for key.
value = argm[nameu]
else:
# NOTE for `'%4%' % ()` python raises "not enough arguments ..."
#if fmt_type != ord('%'):
if 1:
value = nextarg()
if fmt_type == ord('%'):
raise ValueError("unsupported format character '%s' (0x%x) at index %i" % (chr(c), c, i-1))
fmt1 = memoryview(fmt[fmt_istart:i]).tobytes()
#print('fmt_istart:', fmt_istart)
#print('i: ', i)
#print(' ~> __mod__ ', repr(fmt1))
# bytes %r is an alias for %a (ASCII), but we want unicode-like %r
# -> handle it ourselves
if fmt_type == ord('r'):
value = pyb(_bstringify_repr(value))
fmt_type = ord('s')
fmt1 = fmt1[:-1] + b's'
elif fmt_type == ord('s'):
# %s -> feed value through _bstringify
# this also converts e.g. int to bstr, else e.g. on `b'%s' % 123` python
# complains '%b requires a bytes-like object ...'
value = pyb(_bstringify(value))
if nameb is not _missing:
arg = {nameb: value, nameu: value}
else:
t = []
if width is not _missing: t.append(width)
if prec is not _missing: t.append(prec)
if value is not _missing: t.append(value)
t = tuple(t)
arg = t
#print('--> __mod__ ', repr(fmt1), ' % ', repr(arg))
try:
s = zbytes.__mod__(fmt1, arg)
except ValueError as e:
# adjust position in '... at index <idx>' from fmt1 to fmt
if len(e.args) == 1:
a = e.args[0]
m = _atidx_re.match(a)
if m is not None:
a = a[:m.start(1)] + str(i-1)
e.args = (a,)
raise
out.extend(s)
if argm is None:
#print('END')
#print('argv:', argv, 'argv_idx:', argv_idx, 'xarg:', xarg)
if (argv is not None and argv_idx != len(argv)) or (xarg is not _missing):
raise TypeError("not all arguments converted during string formatting")
return pybstr(out)
# ---- .format formatting ----
# Handling .format is easier and similar to %-formatting: we detect fields to
# format as strings via a custom string.Formatter (see _BFormatter), and
# further treat objects to stringify similarly to how %-formatting does for %s and %r.
#
# We do not need to implement format parsing ourselves, because
# string.Formatter provides it.
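# Example (illustrative sketch): with this in place bstr.format treats bytes
# arguments - also inside containers - as UTF-8 strings, e.g.
#
#   b('{} {!r}').format(b'\xce\xb2', [b'\xce\xb2'])   # -> b("β [b'β']")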
# _bvformat implements .format for pybstr/pyustr.
cdef _bvformat(fmt, args, kw):
return _BFormatter().vformat(fmt, args, kw)
class _BFormatter(pystring.Formatter):
def format_field(self, v, fmtspec):
#print('format_field', repr(v), repr(fmtspec))
# {} on bytes/bytearray -> treat it as bytestring
if type(v) in (bytes, bytearray):
v = pyb(v)
#print(' ~ ', repr(v))
# if the object contains bytes inside, e.g. as in [b'β'] - treat those
# internal bytes also as bytestrings
_bstringify_enter()
try:
#return super(_BFormatter, self).format_field(v, fmtspec)
x = super(_BFormatter, self).format_field(v, fmtspec)
finally:
_bstringify_leave()
#print(' ->', repr(x))
if PY_MAJOR_VERSION < 3: # py2 Formatter._vformat does ''.join(result)
x = pyu(x) # -> we want everything in result to be unicode to avoid
# UnicodeDecodeError
return x
def convert_field(self, v, conv):
#print('convert_field', repr(v), repr(conv))
if conv == 's':
# string.Formatter does str(v) for 's'. we don't want that:
# py3: stringify, and especially treat bytes as bytestring
# py2: stringify, avoiding e.g. UnicodeEncodeError for str(unicode)
x = pyb(_bstringify(v))
elif conv == 'r':
# for bytes {!r} produces ASCII-only, but we want unicode-like !r for e.g. b'β'
# -> handle it ourselves
x = pyb(_bstringify_repr(v))
else:
x = super(_BFormatter, self).convert_field(v, conv)
#print(' ->', repr(x))
return x
# on py2 string.Formatter does not handle field autonumbering
# -> do it ourselves
if PY_MAJOR_VERSION < 3:
_autoidx = 0
_had_digit = False
def get_field(self, field_name, args, kwargs):
if field_name == '':
if self._had_digit:
raise ValueError("mixing explicit and auto numbered fields is forbidden")
field_name = str(self._autoidx)
self._autoidx += 1
elif field_name.isdigit():
self._had_digit = True
if self._autoidx != 0:
raise ValueError("mixing explicit and auto numbered fields is forbidden")
return super(_BFormatter, self).get_field(field_name, args, kwargs)
# ---- misc ----
cdef object _xpyu_coerce(obj):
return _pyu_coerce(obj) if obj is not None else None
# _buffer_py2 returns buffer(obj) on py2 / fails on py3
cdef object _buffer_py2(object obj):
IF PY2: # cannot `if PY_MAJOR_VERSION < 3` because then cython errors
return buffer(obj) # "undeclared name not builtin: buffer"
ELSE:
raise AssertionError("must be called only on py2")
# _buffer_decode decodes buf to unicode according to encoding and errors.
#
# buf must expose buffer interface.
# encoding/errors can be None meaning to use default utf-8/strict.
cdef unicode _buffer_decode(buf, encoding, errors):
if encoding is None: encoding = 'utf-8' # NOTE always UTF-8, not sys.getdefaultencoding
if errors is None: errors = 'strict'
if _XPyObject_CheckOldBuffer(buf):
buf = _buffer_py2(buf)
else:
buf = memoryview(buf)
return bytearray(buf).decode(encoding, errors)
cdef extern from "Python.h":
"""
static int _XPyObject_CheckOldBuffer(PyObject *o) {
#if PY_MAJOR_VERSION >= 3
// no old-style buffers on py3
return 0;
#else
return PyObject_CheckReadBuffer(o);
#endif
}
"""
bint _XPyObject_CheckOldBuffer(object o)
cdef extern from "Python.h":
"""
static int _XPyMapping_Check(PyObject *o) {
#if PY_MAJOR_VERSION >= 3
return PyMapping_Check(o);
#else
// on py2 PyMapping_Check besides checking tp_as_mapping->mp_subscript
// also verifies !tp_as_sequence->sq_slice. We want to avoid that
// because PyString_Format checks only tp_as_mapping->mp_subscript.
return Py_TYPE(o)->tp_as_mapping && Py_TYPE(o)->tp_as_mapping->mp_subscript;
#endif
}
"""
bint _XPyMapping_Check(object o)
# _pycodecs_lookup_binary returns codec corresponding to encoding if the codec works on binary input.
# example of such codecs are string-escape and hex encodings.
cdef _pycodecs_lookup_binary(encoding): # -> codec | None (text) | LookupError (no such encoding)
codec = pycodecs.lookup(encoding)
if not codec._is_text_encoding or \
encoding in ('string-escape',): # string-escape also works on bytes
return codec
return None
# ---- UTF-8 encode/decode ----
# _encoding_with_defaults returns encoding and errors substituted with defaults
# as needed for functions like ustr.encode and bstr.decode .
cdef _encoding_with_defaults(encoding, errors): # -> (encoding, errors)
if encoding is None and errors is None:
encoding = 'utf-8' # NOTE always UTF-8, not sys.getdefaultencoding
errors = 'surrogateescape'
else:
if encoding is None: encoding = 'utf-8'
if errors is None: errors = 'strict'
return (encoding, errors)
# TODO(kirr) adjust UTF-8 encode/decode surrogateescape(*) a bit so that not
# only bytes -> unicode -> bytes is always identity for any bytes (this is
# already true), but also that unicode -> bytes -> unicode is also always true
# for all unicode codepoints.
#
# The latter currently fails for all surrogate codepoints outside of U+DC80..U+DCFF range:
#
# In [1]: x = u'\udc00'
#
# In [2]: x.encode('utf-8')
# UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
#
# In [3]: x.encode('utf-8', 'surrogateescape')
# UnicodeEncodeError: 'utf-8' codec can't encode character '\udc00' in position 0: surrogates not allowed
#
# (*) aka UTF-8b (see http://hyperreal.org/~est/utf-8b/releases/utf-8b-20060413043934/kuhn-utf-8b.html)
#
# Call resulting encoding as UTF-8bk.
#
# TODO(kirr) adjust bstr pickling for protocol < 3 after switching bstr/ustr
# to decode/encode via UTF-8bk instead of UTF-8b.
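# Example of the identity that already holds (illustrative, py3 syntax):
#
#   >>> b'\xd0\xbc\xd0\xb8\xd1\x80\xff'.decode('utf-8', 'surrogateescape')
#   'мир\udcff'
#   >>> 'мир\udcff'.encode('utf-8', 'surrogateescape')
#   b'\xd0\xbc\xd0\xb8\xd1\x80\xff'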
from six import unichr            # py2: unichr   py3: chr
from six import int2byte as bchr  # py2: chr      py3: lambda x: bytes((x,))
cdef bint _ucs2_build = (sys.maxunicode == 0xffff)     #  ucs2
assert _ucs2_build or sys.maxunicode >= 0x0010ffff     #  or ucs4
# _utf8_decode_rune decodes next UTF8-character from byte string s.
#
# _utf8_decode_rune(s) -> (r, size)
cdef (rune, int) _utf8_decode_rune(const byte[::1] s):
if len(s) == 0:
return utf8.RuneError, 0
cdef int l = min(len(s), 4) # max size of an UTF-8 encoded character
while l > 0:
@@ -231,11 +2061,11 @@ cdef (int, int) _utf8_decode_rune(const uint8_t[::1] s):
continue
# invalid UTF-8
return utf8.RuneError, 1
# _utf8_decode_surrogateescape mimics s.decode('utf-8', 'surrogateescape') from py3.
cdef _utf8_decode_surrogateescape(const byte[::1] s): # -> unicode
if PY_MAJOR_VERSION >= 3:
if len(s) == 0:
return u'' # avoid out-of-bounds slice access on &s[0]
@@ -250,7 +2080,7 @@ def _utf8_decode_surrogateescape(const uint8_t[::1] s): # -> unicode
while len(s) > 0:
r, width = _utf8_decode_rune(s)
if r == utf8.RuneError and width == 1:
b = s[0]
assert 0x80 <= b <= 0xff, b
emit(unichr(0xdc00 + b))
@@ -275,10 +2105,10 @@ def _utf8_decode_surrogateescape(const uint8_t[::1] s): # -> unicode
# _utf8_encode_surrogateescape mimics s.encode('utf-8', 'surrogateescape') from py3.
cdef _utf8_encode_surrogateescape(s): # -> bytes
assert isinstance(s, unicode)
if PY_MAJOR_VERSION >= 3:
return zunicode.encode(s, 'UTF-8', 'surrogateescape')
# py2 does not have surrogateescape error handler, and even if we
# provide one, builtin unicode.encode() does not treat
@@ -345,11 +2175,11 @@ else:
# _xunichr returns unicode character for an ordinal i.
#
# it works correctly even on ucs2 python builds, where ordinals >= 0x10000 are
# represented as 2 unicode points.
cdef unicode _xunichr(rune i):
if not _ucs2_build:
return unichr(i)
else:
if i < 0x10000:
return unichr(i)
@@ -357,3 +2187,8 @@ else:
uh = i - 0x10000
return unichr(0xd800 + (uh >> 10)) + \
unichr(0xdc00 + (uh & 0x3ff))
# ---- pickle ----
include '_golang_str_pickle.pyx'
# -*- coding: utf-8 -*-
# Copyright (C) 2023-2025 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""_golang_str_pickle.pyx complements _golang_str.pyx and keeps everything
related to pickling strings.
It is included from _golang_str.pyx .
"""
if PY_MAJOR_VERSION >= 3:
import copyreg as pycopyreg
else:
import copy_reg as pycopyreg
cdef object zbinary # = zodbpickle.binary | None
try:
import zodbpickle
except ImportError:
zbinary = None
else:
zbinary = zodbpickle.binary
# support for pickling bstr/ustr as standalone types.
#
# pickling is organized in such a way that
# - what is saved by py2 can be loaded correctly on both py2/py3, and similarly
# - what is saved by py3 can be loaded correctly on both py2/py3 as well.
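# Example (illustrative sketch; assumes pickle, b and bstr are imported as
# usual): for every supported protocol the following round-trips on both
# py2 and py3:
#
#   s  = b(b'\xd0\xbc\xd0\xb8\xd1\x80\xff')        # bstr with invalid UTF-8 tail
#   s2 = pickle.loads(pickle.dumps(s, protocol))
#   assert type(s2) is bstr  and  s2 == s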
cdef _bstr__reduce_ex__(self, protocol):
# Ideally we want to emit bstr(BYTES), but BYTES is not available for
# protocol < 3. And for protocol < 3 emitting bstr(STRING) is not an
# option because plain py3 raises UnicodeDecodeError on loading arbitrary
# STRING data. However emitting bstr(UNICODE) works universally because
# pickle supports arbitrary unicode - including invalid unicode - out of
# the box and in exactly the same way on both py2 and py3. For the
# reference upstream py3 uses surrogatepass on encode/decode UNICODE data
# to achieve that.
if protocol < 3:
# use UNICODE for data
#
# explicitly mark to unpickle via _butf8b because with the introduction
# of UTF-8bk the way bstr decodes unicode will change, and so if we
# used `bstr UNICODE` for pickling it would result in corrupt data
# being loaded after the switch to UTF-8bk.
#
# TODO pickle via bstr UNICODE REDUCE/NEWOBJ after switch from UTF-8b to UTF-8bk.
udata = _utf8_decode_surrogateescape(self)
if self.__class__ is pybstr:
return (_butf8b, # _butf8b UNICODE REDUCE
(udata,))
else:
return (_butf8b, # _butf8b bstr UNICODE REDUCE
(self.__class__, udata))
else:
# use BYTES for data
bdata = _bdata(self)
if PY_MAJOR_VERSION < 3:
# the only way we can get here on py2 and protocol >= 3 is zodbpickle
# -> similarly to py3 save bdata as BYTES
assert zbinary is not None
bdata = zbinary(bdata)
return (
pycopyreg.__newobj__, # bstr BYTES NEWOBJ
(self.__class__, bdata))
cdef _ustr__reduce_ex__(self, protocol):
# emit ustr(UNICODE).
# TODO after UTF-8bk we might want to switch to emitting ustr(BYTES)
# even if we do this, it should be backward compatible
if protocol < 2:
return (self.__class__, (_udata(self),))# ustr UNICODE REDUCE
else:
return (pycopyreg.__newobj__, # ustr UNICODE NEWOBJ
(self.__class__, _udata(self)))
# `_butf8b [bcls] udata` serves unpickling of bstr pickled with data
# represented via UTF-8b decoded unicode.
def _butf8b(*argv):
cdef object bcls = pybstr
cdef object udata
cdef int l = len(argv)
if l == 1:
udata = argv[0]
elif l == 2:
bcls, udata = argv
else:
raise TypeError("_butf8b() takes 1 or 2 arguments; %d given" % l)
return _pyb(bcls, _utf8_encode_surrogateescape(udata))
_butf8b.__module__ = "golang"
# -*- coding: utf-8 -*-
# cython: language_level=2
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package strconv provides Go-compatible string conversions."""
from golang cimport byte
cpdef pyquote(s)
cdef bytes _quote(const byte[::1] s, char quote, bint* out_nonascii_escape) # -> (quoted, nonascii_escape)
# -*- coding: utf-8 -*-
# cython: language_level=2
# Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""_strconv.pyx implements strconv.pyx - see _strconv.pxd for package overview."""
from __future__ import print_function, absolute_import
import unicodedata, codecs
from golang cimport pyb, byte, rune
from golang cimport _utf8_decode_rune, _xunichr
from golang.unicode cimport utf8
from cpython cimport PyObject, _PyBytes_Resize
cdef extern from "Python.h":
PyObject* PyBytes_FromStringAndSize(char*, Py_ssize_t) except NULL
char* PyBytes_AS_STRING(PyObject*)
void Py_DECREF(PyObject*)
# quote quotes unicode|bytes string into valid "..." bytestring always quoted with ".
cpdef pyquote(s): # -> bstr
cdef bint _
q = _quote(pyb(s), '"', &_)
return pyb(q)
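# Example (illustrative):
#
#   pyquote(b'hello\n"world"')   # -> b('"hello\\n\\"world\\""')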
cdef char[16] hexdigit # = '0123456789abcdef'
for i, c in enumerate('0123456789abcdef'):
hexdigit[i] = ord(c)
# XXX not possible to use `except (NULL, False)`
# (https://stackoverflow.com/a/66335433/9456786)
cdef bytes _quote(const byte[::1] s, char quote, bint* out_nonascii_escape): # -> (quoted, nonascii_escape)
# 2*" + max(4)*each byte (+ 1 for tail \0 implicitly by PyBytesObject)
cdef Py_ssize_t qmaxsize = 1 + 4*len(s) + 1
cdef PyObject* qout = PyBytes_FromStringAndSize(NULL, qmaxsize)
cdef byte* q = <byte*>PyBytes_AS_STRING(qout)
cdef bint nonascii_escape = False
cdef Py_ssize_t i = 0, j
cdef Py_ssize_t isize
cdef int size
cdef rune r
cdef byte c
q[0] = quote; q += 1
while i < len(s):
c = s[i]
# fast path - ASCII only
if c < 0x80:
if c in (ord('\\'), quote):
q[0] = ord('\\')
q[1] = c
q += 2
# printable ASCII
elif 0x20 <= c <= 0x7e:
q[0] = c
q += 1
# non-printable ASCII
elif c == ord('\t'):
q[0] = ord('\\')
q[1] = ord('t')
q += 2
elif c == ord('\n'):
q[0] = ord('\\')
q[1] = ord('n')
q += 2
elif c == ord('\r'):
q[0] = ord('\\')
q[1] = ord('r')
q += 2
# everything else is non-printable
else:
q[0] = ord('\\')
q[1] = ord('x')
q[2] = hexdigit[c >> 4]
q[3] = hexdigit[c & 0xf]
q += 4
i += 1
# slow path - full UTF-8 decoding + unicodedata
else:
r, size = _utf8_decode_rune(s[i:])
isize = i + size
# decode error - just emit raw byte as escaped
if r == utf8.RuneError and size == 1:
nonascii_escape = True
q[0] = ord('\\')
q[1] = ord('x')
q[2] = hexdigit[c >> 4]
q[3] = hexdigit[c & 0xf]
q += 4
# printable utf-8 characters go as is
elif _unicodedata_category(_xunichr(r))[0] in 'LNPS': # letters, numbers, punctuation, symbols
for j in range(i, isize):
q[0] = s[j]
q += 1
# everything else goes in numeric byte escapes
else:
nonascii_escape = True
for j in range(i, isize):
c = s[j]
q[0] = ord('\\')
q[1] = ord('x')
q[2] = hexdigit[c >> 4]
q[3] = hexdigit[c & 0xf]
q += 4
i = isize
q[0] = quote; q += 1
q[0] = 0; # don't q++ at last because size does not include tail \0
cdef Py_ssize_t qsize = (q - <byte*>PyBytes_AS_STRING(qout))
assert qsize <= qmaxsize
_PyBytes_Resize(&qout, qsize)
bqout = <bytes>qout
Py_DECREF(qout)
out_nonascii_escape[0] = nonascii_escape
return bqout
# unquote decodes "-quoted unicode|byte string.
#
# ValueError is raised if there are quoting syntax errors.
def pyunquote(s): # -> bstr
us, tail = pyunquote_next(s)
if len(tail) != 0:
raise ValueError('non-empty tail after closing "')
return us
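# Example (illustrative, inverse of the quote example above):
#
#   pyunquote(b'"hello\\n\\"world\\""')   # -> b('hello\n"world"')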
# unquote_next decodes next "-quoted unicode|byte string.
#
# it returns -> (unquoted(s), tail-after-")
#
# ValueError is raised if there are quoting syntax errors.
def pyunquote_next(s): # -> (bstr, bstr)
us, tail = _unquote_next(pyb(s))
return pyb(us), pyb(tail)
cdef _unquote_next(s):
assert isinstance(s, bytes)
if len(s) == 0 or s[0:0+1] != b'"':
raise ValueError('no starting "')
outv = []
emit= outv.append
s = s[1:]
while 1:
r, width = _utf8_decode_rune(s)
if width == 0:
raise ValueError('no closing "')
if r == ord('"'):
s = s[1:]
break
# regular UTF-8 character
if r != ord('\\'):
emit(s[:width])
s = s[width:]
continue
if len(s) < 2:
raise ValueError('unexpected EOL after \\')
c = s[1:1+1]
# \<c> -> <c> ; c = \ "
if c in b'\\"':
emit(c)
s = s[2:]
continue
# \t \n \r
uc = None
if c == b't': uc = b'\t'
elif c == b'n': uc = b'\n'
elif c == b'r': uc = b'\r'
# accept also \a \b \v \f that Go might produce
# Python also decodes those escapes even though it does not produce them:
# https://github.com/python/cpython/blob/2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L677-L688
elif c == b'a': uc = b'\x07'
elif c == b'b': uc = b'\x08'
elif c == b'v': uc = b'\x0b'
elif c == b'f': uc = b'\x0c'
if uc is not None:
emit(uc)
s = s[2:]
continue
# \x?? hex
if c == b'x': # XXX also handle octals?
if len(s) < 2+2:
raise ValueError('unexpected EOL after \\x')
b = codecs.decode(s[2:2+2], 'hex')
emit(b)
s = s[2+2:]
continue
raise ValueError('invalid escape \\%s' % chr(ord(c[0:0+1])))
return b''.join(outv), s
cdef _unicodedata_category = unicodedata.category
#ifndef _NXD_LIBGOLANG_FMT_H
#define _NXD_LIBGOLANG_FMT_H
// Copyright (C) 2019-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -111,7 +111,7 @@ inline error errorf(const string& format, Argv... argv) {
// `const char *` overloads just to catch format mistakes as
// __attribute__(format) does not work with std::string.
LIBGOLANG_API string sprintf(const char *format, ...)
#ifndef LIBGOLANG_CC_msc
__attribute__ ((format (printf, 1, 2)))
#endif
;
...
# -*- coding: utf-8 -*-
# Copyright (C) 2022-2025 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
from __future__ import print_function, absolute_import
from golang import b, u, bstr, ustr
from golang.golang_str_test import xbytes, unicode
from pytest import raises, fixture
import io, struct
import six
# run all tests on all py/c pickle modules we aim to support
import pickle as stdPickle
if six.PY2:
import cPickle
else:
import _pickle as cPickle
from zodbpickle import slowpickle as zslowPickle
from zodbpickle import fastpickle as zfastPickle
from zodbpickle import pickle as zpickle
from zodbpickle import _pickle as _zpickle
import pickletools as stdpickletools
if six.PY2:
from zodbpickle import pickletools_2 as zpickletools
else:
from zodbpickle import pickletools_3 as zpickletools
# pickle is pytest fixture that yields all variants of pickle module.
@fixture(scope="function", params=[stdPickle, cPickle,
zslowPickle, zfastPickle, zpickle, _zpickle])
def pickle(request):
yield request.param
# pickletools is pytest fixture that yields all variants of pickletools module.
@fixture(scope="function", params=[stdpickletools, zpickletools])
def pickletools(request):
yield request.param
# pickle2tools returns pickletools module that corresponds to module pickle.
def pickle2tools(pickle):
if pickle in (stdPickle, cPickle):
return stdpickletools
else:
return zpickletools
# verify that loading *UNICODE opcodes loads them as unicode/ustr.
# this is standard behaviour but we verify it since we will patch pickle's strings processing.
# also verify save lightly for symmetry.
def test_strings_pickle_loadsave_UNICODE(pickle):
# NOTE builtin pickle behaviour is to save unicode via 'surrogatepass' error handler
# this means that b'мир\xff' -> ustr/unicode -> save will emit *UNICODE with
# b'мир\xed\xb3\xbf' instead of b'мир\xff' as data.
p_uni = b'V\\u043c\\u0438\\u0440\\udcff\n.' # UNICODE 'мир\uDCFF'
p_binu = b'X\x09\x00\x00\x00\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbf.' # BINUNICODE NOTE ...edb3bf not ...ff
p_sbinu = b'\x8c\x09\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbf.' # SHORT_BINUNICODE
p_binu8 = b'\x8d\x09\x00\x00\x00\x00\x00\x00\x00\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbf.' # BINUNICODE8
u_obj = u'мир\uDCFF'; assert type(u_obj) is unicode
# load: check invokes f on all test pickles that pickle should support
def check(f):
f(p_uni)
f(p_binu)
if HIGHEST_PROTOCOL(pickle) >= 4:
f(p_sbinu)
f(p_binu8)
def _(p):
obj = xloads(pickle, p)
assert type(obj) is unicode
assert obj == u_obj
check(_)
# save
def dumps(proto):
return xdumps(pickle, u_obj, proto)
assert dumps(0) == p_uni
assert dumps(1) == p_binu
assert dumps(2) == p_binu
if HIGHEST_PROTOCOL(pickle) >= 3:
assert dumps(3) == p_binu
if HIGHEST_PROTOCOL(pickle) >= 4:
assert dumps(4) == p_sbinu
# verify that bstr/ustr can be pickled/unpickled correctly.
def test_strings_pickle_bstr_ustr(pickle):
bs = b(xbytes('мир')+b'\xff')
us = u(xbytes('май')+b'\xff')
def diss(p): return xdiss(pickle2tools(pickle), p)
def dis(p): print(diss(p))
# assert_pickle verifies that pickling obj results in dumps_ok
# and that unpickling results back in obj.
assert HIGHEST_PROTOCOL(pickle) <= 5
def assert_pickle(obj, proto, dumps_ok):
if proto > HIGHEST_PROTOCOL(pickle):
with raises(ValueError):
xdumps(pickle, obj, proto)
return
p = xdumps(pickle, obj, proto)
assert p == dumps_ok, diss(p)
#dis(p)
obj2 = xloads(pickle, p)
assert type(obj2) is type(obj)
assert obj2 == obj
_ = assert_pickle
_(bs, 0,
b"cgolang\n_butf8b\n(V\\u043c\\u0438\\u0440\\udcff\ntR.") # _butf8b(UNICODE)
_(us, 0,
b'cgolang\nustr\n(V\\u043c\\u0430\\u0439\\udcff\ntR.') # ustr(UNICODE)
_(bs, 1,
b'cgolang\n_butf8b\n(X\x09\x00\x00\x00' # _butf8b(BINUNICODE)
b'\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbftR.')
# NOTE BINUNICODE ...edb3bf not ...ff (see test_strings_pickle_loadsave_UNICODE for details)
_(us, 1,
b'cgolang\nustr\n(X\x09\x00\x00\x00' # ustr(BINUNICODE)
b'\xd0\xbc\xd0\xb0\xd0\xb9\xed\xb3\xbftR.')
_(bs, 2,
b'cgolang\n_butf8b\nX\x09\x00\x00\x00' # _butf8b(BINUNICODE)
b'\xd0\xbc\xd0\xb8\xd1\x80\xed\xb3\xbf\x85R.')
_(us, 2,
b'cgolang\nustr\nX\x09\x00\x00\x00' # ustr(BINUNICODE)
b'\xd0\xbc\xd0\xb0\xd0\xb9\xed\xb3\xbf\x85\x81.')
_(bs, 3,
b'cgolang\nbstr\nC\x07\xd0\xbc\xd0\xb8\xd1\x80\xff\x85\x81.') # bstr(SHORT_BINBYTES)
_(us, 3,
b'cgolang\nustr\nX\x09\x00\x00\x00' # ustr(BINUNICODE)
b'\xd0\xbc\xd0\xb0\xd0\xb9\xed\xb3\xbf\x85\x81.')
for p in (4,5):
_(bs, p,
b'\x8c\x06golang\x8c\x04bstr\x93C\x07' # bstr(SHORT_BINBYTES)
b'\xd0\xbc\xd0\xb8\xd1\x80\xff\x85\x81.')
_(us, p,
b'\x8c\x06golang\x8c\x04ustr\x93\x8c\x09' # ustr(SHORT_BINUNICODE)
b'\xd0\xbc\xd0\xb0\xd0\xb9\xed\xb3\xbf\x85\x81.')
# ---- disassembly ----
# xdiss returns disassembly of a pickle as string.
def xdiss(pickletools, p): # -> str
out = six.StringIO()
pickletools.dis(p, out)
return out.getvalue()
# ---- loads and normalized dumps ----
# xloads loads pickle p via pickle.loads
# it also verifies that .load and Unpickler.load give the same result.
def xloads(pickle, p, **kw):
obj1 = _xpickle_attr(pickle, 'loads')(p, **kw)
obj2 = _xpickle_attr(pickle, 'load') (io.BytesIO(p), **kw)
obj3 = _xpickle_attr(pickle, 'Unpickler')(io.BytesIO(p), **kw).load()
assert type(obj2) is type(obj1)
assert type(obj3) is type(obj1)
assert obj1 == obj2 == obj3
return obj1
# xdumps dumps obj via pickle.dumps
# it also verifies that .dump and Pickler.dump give the same.
def xdumps(pickle, obj, proto, **kw):
p1 = _xpickle_attr(pickle, 'dumps')(obj, proto, **kw)
f2 = io.BytesIO(); _xpickle_attr(pickle, 'dump')(obj, f2, proto, **kw)
p2 = f2.getvalue()
f3 = io.BytesIO(); _xpickle_attr(pickle, 'Pickler')(f3, proto, **kw).dump(obj)
p3 = f3.getvalue()
assert type(p1) is bytes
assert type(p2) is bytes
assert type(p3) is bytes
assert p1 == p2 == p3
# remove uninteresting parts: PROTO / FRAME header and unused PUTs
if proto >= 2:
assert p1.startswith(PROTO(proto))
return pickle_normalize(pickle2tools(pickle), p1)
def _xpickle_attr(pickle, name):
# on py3 pickle.py tries to import from C _pickle to optimize by default
# -> verify py version if we are asked to test pickle.py
if six.PY3 and (pickle is stdPickle):
assert getattr(pickle, name) is getattr(cPickle, name)
name = '_'+name
return getattr(pickle, name)
# pickle_normalize returns normalized version of pickle p.
#
# - PROTO and FRAME opcodes are removed from header,
# - unused PUT, BINPUT and MEMOIZE opcodes - those without corresponding GET are removed,
# - *PUT indices start from 0 (this unifies cPickle with pickle).
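# For example (cf. test_pickle_normalize below):
#
#   pickle_normalize(pickletools, PROTO(2) + b'I1\n.')  ==  b'I1\n.'
#   pickle_normalize(pickletools, b'I1\n'+MEMOIZE+b'I2\n'+MEMOIZE+GET(0)+b'.')  ==  \
#                                 b'I1\n'+MEMOIZE+b'I2\n'+GET(0)+b'.'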
def pickle_normalize(pickletools, p):
def iter_pickle(p): # -> i(op, arg, pdata)
op_prev = None
arg_prev = None
pos_prev = None
for op, arg, pos in pickletools.genops(p):
if op_prev is not None:
pdata_prev = p[pos_prev:pos]
yield (op_prev, arg_prev, pdata_prev)
op_prev = op
arg_prev = arg
pos_prev = pos
if op_prev is not None:
yield (op_prev, arg_prev, p[pos_prev:])
memo_oldnew = {} # idx used in original put/get -> new index | None if not get
idx = 0
for op, arg, pdata in iter_pickle(p):
if 'PUT' in op.name:
memo_oldnew.setdefault(arg, None)
elif 'MEMOIZE' in op.name:
memo_oldnew.setdefault(len(memo_oldnew), None)
elif 'GET' in op.name:
if memo_oldnew.get(arg) is None:
memo_oldnew[arg] = idx
idx += 1
pout = b''
memo_old = set() # idx used in original put
for op, arg, pdata in iter_pickle(p):
if op.name in ('PROTO', 'FRAME'):
continue
if 'PUT' in op.name:
memo_old.add(arg)
newidx = memo_oldnew.get(arg)
if newidx is None:
continue
pdata = globals()[op.name](newidx)
if 'MEMOIZE' in op.name:
idx = len(memo_old)
memo_old.add(idx)
newidx = memo_oldnew.get(idx)
if newidx is None:
continue
if 'GET' in op.name:
newidx = memo_oldnew[arg]
assert newidx is not None
pdata = globals()[op.name](newidx)
pout += pdata
return pout
P = struct.pack
def PROTO(version): return b'\x80' + P('<B', version)
def FRAME(size): return b'\x95' + P('<Q', size)
def GET(idx): return b'g%d\n' % (idx,)
def PUT(idx): return b'p%d\n' % (idx,)
def BINPUT(idx): return b'q' + P('<B', idx)
def BINGET(idx): return b'h' + P('<B', idx)
def LONG_BINPUT(idx): return b'r' + P('<I', idx)
def LONG_BINGET(idx): return b'j' + P('<I', idx)
MEMOIZE = b'\x94'
def test_pickle_normalize(pickletools):
def diss(p):
return xdiss(pickletools, p)
proto = 0
for op in pickletools.opcodes:
proto = max(proto, op.proto)
assert proto >= 2
def _(p, p_normok):
p_norm = pickle_normalize(pickletools, p)
assert p_norm == p_normok, diss(p_norm)
_(b'.', b'.')
_(b'I1\n.', b'I1\n.')
_(PROTO(2)+b'I1\n.', b'I1\n.')
putgetv = [(PUT,GET), (BINPUT, BINGET)]
if proto >= 4:
putgetv.append((LONG_BINPUT, LONG_BINGET))
for (put,get) in putgetv:
_(b'(I1\n'+put(1) + b'I2\n'+put(2) +b't'+put(3)+b'0'+get(3)+put(4)+b'.',
b'(I1\nI2\nt'+put(0)+b'0'+get(0)+b'.')
if proto >= 4:
_(FRAME(4)+b'I1\n.', b'I1\n.')
_(b'I1\n'+MEMOIZE+b'I2\n'+MEMOIZE+GET(0)+b'.',
b'I1\n'+MEMOIZE+b'I2\n'+GET(0)+b'.')
# ---- misc ----
# HIGHEST_PROTOCOL returns highest protocol supported by pickle.
def HIGHEST_PROTOCOL(pickle):
if six.PY3 and pickle is cPickle:
pmax = stdPickle.HIGHEST_PROTOCOL # py3: _pickle has no .HIGHEST_PROTOCOL
elif six.PY3 and pickle is _zpickle:
pmax = zpickle.HIGHEST_PROTOCOL # ----//---- for _zpickle
else:
pmax = pickle.HIGHEST_PROTOCOL
assert pmax >= 2
return pmax
@@ -169,6 +169,8 @@
// [1] Libtask: a Coroutine Library for C and Unix. https://swtch.com/libtask.
// [2] http://9p.io/magic/man2html/2/thread.
+#include "golang/runtime/platform.h"
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
@@ -177,21 +179,18 @@
#include <sys/stat.h>
#include <fcntl.h>
-#ifdef _MSC_VER // no mode_t on msvc
+#ifdef LIBGOLANG_CC_msc // no mode_t on msvc
typedef int mode_t;
#endif
// DSO symbols visibility (based on https://gcc.gnu.org/wiki/Visibility)
-#if defined _WIN32 || defined __CYGWIN__
+#ifdef LIBGOLANG_OS_windows
#define LIBGOLANG_DSO_EXPORT __declspec(dllexport)
#define LIBGOLANG_DSO_IMPORT __declspec(dllimport)
-#elif __GNUC__ >= 4
+#else
#define LIBGOLANG_DSO_EXPORT __attribute__ ((visibility ("default")))
#define LIBGOLANG_DSO_IMPORT __attribute__ ((visibility ("default")))
-#else
-#define LIBGOLANG_DSO_EXPORT
-#define LIBGOLANG_DSO_IMPORT
#endif
#if BUILDING_LIBGOLANG
@@ -438,6 +437,10 @@ constexpr Nil nil = nullptr;
// string is alias for std::string.
using string = std::string;
+// byte/rune types related to string.
+using byte = uint8_t;
+using rune = int32_t;
// func is alias for std::function.
template<typename F>
using func = std::function<F>;
...
-// Copyright (C) 2019-2023 Nexedi SA and Contributors.
+// Copyright (C) 2019-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -38,7 +38,7 @@
// cut this short
// (on darwing sys_siglist declaration is normally provided)
// (on windows sys_siglist is not available at all)
-#if !(defined(__APPLE__) || defined(_WIN32))
+#if !(defined(LIBGOLANG_OS_darwin) || defined(LIBGOLANG_OS_windows))
extern "C" {
extern const char * const sys_siglist[];
}
@@ -287,7 +287,7 @@ string Signal::String() const {
const Signal& sig = *this;
const char *sigstr = nil;
-#ifdef _WIN32
+#ifdef LIBGOLANG_OS_windows
switch (sig.signo) {
case SIGABRT: return "Aborted";
case SIGBREAK: return "Break";
...
#ifndef _NXD_LIBGOLANG_OS_H
#define _NXD_LIBGOLANG_OS_H
//
-// Copyright (C) 2019-2023 Nexedi SA and Contributors.
+// Copyright (C) 2019-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -96,7 +96,7 @@ private:
// Open opens file @path.
LIBGOLANG_API std::tuple<File, error> Open(const string &path, int flags = O_RDONLY,
mode_t mode =
-#if !defined(_MSC_VER)
+#if !defined(LIBGOLANG_CC_msc)
S_IRUSR | S_IWUSR | S_IXUSR |
S_IRGRP | S_IWGRP | S_IXGRP |
S_IROTH | S_IWOTH | S_IXOTH
...
-// Copyright (C) 2021-2023 Nexedi SA and Contributors.
+// Copyright (C) 2021-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -89,7 +89,7 @@
#include <atomic>
#include <tuple>
-#if defined(_WIN32)
+#if defined(LIBGOLANG_OS_windows)
# include <windows.h>
#endif
@@ -101,7 +101,7 @@
# define debugf(format, ...) do {} while (0)
#endif
-#if defined(_MSC_VER)
+#ifdef LIBGOLANG_CC_msc
# define HAVE_SIGACTION 0
#else
# define HAVE_SIGACTION 1
@@ -194,7 +194,7 @@ void _init() {
if (err != nil)
panic("os::newFile(_wakerx");
_waketx = vfd[1];
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
if (sys::Fcntl(_waketx, F_SETFL, O_NONBLOCK) < 0)
panic("fcntl(_waketx, O_NONBLOCK)"); // TODO +syserr
#else
...
-# Copyright (C) 2019-2023 Nexedi SA and Contributors.
+# Copyright (C) 2019-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
@@ -212,9 +212,11 @@ def _with_build_defaults(name, kw): # -> (pygo, kw')
dependv = kw.get('depends', [])[:]
dependv.extend(['%s/golang/%s' % (pygo, _) for _ in [
'libgolang.h',
+'runtime.h',
'runtime/internal.h',
'runtime/internal/atomic.h',
'runtime/internal/syscall.h',
+'runtime/platform.h',
'context.h',
'cxx.h',
'errors.h',
@@ -226,6 +228,7 @@
'os.h',
'os/signal.h',
'pyx/runtime.h',
+'unicode/utf8.h',
'_testing.h',
'_compat/windows/strings.h',
'_compat/windows/unistd.h',
@@ -264,6 +267,8 @@ def Extension(name, sources, **kw):
'_fmt.pxd',
'io.pxd',
'_io.pxd',
+'strconv.pxd',
+'_strconv.pxd',
'strings.pxd',
'sync.pxd',
'_sync.pxd',
@@ -274,6 +279,8 @@
'os/signal.pxd',
'os/_signal.pxd',
'pyx/runtime.pxd',
+'unicode/utf8.pxd',
+'unicode/_utf8.pxd',
]])
kw['depends'] = dependv
...
// Copyright (C) 2023-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package runtime mirrors Go package runtime.
// See runtime.h for package overview.
#include "golang/runtime.h"
// golang::runtime::
namespace golang {
namespace runtime {
const string OS =
#ifdef LIBGOLANG_OS_linux
"linux"
#elif defined(LIBGOLANG_OS_darwin)
"darwin"
#elif defined(LIBGOLANG_OS_windows)
"windows"
#else
# error
#endif
;
const string CC =
#ifdef LIBGOLANG_CC_gcc
"gcc"
#elif defined(LIBGOLANG_CC_clang)
"clang"
#elif defined(LIBGOLANG_CC_msc)
"msc"
#else
# error
#endif
;
}} // golang::runtime::
#ifndef _NXD_LIBGOLANG_RUNTIME_H
#define _NXD_LIBGOLANG_RUNTIME_H
// Copyright (C) 2023-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package runtime mirrors Go package runtime.
#include "golang/libgolang.h"
// golang::runtime::
namespace golang {
namespace runtime {
// OS indicates operating system, that is running the program.
//
// e.g. "linux", "darwin", "windows", ...
extern LIBGOLANG_API const string OS;
// CC indicates C/C++ compiler, that compiled the program.
//
// e.g. "gcc", "clang", "msc", ...
extern LIBGOLANG_API const string CC;
}} // golang::runtime::
#endif // _NXD_LIBGOLANG_RUNTIME_H
@@ -40,7 +40,7 @@ ELSE:
from gevent import sleep as pygsleep
-from libc.stdint cimport uint8_t, uint64_t, UINT64_MAX
+from libc.stdint cimport uint64_t, UINT64_MAX
cdef extern from *:
ctypedef bint cbool "bool"
@@ -52,7 +52,7 @@ from golang.runtime._libgolang cimport _libgolang_runtime_ops, _libgolang_sema,
from golang.runtime.internal cimport syscall
from golang.runtime cimport _runtime_thread
from golang.runtime._runtime_pymisc cimport PyExc, pyexc_fetch, pyexc_restore
-from golang cimport topyexc
+from golang cimport byte, topyexc
from libc.stdlib cimport calloc, free
from libc.errno cimport EBADF
@@ -351,7 +351,7 @@ cdef nogil:
cdef:
bint _io_read(IOH* ioh, int* out_n, void *buf, size_t count):
pygfobj = <object>ioh.pygfobj
-cdef uint8_t[::1] mem = <uint8_t[:count]>buf
+cdef byte[::1] mem = <byte[:count]>buf
xmem = memoryview(mem) # to avoid https://github.com/cython/cython/issues/3900 on mem[:0]=b''
try:
# NOTE buf might be on stack, so it must not be accessed, e.g. from
@@ -388,7 +388,7 @@ cdef nogil:
cdef:
bint _io_write(IOH* ioh, int* out_n, const void *buf, size_t count):
pygfobj = <object>ioh.pygfobj
-cdef const uint8_t[::1] mem = <const uint8_t[:count]>buf
+cdef const byte[::1] mem = <const byte[:count]>buf
# NOTE buf might be on stack, so it must not be accessed, e.g. from
# FileObjectThread, while our greenlet is parked (see STACK_DEAD_WHILE_PARKED
...
-// Copyright (C) 2022-2023 Nexedi SA and Contributors.
+// Copyright (C) 2022-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -20,7 +20,7 @@
#include "golang/runtime/internal/atomic.h"
#include "golang/libgolang.h"
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
#include <pthread.h>
#endif
@@ -44,7 +44,7 @@ static void _forkNewEpoch() {
void _init() {
// there is no fork on windows
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
int e = pthread_atfork(/*prepare*/nil, /*inparent*/nil, /*inchild*/_forkNewEpoch);
if (e != 0)
panic("pthread_atfork failed");
...
-// Copyright (C) 2021-2023 Nexedi SA and Contributors.
+// Copyright (C) 2021-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -58,9 +58,9 @@ string _Errno::Error() {
char ebuf[128];
bool ok;
-#if __APPLE__
+#ifdef LIBGOLANG_OS_darwin
ok = (::strerror_r(-e.syserr, ebuf, sizeof(ebuf)) == 0);
-#elif defined(_WIN32)
+#elif defined(LIBGOLANG_OS_windows)
ok = (::strerror_s(ebuf, sizeof(ebuf), -e.syserr) == 0);
#else
char *estr = ::strerror_r(-e.syserr, ebuf, sizeof(ebuf));
@@ -102,7 +102,7 @@ __Errno Close(int fd) {
return err;
}
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
__Errno Fcntl(int fd, int cmd, int arg) {
int save_errno = errno;
int err = ::fcntl(fd, cmd, arg);
@@ -124,7 +124,7 @@ __Errno Fstat(int fd, struct ::stat *out_st) {
int Open(const char *path, int flags, mode_t mode) {
int save_errno = errno;
-#ifdef _WIN32 // default to open files in binary mode
+#ifdef LIBGOLANG_OS_windows // default to open files in binary mode
if ((flags & (_O_TEXT | _O_BINARY)) == 0)
flags |= _O_BINARY;
#endif
@@ -141,9 +141,9 @@ __Errno Pipe(int vfd[2], int flags) {
return -EINVAL;
int save_errno = errno;
int err;
-#ifdef __linux__
+#ifdef LIBGOLANG_OS_linux
err = ::pipe2(vfd, flags);
-#elif defined(_WIN32)
+#elif defined(LIBGOLANG_OS_windows)
err = ::_pipe(vfd, 4096, flags | _O_BINARY);
#else
err = ::pipe(vfd);
@@ -167,7 +167,7 @@ out:
return err;
}
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
__Errno Sigaction(int signo, const struct ::sigaction *act, struct ::sigaction *oldact) {
int save_errno = errno;
int err = ::sigaction(signo, act, oldact);
...
#ifndef _NXD_LIBGOLANG_RUNTIME_INTERNAL_SYSCALL_H
#define _NXD_LIBGOLANG_RUNTIME_INTERNAL_SYSCALL_H
-// Copyright (C) 2021-2023 Nexedi SA and Contributors.
+// Copyright (C) 2021-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
@@ -63,13 +63,13 @@ LIBGOLANG_API int/*n|err*/ Read(int fd, void *buf, size_t count);
LIBGOLANG_API int/*n|err*/ Write(int fd, const void *buf, size_t count);
LIBGOLANG_API __Errno Close(int fd);
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
LIBGOLANG_API __Errno Fcntl(int fd, int cmd, int arg);
#endif
LIBGOLANG_API __Errno Fstat(int fd, struct ::stat *out_st);
LIBGOLANG_API int/*fd|err*/ Open(const char *path, int flags, mode_t mode);
LIBGOLANG_API __Errno Pipe(int vfd[2], int flags);
-#ifndef _WIN32
+#ifndef LIBGOLANG_OS_windows
LIBGOLANG_API __Errno Sigaction(int signo, const struct ::sigaction *act, struct ::sigaction *oldact);
#endif
typedef void (*sighandler_t)(int);
...
@@ -52,7 +52,7 @@
#include <linux/list.h>
// MSVC does not support statement expressions and typeof
// -> redo list_entry via C++ lambda.
-#ifdef _MSC_VER
+#ifdef LIBGOLANG_CC_msc
# undef list_entry
# define list_entry(ptr, type, member) [&]() { \
const decltype( ((type *)0)->member ) *__mptr = (ptr); \
...
#ifndef _NXD_LIBGOLANG_RUNTIME_PLATFORM_H
#define _NXD_LIBGOLANG_RUNTIME_PLATFORM_H
// Copyright (C) 2023-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Header platform.h provides preprocessor defines that describe target platform.
// LIBGOLANG_OS_<X> is defined on operating system X.
//
// List of supported operating systems: linux, darwin, windows.
#ifdef __linux__
# define LIBGOLANG_OS_linux 1
#elif defined(__APPLE__)
# define LIBGOLANG_OS_darwin 1
#elif defined(_WIN32) || defined(__CYGWIN__)
# define LIBGOLANG_OS_windows 1
#else
# error "unsupported operating system"
#endif
// LIBGOLANG_CC_<X> is defined on C/C++ compiler X.
//
// List of supported compilers: gcc, clang, msc.
#ifdef __clang__
# define LIBGOLANG_CC_clang 1
#elif defined(_MSC_VER)
# define LIBGOLANG_CC_msc 1
// NOTE gcc comes last because e.g. clang and icc define __GNUC__ as well
#elif __GNUC__
# define LIBGOLANG_CC_gcc 1
#else
# error "unsupported compiler"
#endif
#endif // _NXD_LIBGOLANG_RUNTIME_PLATFORM_H
# cython: language_level=2
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package strconv provides Go-compatible string conversions.
See _strconv.pxd for package documentation.
"""
# redirect cimport: golang.strconv -> golang._strconv (see __init__.pxd for rationale)
from golang._strconv cimport *
# -*- coding: utf-8 -*-
-# Copyright (C) 2018-2022 Nexedi SA and Contributors.
+# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
@@ -21,174 +21,7 @@
from __future__ import print_function, absolute_import
-import unicodedata, codecs
-from six import text_type as unicode # py2: unicode py3: str
-from six.moves import range as xrange
-from golang import b, u
-from golang._golang import _py_utf8_decode_rune as _utf8_decode_rune, _py_rune_error as _rune_error, _xunichr
+from golang._strconv import \
+    pyquote as quote, \
+    pyunquote as unquote, \
+    pyunquote_next as unquote_next
# _bstr is like b but also returns whether input was unicode.
def _bstr(s): # -> sbytes, wasunicode
return b(s), isinstance(s, unicode)
# _ustr is like u but also returns whether input was bytes.
def _ustr(s): # -> sunicode, wasbytes
return u(s), isinstance(s, bytes)
# quote quotes unicode|bytes string into valid "..." unicode|bytes string always quoted with ".
def quote(s):
s, wasunicode = _bstr(s)
qs = _quote(s)
if wasunicode:
qs, _ = _ustr(qs)
return qs
def _quote(s):
assert isinstance(s, bytes)
outv = []
emit = outv.append
i = 0
while i < len(s):
c = s[i:i+1]
# fast path - ASCII only
if ord(c) < 0x80:
if c in b'\\"':
emit(b'\\'+c)
# printable ASCII
elif b' ' <= c <= b'\x7e':
emit(c)
# non-printable ASCII
elif c == b'\t':
emit(br'\t')
elif c == b'\n':
emit(br'\n')
elif c == b'\r':
emit(br'\r')
# everything else is non-printable
else:
emit(br'\x%02x' % ord(c))
i += 1
# slow path - full UTF-8 decoding + unicodedata
else:
r, size = _utf8_decode_rune(s[i:])
isize = i + size
# decode error - just emit raw byte as escaped
if r == _rune_error and size == 1:
emit(br'\x%02x' % ord(c))
# printable utf-8 characters go as is
elif unicodedata.category(_xunichr(r))[0] in _printable_cat0:
emit(s[i:isize])
# everything else goes in numeric byte escapes
else:
for j in xrange(i, isize):
emit(br'\x%02x' % ord(s[j:j+1]))
i = isize
return b'"' + b''.join(outv) + b'"'
# unquote decodes "-quoted unicode|byte string.
#
# ValueError is raised if there are quoting syntax errors.
def unquote(s):
us, tail = unquote_next(s)
if len(tail) != 0:
raise ValueError('non-empty tail after closing "')
return us
# unquote_next decodes next "-quoted unicode|byte string.
#
# it returns -> (unquoted(s), tail-after-")
#
# ValueError is raised if there are quoting syntax errors.
def unquote_next(s):
s, wasunicode = _bstr(s)
us, tail = _unquote_next(s)
if wasunicode:
us, _ = _ustr(us)
tail, _ = _ustr(tail)
return us, tail
def _unquote_next(s):
assert isinstance(s, bytes)
if len(s) == 0 or s[0:0+1] != b'"':
raise ValueError('no starting "')
outv = []
emit= outv.append
s = s[1:]
while 1:
r, width = _utf8_decode_rune(s)
if width == 0:
raise ValueError('no closing "')
if r == ord('"'):
s = s[1:]
break
# regular UTF-8 character
if r != ord('\\'):
emit(s[:width])
s = s[width:]
continue
if len(s) < 2:
raise ValueError('unexpected EOL after \\')
c = s[1:1+1]
# \<c> -> <c> ; c = \ "
if c in b'\\"':
emit(c)
s = s[2:]
continue
# \t \n \r
uc = None
if c == b't': uc = b'\t'
elif c == b'n': uc = b'\n'
elif c == b'r': uc = b'\r'
# accept also \a \b \v \f that Go might produce
# Python also decodes those escapes even though it does not produce them:
# https://github.com/python/cpython/blob/2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L677-L688
elif c == b'a': uc = b'\x07'
elif c == b'b': uc = b'\x08'
elif c == b'v': uc = b'\x0b'
elif c == b'f': uc = b'\x0c'
if uc is not None:
emit(uc)
s = s[2:]
continue
# \x?? hex
if c == b'x': # XXX also handle octals?
if len(s) < 2+2:
raise ValueError('unexpected EOL after \\x')
b = codecs.decode(s[2:2+2], 'hex')
emit(b)
s = s[2+2:]
continue
raise ValueError('invalid escape \\%s' % chr(ord(c[0:0+1])))
return b''.join(outv), s
_printable_cat0 = frozenset(['L', 'N', 'P', 'S']) # letters, numbers, punctuation, symbols
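
A quick usage sketch (illustrative; names as in the reworked golang.strconv shown above): quote/unquote keep their escaping rules, but now always return bstr, as the updated tests below assert.

```py
from golang import b, bstr
from golang.strconv import quote, unquote

s = b('привет\nмир')
q = quote(s)            # -> bstr '"привет\\nмир"': UTF-8 text is kept as is, \n is escaped
assert type(q) is bstr
assert unquote(q) == s  # unquote(quote(s)) is identity
```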
# -*- coding: utf-8 -*-
-# Copyright (C) 2018-2022 Nexedi SA and Contributors.
+# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
@@ -20,12 +20,16 @@
from __future__ import print_function, absolute_import
+from golang import bstr
from golang.strconv import quote, unquote, unquote_next
from golang.gcompat import qq
-from six import int2byte as bchr, PY3
+from six import int2byte as bchr
from six.moves import range as xrange
-from pytest import raises
+from pytest import raises, mark
+import codecs
def byterange(start, stop):
b = b""
@@ -34,16 +38,9 @@ def byterange(start, stop):
return b
-# asstr converts unicode|bytes to str type of current python.
-def asstr(s):
-    if PY3:
-        if isinstance(s, bytes):
-            s = s.decode('utf-8')
-    # PY2
-    else:
-        if isinstance(s, unicode):
-            s = s.encode('utf-8')
-    return s
+def assert_bstreq(x, y):
+    assert type(x) is bstr
+    assert x == y
def test_quote():
testv = (
@@ -72,6 +69,9 @@ def test_quote():
(u'\ufffd', u'�'),
)
+# quote/unquote* always give bstr
+BEQ = assert_bstreq
for tin, tquoted in testv:
# quote(in) == quoted
# in = unquote(quoted)
@@ -79,14 +79,13 @@
tail = b'123' if isinstance(tquoted, bytes) else '123'
tquoted = q + tquoted + q # add lead/trail "
-assert quote(tin) == tquoted
+BEQ(quote(tin), tquoted)
-assert unquote(tquoted) == tin
+BEQ(unquote(tquoted), tin)
-assert unquote_next(tquoted) == (tin, type(tin)())
+_, __ = unquote_next(tquoted); BEQ(_, tin); BEQ(__, "")
-assert unquote_next(tquoted + tail) == (tin, tail)
+_, __ = unquote_next(tquoted + tail); BEQ(_, tin); BEQ(__, tail)
with raises(ValueError): unquote(tquoted + tail)
-# qq always gives str
-assert qq(tin) == asstr(tquoted)
+BEQ(qq(tin), tquoted)
# also check how it works on complementary unicode/bytes input type
if isinstance(tin, bytes):
@@ -103,14 +102,13 @@ def test_quote():
tquoted = tquoted.encode('utf-8')
tail = tail.encode('utf-8')
-assert quote(tin) == tquoted
+BEQ(quote(tin), tquoted)
-assert unquote(tquoted) == tin
+BEQ(unquote(tquoted), tin)
-assert unquote_next(tquoted) == (tin, type(tin)())
+_, __ = unquote_next(tquoted); BEQ(_, tin); BEQ(__, "")
-assert unquote_next(tquoted + tail) == (tin, tail)
+_, __ = unquote_next(tquoted + tail); BEQ(_, tin); BEQ(__, tail)
with raises(ValueError): unquote(tquoted + tail)
-# qq always gives str
-assert qq(tin) == asstr(tquoted)
+BEQ(qq(tin), tquoted)
# verify that non-canonical quotation can be unquoted too.
@@ -143,3 +141,52 @@ def test_unquote_bad():
with raises(ValueError) as exc:
unquote(tin)
assert exc.value.args == (err,)
# ---- benchmarks ----

# quoting + unquoting
uchar_testv = ['a',            # ascii
               u'α',           # 2-bytes utf8
               u'\u65e5',      # 3-bytes utf8
               u'\U0001f64f']  # 4-bytes utf8

@mark.parametrize('ch', uchar_testv)
def bench_quote(b, ch):
    s = bstr_ch1000(ch)
    q = quote
    for i in xrange(b.N):
        q(s)

def bench_stdquote(b):
    s = b'a'*1000
    q = repr
    for i in xrange(b.N):
        q(s)

@mark.parametrize('ch', uchar_testv)
def bench_unquote(b, ch):
    s = bstr_ch1000(ch)
    s = quote(s)
    unq = unquote
    for i in xrange(b.N):
        unq(s)

def bench_stdunquote(b):
    s = b'"' + b'a'*1000 + b'"'
    escape_decode = codecs.escape_decode
    def unq(s): return escape_decode(s[1:-1])[0]
    for i in xrange(b.N):
        unq(s)

# bstr_ch1000 returns bstr with many repetitions of character ch occupying ~ 1000 bytes.
def bstr_ch1000(ch): # -> bstr
    assert len(ch) == 1
    s = bstr(ch)
    s = s * (1000 // len(s))
    if len(s) % 3 == 0:
        s += 'x'
    assert len(s) == 1000
    return s
@@ -18,7 +18,7 @@
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
-"""This program helps to verify _pystr and _pyunicode.
+"""This program helps to verify b, u and underlying bstr and ustr.
It complements golang_str_test.test_strings_print.
"""
@@ -31,8 +31,17 @@ from golang.gcompat import qq
def main():
    sb = b("привет αβγ b")
    su = u("привет αβγ u")
+    print("print(b):", sb)
+    print("print(u):", su)
    print("print(qq(b)):", qq(sb))
    print("print(qq(u)):", qq(su))
+    print("print(repr(b)):", repr(sb))
+    print("print(repr(u)):", repr(su))
+    # py2: print(dict) calls PyObject_Print(flags=0) for both keys and values,
+    # not with flags=Py_PRINT_RAW used by default almost everywhere else.
+    # this way we can verify whether bstr.tp_print handles flags correctly.
+    print("print({b: u}):", {sb: su})
if __name__ == '__main__':
...
print(b): привет αβγ b
print(u): привет αβγ u
print(qq(b)): "привет αβγ b"
print(qq(u)): "привет αβγ u"
print(repr(b)): b('привет αβγ b')
print(repr(u)): u('привет αβγ u')
print({b: u}): {b('привет αβγ b'): u('привет αβγ u')}
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2022-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""This program helps to verify [:] handling for bstr and ustr.
It complements golang_str_test.test_strings_index2.
It needs to verify [:] only lightly because thorough verification is done in
test_string_index, and here we need to verify only that __getslice__, inherited
from builtin str/unicode, does not get into our way.
"""
from __future__ import print_function, absolute_import
from golang import b, u, bstr, ustr
from golang.gcompat import qq
def main():
    us = u("миру мир")
    bs = b("миру мир")

    def emit(what, uobj, bobj):
        assert type(uobj) is ustr
        assert type(bobj) is bstr
        print("u"+what, qq(uobj))
        print("b"+what, qq(bobj))

    emit("s", us, bs)
    emit("s[:]", us[:], bs[:])
    emit("s[0:1]", us[0:1], bs[0:1])
    emit("s[0:2]", us[0:2], bs[0:2])
    emit("s[1:2]", us[1:2], bs[1:2])
    emit("s[0:-1]", us[0:-1], bs[0:-1])

if __name__ == '__main__':
    main()
us "миру мир"
bs "миру мир"
us[:] "миру мир"
bs[:] "миру мир"
us[0:1] "м"
bs[0:1] "\xd0"
us[0:2] "ми"
bs[0:2] "м"
us[1:2] "и"
bs[1:2] "\xbc"
us[0:-1] "миру ми"
bs[0:-1] "миру ми\xd1"
# cython: language_level=2
# Copyright (C) 2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package utf8 mirrors Go package utf8.
See https://golang.org/pkg/unicode/utf8 for Go utf8 package documentation.
"""
from golang cimport rune
cdef extern from "golang/unicode/utf8.h" namespace "golang::unicode::utf8" nogil:
    rune RuneError
#ifndef _NXD_LIBGOLANG_UNICODE_UTF8_H
#define _NXD_LIBGOLANG_UNICODE_UTF8_H
// Copyright (C) 2023 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package utf8 mirrors Go package utf8.
#include <golang/libgolang.h>
// golang::unicode::utf8::
namespace golang {
namespace unicode {
namespace utf8 {
constexpr rune RuneError = 0xFFFD; // unicode replacement character
}}} // golang::unicode::utf8::
#endif // _NXD_LIBGOLANG_UNICODE_UTF8_H
# cython: language_level=2
# Copyright (C) 2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package utf8 mirrors Go package utf8.
See _utf8.pxd for package documentation.
"""
# redirect cimport: golang.unicode.utf8 -> golang.unicode._utf8 (see __init__.pxd for rationale)
from golang.unicode._utf8 cimport *
@@ -71,6 +71,12 @@ def test_golang_builtins():
assert error is golang.error
assert b is golang.b
assert u is golang.u
+assert bstr is golang.bstr
+assert ustr is golang.ustr
+assert biter is golang.biter
+assert uiter is golang.uiter
+assert bbyte is golang.bbyte
+assert uchr is golang.uchr
# indirectly verify golang.__all__
for k in golang.__all__:
...
@@ -19,6 +19,25 @@
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
+# patch cython to allow `cdef class X(bytes)` while building pygolang to
+# workaround https://github.com/cython/cython/issues/711
+# see `cdef class pybstr` in golang/_golang_str.pyx for details.
+# (should become unneeded with cython 3 once https://github.com/cython/cython/pull/5212 is finished)
+import inspect
+from Cython.Compiler.PyrexTypes import BuiltinObjectType
+def pygo_cy_builtin_type_name_set(self, v):
+    self._pygo_name = v
+def pygo_cy_builtin_type_name_get(self):
+    name = self._pygo_name
+    if name == 'bytes':
+        caller = inspect.currentframe().f_back.f_code.co_name
+        if caller == 'analyse_declarations':
+            # need anything different from 'bytes' to deactivate check in
+            # https://github.com/cython/cython/blob/c21b39d4/Cython/Compiler/Nodes.py#L4759-L4762
+            name = 'xxx'
+    return name
+BuiltinObjectType.name = property(pygo_cy_builtin_type_name_get, pygo_cy_builtin_type_name_set)
from setuptools import find_packages
from setuptools.command.install_scripts import install_scripts as _install_scripts
from setuptools.command.develop import develop as _develop
@@ -166,7 +185,8 @@ for pkg in R:
R['all'] = Rall
# ipython/pytest are required to test py2 integration patches
-R['all_test'] = Rall.union(['ipython', 'pytest']) # pip does not like "+" in all+test
+# zodbpickle is used to test pickle support for bstr/ustr
+R['all_test'] = Rall.union(['ipython', 'pytest', 'zodbpickle']) # pip does not like "+" in all+test
# extras_require <- R
extras_require = {}
@@ -207,6 +227,7 @@ setup(
['golang/runtime/libgolang.cpp',
'golang/runtime/internal/atomic.cpp',
'golang/runtime/internal/syscall.cpp',
+'golang/runtime.cpp',
'golang/context.cpp',
'golang/errors.cpp',
'golang/fmt.cpp',
@@ -218,9 +239,11 @@ setup(
'golang/time.cpp'],
depends = [
'golang/libgolang.h',
+'golang/runtime.h',
'golang/runtime/internal.h',
'golang/runtime/internal/atomic.h',
'golang/runtime/internal/syscall.h',
+'golang/runtime/platform.h',
'golang/context.h',
'golang/cxx.h',
'golang/errors.h',
@@ -249,7 +272,9 @@ setup(
ext_modules = [
Ext('golang._golang',
['golang/_golang.pyx'],
-depends = ['golang/_golang_str.pyx']),
+depends = [
+    'golang/_golang_str.pyx',
+    'golang/_golang_str_pickle.pyx']),
Ext('golang.runtime._runtime_thread',
['golang/runtime/_runtime_thread.pyx']),
@@ -301,6 +326,9 @@ setup(
Ext('golang.os._signal',
['golang/os/_signal.pyx']),
+Ext('golang._strconv',
+    ['golang/_strconv.pyx']),
Ext('golang._strings_test',
['golang/_strings_test.pyx',
'golang/strings_test.cpp']),
...