Commit 50b3808c authored by Kirill Smelkov

Uniform UTF8-based approach to strings

Context: together with Jérome we have been struggling to port Zodbtools to
Python3 for several years. Despite several incremental attempts[1,2,3] we are
not there yet, the main difficulty being the backward compatibility breakage
that Python3 introduced for bytes and unicode. This spring I tried once again
to finish the port and could not reach a satisfactory result, so I finally
decided to address the problem at its root: at the level of strings, where
backward compatibility was broken, with the idea to fix everything once and
for all.

In 2018, in "Python 3 Losses: Nexedi Perspective"[4] and the associated "cost
overview"[5], Jean-Paul highlighted the breakage of backward compatibility for
strings in Python 3 as the major problem.

In 2019 we had some conversations with Jérome about this topic as well[6,7].

In 2020 I started to approach it with `b` and `u`, which provide
always-working conversion between bytes and unicode[8], and via limited
usage of custom bytes- and unicode-like types that are interoperable with both
bytes and unicode simultaneously[9].

Today, with this work, I'm finally exposing those types for general usage, so
that the bytes/unicode problem can be handled automatically. An overview of the
functionality is provided below:

---- 8< ----

Pygolang, similarly to Go, provides a uniform UTF8-based approach to strings with
the idea to make working with byte- and unicode-strings easy and transparently
interoperable:

- `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` (*).
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.

The conversion, in both encoding and decoding, never fails and never loses
information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
even if bytes data is not valid UTF-8.
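
The round-trip property can be sketched with nothing but the standard library. This is only an illustration on Python 3, assuming the stdlib `surrogateescape` error handler as a stand-in for Pygolang's own UTF-8 codec; `decode_utf8b` and `encode_utf8b` are hypothetical names, not part of Pygolang:

```py
# Sketch only: uses the stdlib 'surrogateescape' handler, not Pygolang's codec.
def decode_utf8b(data):
    """Decode bytes as UTF-8; each undecodable byte becomes a lone surrogate."""
    return data.decode('utf-8', 'surrogateescape')

def encode_utf8b(text):
    """Inverse of decode_utf8b: escaped surrogates turn back into raw bytes."""
    return text.encode('utf-8', 'surrogateescape')

valid   = 'привет мир'.encode('utf-8')
invalid = b'\xff\xfe hello \xd0'        # not valid UTF-8

# bytes -> unicode -> bytes is identity even for invalid UTF-8
for data in (valid, invalid):
    assert encode_utf8b(decode_utf8b(data)) == data

# unicode -> bytes -> unicode is identity for ordinary text
text = 'привет мир'
assert decode_utf8b(encode_utf8b(text)) == text
```

The stdlib handler alone does not cover the `ustr→bstr→ustr` direction for every possible unicode input (for example it refuses to encode a lone `'\ud800'`), so the guarantee stated above relies on Pygolang's own encoder rather than on this sketch.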

Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.

Semantically `bstr` is an array of bytes, while `ustr` is an array of
unicode-characters. Accessing their elements by `[index]` and iterating them yields a byte and
a unicode character respectively (+). However it is possible to yield unicode
characters when iterating `bstr` via `uiter`, and to yield byte characters when
iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of the
time, and `ustr` only needs to be used for random access to string characters.
See [Strings, bytes, runes and characters in Go](https://blog.golang.org/strings) for an overview of this approach.
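
The two iteration views can be sketched on plain Python 3 `bytes` and `str`. The helpers `uiter_sketch` and `biter_sketch` below are hypothetical stand-ins written for illustration; they are not the `uiter`/`biter` that Pygolang provides, and they assume the stdlib `surrogateescape` handler for invalid data:

```py
# Sketch only: hypothetical helpers mimicking the idea of uiter/biter.
def uiter_sketch(data):
    """Yield unicode characters from UTF-8 encoded bytes."""
    for ch in data.decode('utf-8', 'surrogateescape'):
        yield ch

def biter_sketch(text):
    """Yield 1-byte bytes objects from the UTF-8 encoding of text."""
    data = text.encode('utf-8', 'surrogateescape')
    for i in range(len(data)):
        yield data[i:i+1]

s = 'мир'.encode('utf-8')            # 3 characters, 6 bytes
chars = list(uiter_sketch(s))        # character view of byte data
raw   = list(biter_sketch('мир'))    # byte view of unicode data

assert chars == ['м', 'и', 'р']
assert len(raw) == 6 and b''.join(raw) == s
```

The point of the sketch is that the same string can be walked either way without committing to one storage type up front.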

Operations between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
operations between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
to `ustr`. When the coercion happens, `bytes` and `bytearray`, similarly to
`bstr`, are also treated as UTF8-encoded strings.
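
To make the coercion rule concrete, here is a minimal sketch for Python 3 of a `bytes` subclass whose `+` accepts unicode operands and always yields the byte-string type. `BStrSketch` is a hypothetical class written for this note; the real `bstr` is implemented in Cython and covers far more operations:

```py
# Sketch only: hypothetical class, covers just `+` to show the coercion idea.
class BStrSketch(bytes):
    """bytes subclass; str operands are treated as UTF-8 encoded strings."""

    @staticmethod
    def _to_bytes(other):
        if isinstance(other, str):
            return other.encode('utf-8', 'surrogateescape')
        if isinstance(other, (bytes, bytearray)):
            return bytes(other)
        return None                     # not a string-like operand

    def __add__(self, other):
        data = self._to_bytes(other)
        if data is None:
            return NotImplemented
        return BStrSketch(bytes.__add__(self, data))

    def __radd__(self, other):
        data = self._to_bytes(other)
        if data is None:
            return NotImplemented
        return BStrSketch(data + bytes(self))

s = BStrSketch('привет'.encode('utf-8'))
s += ' мир'                             # unicode operand, result stays BStrSketch
t = '-> ' + s                           # also coerces when BStrSketch is on the right

assert isinstance(s, BStrSketch) and isinstance(s, bytes)
assert bytes(s) == 'привет мир'.encode('utf-8')
assert isinstance(t, BStrSketch)
```

Because the sketch subclasses `bytes`, existing `isinstance(x, bytes)` checks keep passing, which is the property a drop-in replacement needs.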

`bstr` and `ustr` are meant to be drop-in replacements for the standard
`str`/`unicode` classes. They support all methods of `str`/`unicode` and, in
particular, their constructors accept arbitrary objects and either convert or stringify them. For
cases when no stringification is desired, and one only wants to convert
`bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object with `buffer`
interface (%), to a Pygolang string, `b` and `u` provide a way to make sure an
object is either `bstr` or `ustr` respectively.
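
The difference between "convert only" and "convert or stringify" can be sketched as follows on Python 3. Both functions are hypothetical illustrations; in particular, raising `TypeError` for non-string objects is an assumption of this sketch about how a strict converter would behave, and `memoryview` stands in for "an object with buffer interface":

```py
# Sketch only: hypothetical helpers contrasting strict conversion with stringification.
def b_sketch(obj):
    """Convert a string-like obj to UTF-8 bytes; refuse to stringify anything else."""
    if isinstance(obj, str):
        return obj.encode('utf-8', 'surrogateescape')
    if isinstance(obj, (bytes, bytearray, memoryview)):
        return bytes(obj)
    raise TypeError('b_sketch: invalid type %s' % type(obj).__name__)

def bstr_ctor_sketch(obj):
    """Constructor-like behaviour: convert string-likes, stringify everything else."""
    try:
        return b_sketch(obj)
    except TypeError:
        return str(obj).encode('utf-8', 'surrogateescape')

assert b_sketch('мир') == 'мир'.encode('utf-8')
assert b_sketch(memoryview(b'abc')) == b'abc'
assert bstr_ctor_sketch(123) == b'123'      # arbitrary object is stringified

try:
    b_sketch(123)                           # strict conversion rejects non-strings
except TypeError:
    rejected = True
else:
    rejected = False
assert rejected
```

Keeping the strict path separate makes accidental stringification of, say, `None` or a number visible at the call site instead of silently producing `'None'`.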

Usage example:

```py
   from golang import b, u, uiter

   s  = b('привет')     # s is bstr corresponding to UTF-8 encoding of 'привет'.
   s += ' мир'          # s is b('привет мир')
   for c in uiter(s):   # c will iterate through
        ...             #     [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]

   # the following gives b('привет мир труд май')
   b('привет %s %s %s') % (u'мир',                  # raw unicode
                           u'труд'.encode('utf-8'), # raw bytes
                           u('май'))                # ustr

   def f(s):
      s = u(s)          # make sure s is ustr, decoding as UTF-8(^) if it was bstr, bytes, bytearray or buffer.
      ...               # (^) the decoding never fails nor loses information.
```

(*) `unicode` on Python2, `str` on Python3.
(+) ordinal of such byte and unicode character can be obtained via regular `ord`.
    For completeness `bbyte` and `uchr` are also provided for constructing 1-byte `bstr` and 1-character `ustr` from ordinal.
(%) data in buffer, similarly to `bytes` and `bytearray`, is treated as UTF8-encoded string.
    Notice that only explicit conversion through `b` and `u` accepts objects with buffer interface. Automatic coercion does not.

---- 8< ----

With this, e.g. Zodbtools is finally ported to Python3 easily[10].

One note is that we change `b` and `u` to return `bstr`/`ustr` instead of
`bytes`/`unicode`. This is a change in behaviour, but I hope it won't break
anything: the now-returned `bstr` and `ustr` are meant to be drop-in
replacements for the standard string types, and there are not many existing
`b` and `u` users. We just need to make sure that the places that already use
`b` and `u` continue to work. Those include Zodbtools, Nxdtest[11], and
lonet[12], which should continue to work ok.

@klaus, you once said that you use `b` and `u` somewhere as well. Please do not
hesitate to let me know if this change causes any issues for you, and we will,
hopefully, try to find a solution.

Kirill

/cc @jerome, @klaus, @kazuhiko, @vpelletier, @yusei, @tatuya
/reviewed-and-discussed-on !21

[1] zodbtools!12
[2] zodbtools!13
[3] zodbtools!16
[4] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20/1
[5] https://www.nexedi.com/NXD-Presentation.Multicore.PyconFR.2018?portal_skin=CI_slideshow#/20
[6] zodbtools!8 (comment 73726)
[7] zodbtools!13 (comment 81646)
[8] bcb95cd5
[9] edc7aaab
[10] zodbtools@9861c136
[11] https://lab.nexedi.com/nexedi/nxdtest
[12] https://lab.nexedi.com/kirr/go123/blob/master/xnet/lonet/__init__.py
parents f59a785d 5bf08f8b
......@@ -2,6 +2,9 @@ include COPYING README.rst CHANGELOG.rst tox.ini pyproject.toml trun .lsan-ignor
include golang/libgolang.h
include golang/runtime/libgolang.cpp
include golang/runtime/libpyxruntime.cpp
include golang/runtime/platform.h
include golang/runtime.h
include golang/runtime.cpp
include golang/pyx/runtime.h
include golang/pyx/testprog/golang_dso_user/dsouser/dso.h
include golang/pyx/testprog/golang_dso_user/dsouser/dso.cpp
......
......@@ -10,7 +10,7 @@ Package `golang` provides Go-like features for Python:
- `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining.
- `b` and `u` provide way to make sure an object is either bytes or unicode.
- `b`, `u` and `bstr`/`ustr` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace.
Package `golang.pyx` provides__ similar features for Cython/nogil.
......@@ -229,19 +229,64 @@ __ https://www.python.org/dev/peps/pep-3134/
Strings
-------
`b` and `u` provide way to make sure an object is either bytes or unicode.
`b(obj)` converts str/unicode/bytes obj to UTF-8 encoded bytestring, while
`u(obj)` converts str/unicode/bytes obj to unicode string. For example::
Pygolang, similarly to Go, provides uniform UTF8-based approach to strings with
the idea to make working with byte- and unicode- strings easy and transparently
interoperable:
b("привет мир") # -> gives bytes corresponding to UTF-8 encoding of "привет мир".
- `bstr` is byte-string: it is based on `bytes` and can automatically convert to/from `unicode` [*]_.
- `ustr` is unicode-string: it is based on `unicode` and can automatically convert to/from `bytes`.
def f(s):
s = u(s) # make sure s is unicode, decoding as UTF-8(*) if it was bytes.
... # (*) but see below about lack of decode errors.
The conversion, in both encoding and decoding, never fails and never loses
information: `bstr→ustr→bstr` and `ustr→bstr→ustr` are always identity
even if bytes data is not valid UTF-8.
Both `bstr` and `ustr` represent strings. They are two different *representations* of the same entity.
Semantically `bstr` is array of bytes, while `ustr` is array of
unicode-characters. Accessing their elements by `[index]` and iterating them yield byte and
unicode character correspondingly [*]_. However it is possible to yield unicode
character when iterating `bstr` via `uiter`, and to yield byte character when
iterating `ustr` via `biter`. In practice `bstr` + `uiter` is enough 99% of
the time, and `ustr` only needs to be used for random access to string
characters. See `Strings, bytes, runes and characters in Go`__ for overview of
this approach.
__ https://blog.golang.org/strings
Operations in between `bstr` and `ustr`/`unicode` / `bytes`/`bytearray` coerce to `bstr`, while
operations in between `ustr` and `bstr`/`bytes`/`bytearray` / `unicode` coerce
to `ustr`. When the coercion happens, `bytes` and `bytearray`, similarly to
`bstr`, are also treated as UTF8-encoded strings.
The conversion in both encoding and decoding never fails and never loses
information: `b(u(·))` and `u(b(·))` are always identity for bytes and unicode
correspondingly, even if bytes input is not valid UTF-8.
`bstr` and `ustr` are meant to be drop-in replacements for standard
`str`/`unicode` classes. They support all methods of `str`/`unicode` and in
particular their constructors accept arbitrary objects and either convert or stringify them. For
cases when no stringification is desired, and one only wants to convert
`bstr`/`ustr` / `unicode`/`bytes`/`bytearray`, or an object with `buffer`
interface [*]_, to Pygolang string, `b` and `u` provide way to make sure an
object is either `bstr` or `ustr` correspondingly.
Usage example::
s = b('привет') # s is bstr corresponding to UTF-8 encoding of 'привет'.
s += ' мир' # s is b('привет мир')
for c in uiter(s): # c will iterate through
... # [u(_) for _ in ('п','р','и','в','е','т',' ','м','и','р')]
# the following gives b('привет мир труд май')
b('привет %s %s %s') % (u'мир', # raw unicode
u'труд'.encode('utf-8'), # raw bytes
u('май')) # ustr
def f(s):
s = u(s) # make sure s is ustr, decoding as UTF-8(*) if it was bstr, bytes, bytearray or buffer.
... # (*) the decoding never fails nor loses information.
.. [*] `unicode` on Python2, `str` on Python3.
.. [*] | ordinal of such byte and unicode character can be obtained via regular `ord`.
| For completeness `bbyte` and `uchr` are also provided for constructing 1-byte `bstr` and 1-character `ustr` from ordinal.
.. [*] | data in buffer, similarly to `bytes` and `bytearray`, is treated as UTF8-encoded string.
| Notice that only explicit conversion through `b` and `u` accept objects with buffer interface. Automatic coercion does not.
Import
......
......@@ -9,6 +9,7 @@
/_io.cpp
/_os.cpp
/_os_test.cpp
/_strconv.cpp
/_strings_test.cpp
/_sync.cpp
/_sync_test.cpp
......
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Copyright (C) 2018-2025 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -24,7 +24,7 @@
- `func` allows to define methods separate from class.
- `defer` allows to schedule a cleanup from the main control flow.
- `error` and package `errors` provide error chaining.
- `b` and `u` provide way to make sure an object is either bytes or unicode.
- `b`, `u`, `bstr`/`ustr` and `biter`/`uiter` provide uniform UTF8-based approach to strings.
- `gimport` allows to import python modules by full path in a Go workspace.
See README for thorough overview.
......@@ -36,7 +36,8 @@ from __future__ import print_function, absolute_import
__version__ = "0.1"
__all__ = ['go', 'chan', 'select', 'default', 'nilchan', 'defer', 'panic',
'recover', 'func', 'error', 'b', 'u', 'gimport']
'recover', 'func', 'error', 'b', 'u', 'bstr', 'ustr', 'biter', 'uiter', 'bbyte', 'uchr',
'gimport']
import setuptools_dso
setuptools_dso.dylink_prepare_dso('golang.runtime.libgolang')
......@@ -369,12 +370,11 @@ from ._golang import \
pypanic as panic, \
pyerror as error, \
pyb as b, \
pyu as u
# import golang.strconv into _golang from here to workaround cyclic golang ↔ strconv dependency
def _():
from . import _golang
from . import strconv
_golang.pystrconv = strconv
_()
del _
pybstr as bstr, \
pybbyte as bbyte, \
pyu as u, \
pyustr as ustr, \
pyuchr as uchr, \
pybiter as biter, \
pyuiter as uiter, \
_butf8b
# cython: language_level=2
# Copyright (C) 2019-2022 Nexedi SA and Contributors.
# Copyright (C) 2019-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -43,6 +43,7 @@ In addition to Cython/nogil API, golang.pyx provides runtime for golang.py:
- Python-level channels are represented by pychan + pyselect.
- Python-level error is represented by pyerror.
- Python-level panic is represented by pypanic.
- Python-level strings are represented by pybstr/pyustr and pyb/pyu.
"""
......@@ -64,6 +65,9 @@ cdef extern from *:
# on the edge of Python/nogil world.
from libcpp.string cimport string # golang::string = std::string
cdef extern from "golang/libgolang.h" namespace "golang" nogil:
ctypedef unsigned char byte
ctypedef signed int rune # = int32
void panic(const char *)
const char *recover()
......@@ -265,4 +269,11 @@ cdef class pyerror(Exception):
cdef object from_error (error err) # -> pyerror | None
# strings
cpdef pyb(s) # -> bstr
cpdef pyu(s) # -> ustr
cdef __pystr(object obj)
cdef (rune, int) _utf8_decode_rune(const byte[::1] s)
cdef unicode _xunichr(rune i)
......@@ -3,7 +3,7 @@
# cython: binding=False
# cython: c_string_type=str, c_string_encoding=utf8
# distutils: language = c++
# distutils: depends = libgolang.h os/signal.h _golang_str.pyx
# distutils: depends = libgolang.h os/signal.h unicode/utf8.h _golang_str.pyx _golang_str_pickle.pyx
#
# Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
......
# -*- coding: utf-8 -*-
# Copyright (C) 2023-2025 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""_golang_str_pickle.pyx complements _golang_str.pyx and keeps everything
related to pickling strings.
It is included from _golang_str.pyx .
"""
if PY_MAJOR_VERSION >= 3:
import copyreg as pycopyreg
else:
import copy_reg as pycopyreg
cdef object zbinary # = zodbpickle.binary | None
try:
import zodbpickle
except ImportError:
zbinary = None
else:
zbinary = zodbpickle.binary
# support for pickling bstr/ustr as standalone types.
#
# pickling is organized in such a way that
# - what is saved by py2 can be loaded correctly on both py2/py3, and similarly
# - what is saved by py3 can be loaded correctly on both py2/py3 as well.
cdef _bstr__reduce_ex__(self, protocol):
# Ideally we want to emit bstr(BYTES), but BYTES is not available for
# protocol < 3. And for protocol < 3 emitting bstr(STRING) is not an
# option because plain py3 raises UnicodeDecodeError on loading arbitrary
# STRING data. However emitting bstr(UNICODE) works universally because
# pickle supports arbitrary unicode - including invalid unicode - out of
# the box and in exactly the same way on both py2 and py3. For the
# reference upstream py3 uses surrogatepass on encode/decode UNICODE data
# to achieve that.
if protocol < 3:
# use UNICODE for data
#
# explicitly mark to unpickle via _butf8b because with the introduction
# of UTF-8bk the way bstr decodes unicode will change, and so if we
# would use `bstr UNICODE` for pickling it will result in corrupt data
# to be loaded after the switch to UTF-8bk.
#
# TODO pickle via bstr UNICODE REDUCE/NEWOBJ after switch from UTF-8b to UTF-8bk.
udata = _utf8_decode_surrogateescape(self)
if self.__class__ is pybstr:
return (_butf8b, # _butf8b UNICODE REDUCE
(udata,))
else:
return (_butf8b, # _butf8b bstr UNICODE REDUCE
(self.__class__, udata))
else:
# use BYTES for data
bdata = _bdata(self)
if PY_MAJOR_VERSION < 3:
# the only way we can get here on py2 and protocol >= 3 is zodbpickle
# -> similarly to py3 save bdata as BYTES
assert zbinary is not None
bdata = zbinary(bdata)
return (
pycopyreg.__newobj__, # bstr BYTES NEWOBJ
(self.__class__, bdata))
cdef _ustr__reduce_ex__(self, protocol):
# emit ustr(UNICODE).
# TODO after UTF-8bk we might want to switch to emitting ustr(BYTES)
# even if we do this, it should be backward compatible
if protocol < 2:
return (self.__class__, (_udata(self),))# ustr UNICODE REDUCE
else:
return (pycopyreg.__newobj__, # ustr UNICODE NEWOBJ
(self.__class__, _udata(self)))
# `_butf8b [bcls] udata` serves unpickling of bstr pickled with data
# represented via UTF-8b decoded unicode.
def _butf8b(*argv):
cdef object bcls = pybstr
cdef object udata
cdef int l = len(argv)
if l == 1:
udata = argv[0]
elif l == 2:
bcls, udata = argv
else:
raise TypeError("_butf8b() takes 1 or 2 arguments; %d given" % l)
return _pyb(bcls, _utf8_encode_surrogateescape(udata))
_butf8b.__module__ = "golang"
# -*- coding: utf-8 -*-
# cython: language_level=2
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package strconv provides Go-compatible string conversions."""
from golang cimport byte
cpdef pyquote(s)
cdef bytes _quote(const byte[::1] s, char quote, bint* out_nonascii_escape) # -> (quoted, nonascii_escape)
# -*- coding: utf-8 -*-
# cython: language_level=2
# Copyright (C) 2018-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""_strconv.pyx implements strconv.pyx - see _strconv.pxd for package overview."""
from __future__ import print_function, absolute_import
import unicodedata, codecs
from golang cimport pyb, byte, rune
from golang cimport _utf8_decode_rune, _xunichr
from golang.unicode cimport utf8
from cpython cimport PyObject, _PyBytes_Resize
cdef extern from "Python.h":
PyObject* PyBytes_FromStringAndSize(char*, Py_ssize_t) except NULL
char* PyBytes_AS_STRING(PyObject*)
void Py_DECREF(PyObject*)
# quote quotes unicode|bytes string into valid "..." bytestring always quoted with ".
cpdef pyquote(s): # -> bstr
cdef bint _
q = _quote(pyb(s), '"', &_)
return pyb(q)
cdef char[16] hexdigit # = '0123456789abcdef'
for i, c in enumerate('0123456789abcdef'):
hexdigit[i] = ord(c)
# XXX not possible to use `except (NULL, False)`
# (https://stackoverflow.com/a/66335433/9456786)
cdef bytes _quote(const byte[::1] s, char quote, bint* out_nonascii_escape): # -> (quoted, nonascii_escape)
# 2*" + max(4)*each byte (+ 1 for tail \0 implicitly by PyBytesObject)
cdef Py_ssize_t qmaxsize = 1 + 4*len(s) + 1
cdef PyObject* qout = PyBytes_FromStringAndSize(NULL, qmaxsize)
cdef byte* q = <byte*>PyBytes_AS_STRING(qout)
cdef bint nonascii_escape = False
cdef Py_ssize_t i = 0, j
cdef Py_ssize_t isize
cdef int size
cdef rune r
cdef byte c
q[0] = quote; q += 1
while i < len(s):
c = s[i]
# fast path - ASCII only
if c < 0x80:
if c in (ord('\\'), quote):
q[0] = ord('\\')
q[1] = c
q += 2
# printable ASCII
elif 0x20 <= c <= 0x7e:
q[0] = c
q += 1
# non-printable ASCII
elif c == ord('\t'):
q[0] = ord('\\')
q[1] = ord('t')
q += 2
elif c == ord('\n'):
q[0] = ord('\\')
q[1] = ord('n')
q += 2
elif c == ord('\r'):
q[0] = ord('\\')
q[1] = ord('r')
q += 2
# everything else is non-printable
else:
q[0] = ord('\\')
q[1] = ord('x')
q[2] = hexdigit[c >> 4]
q[3] = hexdigit[c & 0xf]
q += 4
i += 1
# slow path - full UTF-8 decoding + unicodedata
else:
r, size = _utf8_decode_rune(s[i:])
isize = i + size
# decode error - just emit raw byte as escaped
if r == utf8.RuneError and size == 1:
nonascii_escape = True
q[0] = ord('\\')
q[1] = ord('x')
q[2] = hexdigit[c >> 4]
q[3] = hexdigit[c & 0xf]
q += 4
# printable utf-8 characters go as is
elif _unicodedata_category(_xunichr(r))[0] in 'LNPS': # letters, numbers, punctuation, symbols
for j in range(i, isize):
q[0] = s[j]
q += 1
# everything else goes in numeric byte escapes
else:
nonascii_escape = True
for j in range(i, isize):
c = s[j]
q[0] = ord('\\')
q[1] = ord('x')
q[2] = hexdigit[c >> 4]
q[3] = hexdigit[c & 0xf]
q += 4
i = isize
q[0] = quote; q += 1
q[0] = 0; # don't q++ at last because size does not include tail \0
cdef Py_ssize_t qsize = (q - <byte*>PyBytes_AS_STRING(qout))
assert qsize <= qmaxsize
_PyBytes_Resize(&qout, qsize)
bqout = <bytes>qout
Py_DECREF(qout)
out_nonascii_escape[0] = nonascii_escape
return bqout
# unquote decodes "-quoted unicode|byte string.
#
# ValueError is raised if there are quoting syntax errors.
def pyunquote(s): # -> bstr
us, tail = pyunquote_next(s)
if len(tail) != 0:
raise ValueError('non-empty tail after closing "')
return us
# unquote_next decodes next "-quoted unicode|byte string.
#
# it returns -> (unquoted(s), tail-after-")
#
# ValueError is raised if there are quoting syntax errors.
def pyunquote_next(s): # -> (bstr, bstr)
us, tail = _unquote_next(pyb(s))
return pyb(us), pyb(tail)
cdef _unquote_next(s):
assert isinstance(s, bytes)
if len(s) == 0 or s[0:0+1] != b'"':
raise ValueError('no starting "')
outv = []
emit= outv.append
s = s[1:]
while 1:
r, width = _utf8_decode_rune(s)
if width == 0:
raise ValueError('no closing "')
if r == ord('"'):
s = s[1:]
break
# regular UTF-8 character
if r != ord('\\'):
emit(s[:width])
s = s[width:]
continue
if len(s) < 2:
raise ValueError('unexpected EOL after \\')
c = s[1:1+1]
# \<c> -> <c> ; c = \ "
if c in b'\\"':
emit(c)
s = s[2:]
continue
# \t \n \r
uc = None
if c == b't': uc = b'\t'
elif c == b'n': uc = b'\n'
elif c == b'r': uc = b'\r'
# accept also \a \b \v \f that Go might produce
# Python also decodes those escapes even though it does not produce them:
# https://github.com/python/cpython/blob/2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L677-L688
elif c == b'a': uc = b'\x07'
elif c == b'b': uc = b'\x08'
elif c == b'v': uc = b'\x0b'
elif c == b'f': uc = b'\x0c'
if uc is not None:
emit(uc)
s = s[2:]
continue
# \x?? hex
if c == b'x': # XXX also handle octals?
if len(s) < 2+2:
raise ValueError('unexpected EOL after \\x')
b = codecs.decode(s[2:2+2], 'hex')
emit(b)
s = s[2+2:]
continue
raise ValueError('invalid escape \\%s' % chr(ord(c[0:0+1])))
return b''.join(outv), s
cdef _unicodedata_category = unicodedata.category
#ifndef _NXD_LIBGOLANG_FMT_H
#define _NXD_LIBGOLANG_FMT_H
// Copyright (C) 2019-2023 Nexedi SA and Contributors.
// Copyright (C) 2019-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -111,7 +111,7 @@ inline error errorf(const string& format, Argv... argv) {
// `const char *` overloads just to catch format mistakes as
// __attribute__(format) does not work with std::string.
LIBGOLANG_API string sprintf(const char *format, ...)
#ifndef _MSC_VER
#ifndef LIBGOLANG_CC_msc
__attribute__ ((format (printf, 1, 2)))
#endif
;
......
......@@ -169,6 +169,8 @@
// [1] Libtask: a Coroutine Library for C and Unix. https://swtch.com/libtask.
// [2] http://9p.io/magic/man2html/2/thread.
#include "golang/runtime/platform.h"
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
......@@ -177,21 +179,18 @@
#include <sys/stat.h>
#include <fcntl.h>
#ifdef _MSC_VER // no mode_t on msvc
#ifdef LIBGOLANG_CC_msc // no mode_t on msvc
typedef int mode_t;
#endif
// DSO symbols visibility (based on https://gcc.gnu.org/wiki/Visibility)
#if defined _WIN32 || defined __CYGWIN__
#ifdef LIBGOLANG_OS_windows
#define LIBGOLANG_DSO_EXPORT __declspec(dllexport)
#define LIBGOLANG_DSO_IMPORT __declspec(dllimport)
#elif __GNUC__ >= 4
#else
#define LIBGOLANG_DSO_EXPORT __attribute__ ((visibility ("default")))
#define LIBGOLANG_DSO_IMPORT __attribute__ ((visibility ("default")))
#else
#define LIBGOLANG_DSO_EXPORT
#define LIBGOLANG_DSO_IMPORT
#endif
#if BUILDING_LIBGOLANG
......@@ -438,6 +437,10 @@ constexpr Nil nil = nullptr;
// string is alias for std::string.
using string = std::string;
// byte/rune types related to string.
using byte = uint8_t;
using rune = int32_t;
// func is alias for std::function.
template<typename F>
using func = std::function<F>;
......
// Copyright (C) 2019-2023 Nexedi SA and Contributors.
// Copyright (C) 2019-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -38,7 +38,7 @@
// cut this short
// (on darwin sys_siglist declaration is normally provided)
// (on windows sys_siglist is not available at all)
#if !(defined(__APPLE__) || defined(_WIN32))
#if !(defined(LIBGOLANG_OS_darwin) || defined(LIBGOLANG_OS_windows))
extern "C" {
extern const char * const sys_siglist[];
}
......@@ -287,7 +287,7 @@ string Signal::String() const {
const Signal& sig = *this;
const char *sigstr = nil;
#ifdef _WIN32
#ifdef LIBGOLANG_OS_windows
switch (sig.signo) {
case SIGABRT: return "Aborted";
case SIGBREAK: return "Break";
......
#ifndef _NXD_LIBGOLANG_OS_H
#define _NXD_LIBGOLANG_OS_H
//
// Copyright (C) 2019-2023 Nexedi SA and Contributors.
// Copyright (C) 2019-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -96,7 +96,7 @@ private:
// Open opens file @path.
LIBGOLANG_API std::tuple<File, error> Open(const string &path, int flags = O_RDONLY,
mode_t mode =
#if !defined(_MSC_VER)
#if !defined(LIBGOLANG_CC_msc)
S_IRUSR | S_IWUSR | S_IXUSR |
S_IRGRP | S_IWGRP | S_IXGRP |
S_IROTH | S_IWOTH | S_IXOTH
......
// Copyright (C) 2021-2023 Nexedi SA and Contributors.
// Copyright (C) 2021-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -89,7 +89,7 @@
#include <atomic>
#include <tuple>
#if defined(_WIN32)
#if defined(LIBGOLANG_OS_windows)
# include <windows.h>
#endif
......@@ -101,7 +101,7 @@
# define debugf(format, ...) do {} while (0)
#endif
#if defined(_MSC_VER)
#ifdef LIBGOLANG_CC_msc
# define HAVE_SIGACTION 0
#else
# define HAVE_SIGACTION 1
......@@ -194,7 +194,7 @@ void _init() {
if (err != nil)
panic("os::newFile(_wakerx");
_waketx = vfd[1];
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
if (sys::Fcntl(_waketx, F_SETFL, O_NONBLOCK) < 0)
panic("fcntl(_waketx, O_NONBLOCK)"); // TODO +syserr
#else
......
# Copyright (C) 2019-2023 Nexedi SA and Contributors.
# Copyright (C) 2019-2024 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -212,9 +212,11 @@ def _with_build_defaults(name, kw): # -> (pygo, kw')
dependv = kw.get('depends', [])[:]
dependv.extend(['%s/golang/%s' % (pygo, _) for _ in [
'libgolang.h',
'runtime.h',
'runtime/internal.h',
'runtime/internal/atomic.h',
'runtime/internal/syscall.h',
'runtime/platform.h',
'context.h',
'cxx.h',
'errors.h',
......@@ -226,6 +228,7 @@ def _with_build_defaults(name, kw): # -> (pygo, kw')
'os.h',
'os/signal.h',
'pyx/runtime.h',
'unicode/utf8.h',
'_testing.h',
'_compat/windows/strings.h',
'_compat/windows/unistd.h',
......@@ -264,6 +267,8 @@ def Extension(name, sources, **kw):
'_fmt.pxd',
'io.pxd',
'_io.pxd',
'strconv.pxd',
'_strconv.pxd',
'strings.pxd',
'sync.pxd',
'_sync.pxd',
......@@ -274,6 +279,8 @@ def Extension(name, sources, **kw):
'os/signal.pxd',
'os/_signal.pxd',
'pyx/runtime.pxd',
'unicode/utf8.pxd',
'unicode/_utf8.pxd',
]])
kw['depends'] = dependv
......
// Copyright (C) 2023-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package runtime mirrors Go package runtime.
// See runtime.h for package overview.
#include "golang/runtime.h"
// golang::runtime::
namespace golang {
namespace runtime {
const string OS =
#ifdef LIBGOLANG_OS_linux
"linux"
#elif defined(LIBGOLANG_OS_darwin)
"darwin"
#elif defined(LIBGOLANG_OS_windows)
"windows"
#else
# error
#endif
;
const string CC =
#ifdef LIBGOLANG_CC_gcc
"gcc"
#elif defined(LIBGOLANG_CC_clang)
"clang"
#elif defined(LIBGOLANG_CC_msc)
"msc"
#else
# error
#endif
;
}} // golang::runtime::
#ifndef _NXD_LIBGOLANG_RUNTIME_H
#define _NXD_LIBGOLANG_RUNTIME_H
// Copyright (C) 2023-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package runtime mirrors Go package runtime.
#include "golang/libgolang.h"
// golang::runtime::
namespace golang {
namespace runtime {
// OS indicates the operating system that is running the program.
//
// e.g. "linux", "darwin", "windows", ...
extern LIBGOLANG_API const string OS;
// CC indicates the C/C++ compiler that compiled the program.
//
// e.g. "gcc", "clang", "msc", ...
extern LIBGOLANG_API const string CC;
}} // golang::runtime::
#endif // _NXD_LIBGOLANG_RUNTIME_H
......@@ -40,7 +40,7 @@ ELSE:
from gevent import sleep as pygsleep
from libc.stdint cimport uint8_t, uint64_t, UINT64_MAX
from libc.stdint cimport uint64_t, UINT64_MAX
cdef extern from *:
ctypedef bint cbool "bool"
......@@ -52,7 +52,7 @@ from golang.runtime._libgolang cimport _libgolang_runtime_ops, _libgolang_sema,
from golang.runtime.internal cimport syscall
from golang.runtime cimport _runtime_thread
from golang.runtime._runtime_pymisc cimport PyExc, pyexc_fetch, pyexc_restore
from golang cimport topyexc
from golang cimport byte, topyexc
from libc.stdlib cimport calloc, free
from libc.errno cimport EBADF
......@@ -351,7 +351,7 @@ cdef nogil:
cdef:
bint _io_read(IOH* ioh, int* out_n, void *buf, size_t count):
pygfobj = <object>ioh.pygfobj
cdef uint8_t[::1] mem = <uint8_t[:count]>buf
cdef byte[::1] mem = <byte[:count]>buf
xmem = memoryview(mem) # to avoid https://github.com/cython/cython/issues/3900 on mem[:0]=b''
try:
# NOTE buf might be on stack, so it must not be accessed, e.g. from
......@@ -388,7 +388,7 @@ cdef nogil:
cdef:
bint _io_write(IOH* ioh, int* out_n, const void *buf, size_t count):
pygfobj = <object>ioh.pygfobj
cdef const uint8_t[::1] mem = <const uint8_t[:count]>buf
cdef const byte[::1] mem = <const byte[:count]>buf
# NOTE buf might be on stack, so it must not be accessed, e.g. from
# FileObjectThread, while our greenlet is parked (see STACK_DEAD_WHILE_PARKED
......
// Copyright (C) 2022-2023 Nexedi SA and Contributors.
// Copyright (C) 2022-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -20,7 +20,7 @@
#include "golang/runtime/internal/atomic.h"
#include "golang/libgolang.h"
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
#include <pthread.h>
#endif
......@@ -44,7 +44,7 @@ static void _forkNewEpoch() {
void _init() {
// there is no fork on windows
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
int e = pthread_atfork(/*prepare*/nil, /*inparent*/nil, /*inchild*/_forkNewEpoch);
if (e != 0)
panic("pthread_atfork failed");
......
// Copyright (C) 2021-2023 Nexedi SA and Contributors.
// Copyright (C) 2021-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -58,9 +58,9 @@ string _Errno::Error() {
char ebuf[128];
bool ok;
#if __APPLE__
#ifdef LIBGOLANG_OS_darwin
ok = (::strerror_r(-e.syserr, ebuf, sizeof(ebuf)) == 0);
#elif defined(_WIN32)
#elif defined(LIBGOLANG_OS_windows)
ok = (::strerror_s(ebuf, sizeof(ebuf), -e.syserr) == 0);
#else
char *estr = ::strerror_r(-e.syserr, ebuf, sizeof(ebuf));
......@@ -102,7 +102,7 @@ __Errno Close(int fd) {
return err;
}
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
__Errno Fcntl(int fd, int cmd, int arg) {
int save_errno = errno;
int err = ::fcntl(fd, cmd, arg);
......@@ -124,7 +124,7 @@ __Errno Fstat(int fd, struct ::stat *out_st) {
int Open(const char *path, int flags, mode_t mode) {
int save_errno = errno;
#ifdef _WIN32 // default to open files in binary mode
#ifdef LIBGOLANG_OS_windows // default to open files in binary mode
if ((flags & (_O_TEXT | _O_BINARY)) == 0)
flags |= _O_BINARY;
#endif
......@@ -141,9 +141,9 @@ __Errno Pipe(int vfd[2], int flags) {
return -EINVAL;
int save_errno = errno;
int err;
#ifdef __linux__
#ifdef LIBGOLANG_OS_linux
err = ::pipe2(vfd, flags);
#elif defined(_WIN32)
#elif defined(LIBGOLANG_OS_windows)
err = ::_pipe(vfd, 4096, flags | _O_BINARY);
#else
err = ::pipe(vfd);
......@@ -167,7 +167,7 @@ out:
return err;
}
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
__Errno Sigaction(int signo, const struct ::sigaction *act, struct ::sigaction *oldact) {
int save_errno = errno;
int err = ::sigaction(signo, act, oldact);
......
#ifndef _NXD_LIBGOLANG_RUNTIME_INTERNAL_SYSCALL_H
#define _NXD_LIBGOLANG_RUNTIME_INTERNAL_SYSCALL_H
// Copyright (C) 2021-2023 Nexedi SA and Contributors.
// Copyright (C) 2021-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
......@@ -63,13 +63,13 @@ LIBGOLANG_API int/*n|err*/ Read(int fd, void *buf, size_t count);
LIBGOLANG_API int/*n|err*/ Write(int fd, const void *buf, size_t count);
LIBGOLANG_API __Errno Close(int fd);
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
LIBGOLANG_API __Errno Fcntl(int fd, int cmd, int arg);
#endif
LIBGOLANG_API __Errno Fstat(int fd, struct ::stat *out_st);
LIBGOLANG_API int/*fd|err*/ Open(const char *path, int flags, mode_t mode);
LIBGOLANG_API __Errno Pipe(int vfd[2], int flags);
#ifndef _WIN32
#ifndef LIBGOLANG_OS_windows
LIBGOLANG_API __Errno Sigaction(int signo, const struct ::sigaction *act, struct ::sigaction *oldact);
#endif
typedef void (*sighandler_t)(int);
......
......@@ -52,7 +52,7 @@
#include <linux/list.h>
// MSVC does not support statement expressions and typeof
// -> redo list_entry via C++ lambda.
#ifdef _MSC_VER
#ifdef LIBGOLANG_CC_msc
# undef list_entry
# define list_entry(ptr, type, member) [&]() { \
const decltype( ((type *)0)->member ) *__mptr = (ptr); \
......
#ifndef _NXD_LIBGOLANG_RUNTIME_PLATFORM_H
#define _NXD_LIBGOLANG_RUNTIME_PLATFORM_H
// Copyright (C) 2023-2024 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Header platform.h provides preprocessor defines that describe target platform.
// LIBGOLANG_OS_<X> is defined on operating system X.
//
// List of supported operating systems: linux, darwin, windows.
#ifdef __linux__
# define LIBGOLANG_OS_linux 1
#elif defined(__APPLE__)
# define LIBGOLANG_OS_darwin 1
#elif defined(_WIN32) || defined(__CYGWIN__)
# define LIBGOLANG_OS_windows 1
#else
# error "unsupported operating system"
#endif
// LIBGOLANG_CC_<X> is defined on C/C++ compiler X.
//
// List of supported compilers: gcc, clang, msc.
#ifdef __clang__
# define LIBGOLANG_CC_clang 1
#elif defined(_MSC_VER)
# define LIBGOLANG_CC_msc 1
// NOTE gcc comes last because e.g. clang and icc define __GNUC__ as well
#elif defined(__GNUC__)
# define LIBGOLANG_CC_gcc 1
#else
# error "unsupported compiler"
#endif
#endif // _NXD_LIBGOLANG_RUNTIME_PLATFORM_H
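For readers who know the Python side better, the OS classification done by the preprocessor checks in platform.h can be mirrored roughly as below. This is only an illustrative sketch: `os_name` is a hypothetical helper and is not part of pygolang. It assumes the usual `sys.platform` values (`linux`/`linux2`, `darwin`, `win32`, `cygwin`).

```python
import sys

def os_name(platform=None):
    """Map a sys.platform-style string to "linux", "darwin" or "windows".

    Hypothetical helper that mirrors the LIBGOLANG_OS_<X> selection:
    cygwin is treated as windows, anything else is rejected.
    """
    if platform is None:
        platform = sys.platform
    if platform.startswith('linux'):        # 'linux' on py3, 'linux2' on py2
        return 'linux'
    if platform == 'darwin':
        return 'darwin'
    if platform in ('win32', 'cygwin'):
        return 'windows'
    raise RuntimeError('unsupported operating system: %r' % (platform,))
```

As in the header, an unknown platform is an error rather than a silent default.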
# cython: language_level=2
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package strconv provides Go-compatible string conversions.
See _strconv.pxd for package documentation.
"""
# redirect cimport: golang.strconv -> golang._strconv (see __init__.pxd for rationale)
from golang._strconv cimport *
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2022 Nexedi SA and Contributors.
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -21,174 +21,7 @@
from __future__ import print_function, absolute_import
import unicodedata, codecs
from six import text_type as unicode # py2: unicode py3: str
from six.moves import range as xrange
from golang import b, u
from golang._golang import _py_utf8_decode_rune as _utf8_decode_rune, _py_rune_error as _rune_error, _xunichr
# _bstr is like b but also returns whether input was unicode.
def _bstr(s): # -> sbytes, wasunicode
return b(s), isinstance(s, unicode)
# _ustr is like u but also returns whether input was bytes.
def _ustr(s): # -> sunicode, wasbytes
return u(s), isinstance(s, bytes)
# quote quotes unicode|bytes string into valid "..." unicode|bytes string always quoted with ".
def quote(s):
s, wasunicode = _bstr(s)
qs = _quote(s)
if wasunicode:
qs, _ = _ustr(qs)
return qs
def _quote(s):
assert isinstance(s, bytes)
outv = []
emit = outv.append
i = 0
while i < len(s):
c = s[i:i+1]
# fast path - ASCII only
if ord(c) < 0x80:
if c in b'\\"':
emit(b'\\'+c)
# printable ASCII
elif b' ' <= c <= b'\x7e':
emit(c)
# non-printable ASCII
elif c == b'\t':
emit(br'\t')
elif c == b'\n':
emit(br'\n')
elif c == b'\r':
emit(br'\r')
# everything else is non-printable
else:
emit(br'\x%02x' % ord(c))
i += 1
# slow path - full UTF-8 decoding + unicodedata
else:
r, size = _utf8_decode_rune(s[i:])
isize = i + size
# decode error - just emit raw byte as escaped
if r == _rune_error and size == 1:
emit(br'\x%02x' % ord(c))
# printable utf-8 characters go as is
elif unicodedata.category(_xunichr(r))[0] in _printable_cat0:
emit(s[i:isize])
# everything else goes in numeric byte escapes
else:
for j in xrange(i, isize):
emit(br'\x%02x' % ord(s[j:j+1]))
i = isize
return b'"' + b''.join(outv) + b'"'
# unquote decodes "-quoted unicode|byte string.
#
# ValueError is raised if there are quoting syntax errors.
def unquote(s):
us, tail = unquote_next(s)
if len(tail) != 0:
raise ValueError('non-empty tail after closing "')
return us
# unquote_next decodes next "-quoted unicode|byte string.
#
# it returns -> (unquoted(s), tail-after-")
#
# ValueError is raised if there are quoting syntax errors.
def unquote_next(s):
s, wasunicode = _bstr(s)
us, tail = _unquote_next(s)
if wasunicode:
us, _ = _ustr(us)
tail, _ = _ustr(tail)
return us, tail
def _unquote_next(s):
assert isinstance(s, bytes)
if len(s) == 0 or s[0:0+1] != b'"':
raise ValueError('no starting "')
outv = []
emit= outv.append
s = s[1:]
while 1:
r, width = _utf8_decode_rune(s)
if width == 0:
raise ValueError('no closing "')
if r == ord('"'):
s = s[1:]
break
# regular UTF-8 character
if r != ord('\\'):
emit(s[:width])
s = s[width:]
continue
if len(s) < 2:
raise ValueError('unexpected EOL after \\')
c = s[1:1+1]
# \<c> -> <c> ; c = \ "
if c in b'\\"':
emit(c)
s = s[2:]
continue
# \t \n \r
uc = None
if c == b't': uc = b'\t'
elif c == b'n': uc = b'\n'
elif c == b'r': uc = b'\r'
# accept also \a \b \v \f that Go might produce
# Python also decodes those escapes even though it does not produce them:
# https://github.com/python/cpython/blob/2.7.18-0-g8d21aa21f2c/Objects/stringobject.c#L677-L688
elif c == b'a': uc = b'\x07'
elif c == b'b': uc = b'\x08'
elif c == b'v': uc = b'\x0b'
elif c == b'f': uc = b'\x0c'
if uc is not None:
emit(uc)
s = s[2:]
continue
# \x?? hex
if c == b'x': # XXX also handle octals?
if len(s) < 2+2:
raise ValueError('unexpected EOL after \\x')
b = codecs.decode(s[2:2+2], 'hex')
emit(b)
s = s[2+2:]
continue
raise ValueError('invalid escape \\%s' % chr(ord(c[0:0+1])))
return b''.join(outv), s
_printable_cat0 = frozenset(['L', 'N', 'P', 'S']) # letters, numbers, punctuation, symbols
from golang._strconv import \
pyquote as quote, \
pyunquote as unquote, \
pyunquote_next as unquote_next
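The escape handling of the removed pure-Python `_unquote_next` can be condensed into a table-driven sketch. The version below is an illustration only, not the Cython implementation that replaces it: `unquote_next_sketch` and `_SIMPLE_ESCAPES` are hypothetical names, it works on plain `bytes` under Python 3, and it scans byte by byte instead of decoding runes, which is assumed to be equivalent here because `"` and `\` never occur inside a multi-byte UTF-8 sequence.

```python
import binascii

# single-character escapes accepted after a backslash
_SIMPLE_ESCAPES = {
    b'\\': b'\\', b'"': b'"',
    b't': b'\t', b'n': b'\n', b'r': b'\r',
    b'a': b'\x07', b'b': b'\x08', b'v': b'\x0b', b'f': b'\x0c',
}

def unquote_next_sketch(s):
    """Decode a leading "-quoted bytes string -> (unquoted, tail-after-")."""
    if s[:1] != b'"':
        raise ValueError('no starting "')
    out = []
    i = 1
    while True:
        if i >= len(s):
            raise ValueError('no closing "')
        c = s[i:i+1]
        if c == b'"':                       # closing quote ends the string
            return b''.join(out), s[i+1:]
        if c != b'\\':                      # regular byte goes as is
            out.append(c)
            i += 1
            continue
        if i + 1 >= len(s):
            raise ValueError('unexpected EOL after \\')
        e = s[i+1:i+2]
        if e in _SIMPLE_ESCAPES:
            out.append(_SIMPLE_ESCAPES[e])
            i += 2
        elif e == b'x':                     # \xHH hex escape
            hexdigits = s[i+2:i+4]
            if len(hexdigits) < 2:
                raise ValueError('unexpected EOL after \\x')
            out.append(binascii.unhexlify(hexdigits))
            i += 4
        else:
            raise ValueError('invalid escape \\%s' % e.decode('latin-1'))
```

Unlike the real `unquote_next`, the sketch neither accepts unicode input nor returns `bstr`.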
# -*- coding: utf-8 -*-
# Copyright (C) 2018-2022 Nexedi SA and Contributors.
# Copyright (C) 2018-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
......@@ -20,12 +20,16 @@
from __future__ import print_function, absolute_import
from golang import bstr
from golang.strconv import quote, unquote, unquote_next
from golang.gcompat import qq
from six import int2byte as bchr, PY3
from six import int2byte as bchr
from six.moves import range as xrange
from pytest import raises
from pytest import raises, mark
import codecs
def byterange(start, stop):
b = b""
......@@ -34,16 +38,9 @@ def byterange(start, stop):
return b
# asstr converts unicode|bytes to str type of current python.
def asstr(s):
if PY3:
if isinstance(s, bytes):
s = s.decode('utf-8')
# PY2
else:
if isinstance(s, unicode):
s = s.encode('utf-8')
return s
def assert_bstreq(x, y):
assert type(x) is bstr
assert x == y
def test_quote():
testv = (
......@@ -72,6 +69,9 @@ def test_quote():
(u'\ufffd', u'�'),
)
# quote/unquote* always give bstr
BEQ = assert_bstreq
for tin, tquoted in testv:
# quote(in) == quoted
# in = unquote(quoted)
......@@ -79,14 +79,13 @@ def test_quote():
tail = b'123' if isinstance(tquoted, bytes) else '123'
tquoted = q + tquoted + q # add lead/trail "
assert quote(tin) == tquoted
assert unquote(tquoted) == tin
assert unquote_next(tquoted) == (tin, type(tin)())
assert unquote_next(tquoted + tail) == (tin, tail)
BEQ(quote(tin), tquoted)
BEQ(unquote(tquoted), tin)
_, __ = unquote_next(tquoted); BEQ(_, tin); BEQ(__, "")
_, __ = unquote_next(tquoted + tail); BEQ(_, tin); BEQ(__, tail)
with raises(ValueError): unquote(tquoted + tail)
# qq always gives str
assert qq(tin) == asstr(tquoted)
BEQ(qq(tin), tquoted)
# also check how it works on complementary unicode/bytes input type
if isinstance(tin, bytes):
......@@ -103,14 +102,13 @@ def test_quote():
tquoted = tquoted.encode('utf-8')
tail = tail.encode('utf-8')
assert quote(tin) == tquoted
assert unquote(tquoted) == tin
assert unquote_next(tquoted) == (tin, type(tin)())
assert unquote_next(tquoted + tail) == (tin, tail)
BEQ(quote(tin), tquoted)
BEQ(unquote(tquoted), tin)
_, __ = unquote_next(tquoted); BEQ(_, tin); BEQ(__, "")
_, __ = unquote_next(tquoted + tail); BEQ(_, tin); BEQ(__, tail)
with raises(ValueError): unquote(tquoted + tail)
# qq always gives str
assert qq(tin) == asstr(tquoted)
BEQ(qq(tin), tquoted)
# verify that non-canonical quotation can be unquoted too.
......@@ -143,3 +141,52 @@ def test_unquote_bad():
with raises(ValueError) as exc:
unquote(tin)
assert exc.value.args == (err,)
# ---- benchmarks ----
# quoting + unquoting
uchar_testv = ['a', # ascii
u'α', # 2-bytes utf8
u'\u65e5', # 3-bytes utf8
u'\U0001f64f'] # 4-bytes utf8
@mark.parametrize('ch', uchar_testv)
def bench_quote(b, ch):
s = bstr_ch1000(ch)
q = quote
for i in xrange(b.N):
q(s)
def bench_stdquote(b):
s = b'a'*1000
q = repr
for i in xrange(b.N):
q(s)
@mark.parametrize('ch', uchar_testv)
def bench_unquote(b, ch):
s = bstr_ch1000(ch)
s = quote(s)
unq = unquote
for i in xrange(b.N):
unq(s)
def bench_stdunquote(b):
s = b'"' + b'a'*1000 + b'"'
escape_decode = codecs.escape_decode
def unq(s): return escape_decode(s[1:-1])[0]
for i in xrange(b.N):
unq(s)
# bstr_ch1000 returns bstr with many repetitions of character ch occupying ~ 1000 bytes.
def bstr_ch1000(ch): # -> bstr
assert len(ch) == 1
s = bstr(ch)
s = s * (1000 // len(s))
if len(s) % 3 == 0:
s += 'x'
assert len(s) == 1000
return s
......@@ -18,7 +18,7 @@
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""This program helps to verify _pystr and _pyunicode.
"""This program helps to verify b, u and underlying bstr and ustr.
It complements golang_str_test.test_strings_print.
"""
......@@ -31,8 +31,17 @@ from golang.gcompat import qq
def main():
sb = b("привет αβγ b")
su = u("привет αβγ u")
print("print(b):", sb)
print("print(u):", su)
print("print(qq(b)):", qq(sb))
print("print(qq(u)):", qq(su))
print("print(repr(b)):", repr(sb))
print("print(repr(u)):", repr(su))
# py2: print(dict) calls PyObject_Print(flags=0) for both keys and values,
# not with flags=Py_PRINT_RAW used by default almost everywhere else.
# this way we can verify whether bstr.tp_print handles flags correctly.
print("print({b: u}):", {sb: su})
if __name__ == '__main__':
......
print(b): привет αβγ b
print(u): привет αβγ u
print(qq(b)): "привет αβγ b"
print(qq(u)): "привет αβγ u"
print(repr(b)): b('привет αβγ b')
print(repr(u)): u('привет αβγ u')
print({b: u}): {b('привет αβγ b'): u('привет αβγ u')}
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (C) 2022-2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""This program helps to verify [:] handling for bstr and ustr.
It complements golang_str_test.test_strings_index2.
It needs to verify [:] only lightly because thorough verification is done in
test_string_index, and here we need to verify only that __getslice__, inherited
from builtin str/unicode, does not get in our way.
"""
from __future__ import print_function, absolute_import
from golang import b, u, bstr, ustr
from golang.gcompat import qq
def main():
us = u("миру мир")
bs = b("миру мир")
def emit(what, uobj, bobj):
assert type(uobj) is ustr
assert type(bobj) is bstr
print("u"+what, qq(uobj))
print("b"+what, qq(bobj))
emit("s", us, bs)
emit("s[:]", us[:], bs[:])
emit("s[0:1]", us[0:1], bs[0:1])
emit("s[0:2]", us[0:2], bs[0:2])
emit("s[1:2]", us[1:2], bs[1:2])
emit("s[0:-1]", us[0:-1], bs[0:-1])
if __name__ == '__main__':
main()
us "миру мир"
bs "миру мир"
us[:] "миру мир"
bs[:] "миру мир"
us[0:1] "м"
bs[0:1] "\xd0"
us[0:2] "ми"
bs[0:2] "м"
us[1:2] "и"
bs[1:2] "\xbc"
us[0:-1] "миру ми"
bs[0:-1] "миру ми\xd1"
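The expected output above follows from `ustr` being indexed by code point and `bstr` by byte. The same arithmetic can be checked with the builtin Python 3 `str` and `bytes` types, used here only as stand-ins for `ustr` and `bstr`:

```python
# -*- coding: utf-8 -*-
text = u"миру мир"
us = text                    # unicode string: indexed by code point
bs = text.encode('utf-8')    # byte string: indexed by byte

# 'м' is U+043C, which UTF-8 encodes as the two bytes D0 BC
first_char = us[0:1]         # one code point
first_byte = bs[0:1]         # only the lead byte of 'м'
first_two  = bs[0:2]         # both bytes of 'м'
second     = bs[1:2]         # only the continuation byte of 'м'

# dropping the last byte cuts the trailing 'р' (D1 80) in half
cut_tail   = bs[0:-1]
```

With plain `bytes` the half-cut characters cannot be decoded as UTF-8, while `bstr` still prints them as `\xd0`, `\xbc` and `\xd1` escapes, as the expected output shows.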
# cython: language_level=2
# Copyright (C) 2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package utf8 mirrors Go package utf8.
See https://golang.org/pkg/unicode/utf8 for Go utf8 package documentation.
"""
from golang cimport rune
cdef extern from "golang/unicode/utf8.h" namespace "golang::unicode::utf8" nogil:
rune RuneError
#ifndef _NXD_LIBGOLANG_UNICODE_UTF8_H
#define _NXD_LIBGOLANG_UNICODE_UTF8_H
// Copyright (C) 2023 Nexedi SA and Contributors.
// Kirill Smelkov <kirr@nexedi.com>
//
// This program is free software: you can Use, Study, Modify and Redistribute
// it under the terms of the GNU General Public License version 3, or (at your
// option) any later version, as published by the Free Software Foundation.
//
// You can also Link and Combine this program with other software covered by
// the terms of any of the Free Software licenses or any of the Open Source
// Initiative approved licenses and Convey the resulting work. Corresponding
// source of such a combination shall include the source code for all other
// software used.
//
// This program is distributed WITHOUT ANY WARRANTY; without even the implied
// warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
//
// See COPYING file for full licensing terms.
// See https://www.nexedi.com/licensing for rationale and options.
// Package utf8 mirrors Go package utf8.
#include <golang/libgolang.h>
// golang::unicode::utf8::
namespace golang {
namespace unicode {
namespace utf8 {
constexpr rune RuneError = 0xFFFD; // unicode replacement character
}}} // golang::unicode::utf8::
#endif // _NXD_LIBGOLANG_UNICODE_UTF8_H
# cython: language_level=2
# Copyright (C) 2023 Nexedi SA and Contributors.
# Kirill Smelkov <kirr@nexedi.com>
#
# This program is free software: you can Use, Study, Modify and Redistribute
# it under the terms of the GNU General Public License version 3, or (at your
# option) any later version, as published by the Free Software Foundation.
#
# You can also Link and Combine this program with other software covered by
# the terms of any of the Free Software licenses or any of the Open Source
# Initiative approved licenses and Convey the resulting work. Corresponding
# source of such a combination shall include the source code for all other
# software used.
#
# This program is distributed WITHOUT ANY WARRANTY; without even the implied
# warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
#
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
"""Package utf8 mirrors Go package utf8.
See _utf8.pxd for package documentation.
"""
# redirect cimport: golang.unicode.utf8 -> golang.unicode._utf8 (see __init__.pxd for rationale)
from golang.unicode._utf8 cimport *
......@@ -71,6 +71,12 @@ def test_golang_builtins():
assert error is golang.error
assert b is golang.b
assert u is golang.u
assert bstr is golang.bstr
assert ustr is golang.ustr
assert biter is golang.biter
assert uiter is golang.uiter
assert bbyte is golang.bbyte
assert uchr is golang.uchr
# indirectly verify golang.__all__
for k in golang.__all__:
......
......@@ -19,6 +19,25 @@
# See COPYING file for full licensing terms.
# See https://www.nexedi.com/licensing for rationale and options.
# patch cython to allow `cdef class X(bytes)` while building pygolang to
# workaround https://github.com/cython/cython/issues/711
# see `cdef class pybstr` in golang/_golang_str.pyx for details.
# (should become unneeded with cython 3 once https://github.com/cython/cython/pull/5212 is finished)
import inspect
from Cython.Compiler.PyrexTypes import BuiltinObjectType
def pygo_cy_builtin_type_name_set(self, v):
self._pygo_name = v
def pygo_cy_builtin_type_name_get(self):
name = self._pygo_name
if name == 'bytes':
caller = inspect.currentframe().f_back.f_code.co_name
if caller == 'analyse_declarations':
# need anything different from 'bytes' to deactivate check in
# https://github.com/cython/cython/blob/c21b39d4/Cython/Compiler/Nodes.py#L4759-L4762
name = 'xxx'
return name
BuiltinObjectType.name = property(pygo_cy_builtin_type_name_get, pygo_cy_builtin_type_name_set)
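The patch above relies on a property whose getter inspects the name of the calling function. The mechanism can be tried in isolation with the sketch below; `TypeInfo`, `other_caller` and the local `analyse_declarations` are hypothetical stand-ins and do not touch Cython.

```python
import inspect

class TypeInfo(object):
    """Hypothetical stand-in for the class whose `name` attribute is patched."""
    def __init__(self, name):
        self.name = name            # goes through the property setter below

def _name_set(self, value):
    self._real_name = value

def _name_get(self):
    name = self._real_name
    # f_back is the frame of the function that read the attribute
    caller = inspect.currentframe().f_back.f_code.co_name
    if name == 'bytes' and caller == 'analyse_declarations':
        return 'xxx'                # hide the real name from this one caller
    return name

TypeInfo.name = property(_name_get, _name_set)

def analyse_declarations(t):        # the caller that must not see 'bytes'
    return t.name

def other_caller(t):                # every other caller sees the real name
    return t.name
```

Matching on the caller's function name keeps the override narrow, at the cost of depending on an internal name that may change between versions.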
from setuptools import find_packages
from setuptools.command.install_scripts import install_scripts as _install_scripts
from setuptools.command.develop import develop as _develop
......@@ -166,7 +185,8 @@ for pkg in R:
R['all'] = Rall
# ipython/pytest are required to test py2 integration patches
R['all_test'] = Rall.union(['ipython', 'pytest']) # pip does not like "+" in all+test
# zodbpickle is used to test pickle support for bstr/ustr
R['all_test'] = Rall.union(['ipython', 'pytest', 'zodbpickle']) # pip does not like "+" in all+test
# extras_require <- R
extras_require = {}
......@@ -207,6 +227,7 @@ setup(
['golang/runtime/libgolang.cpp',
'golang/runtime/internal/atomic.cpp',
'golang/runtime/internal/syscall.cpp',
'golang/runtime.cpp',
'golang/context.cpp',
'golang/errors.cpp',
'golang/fmt.cpp',
......@@ -218,9 +239,11 @@ setup(
'golang/time.cpp'],
depends = [
'golang/libgolang.h',
'golang/runtime.h',
'golang/runtime/internal.h',
'golang/runtime/internal/atomic.h',
'golang/runtime/internal/syscall.h',
'golang/runtime/platform.h',
'golang/context.h',
'golang/cxx.h',
'golang/errors.h',
......@@ -249,7 +272,9 @@ setup(
ext_modules = [
Ext('golang._golang',
['golang/_golang.pyx'],
depends = ['golang/_golang_str.pyx']),
depends = [
'golang/_golang_str.pyx',
'golang/_golang_str_pickle.pyx']),
Ext('golang.runtime._runtime_thread',
['golang/runtime/_runtime_thread.pyx']),
......@@ -301,6 +326,9 @@ setup(
Ext('golang.os._signal',
['golang/os/_signal.pyx']),
Ext('golang._strconv',
['golang/_strconv.pyx']),
Ext('golang._strings_test',
['golang/_strings_test.pyx',
'golang/strings_test.cpp']),
......