Skip to content
Projects
Groups
Snippets
Help
Loading...
Help
Support
Keyboard shortcuts
?
Submit feedback
Contribute to GitLab
Sign in / Register
Toggle navigation
C
cython
Project overview
Project overview
Details
Activity
Releases
Repository
Repository
Files
Commits
Branches
Tags
Contributors
Graph
Compare
Labels
Merge Requests
0
Merge Requests
0
Analytics
Analytics
Repository
Value Stream
Snippets
Snippets
Members
Members
Collapse sidebar
Close sidebar
Activity
Graph
Commits
Open sidebar
nexedi
cython
Commits
a670d2ac
Commit
a670d2ac
authored
Jan 25, 2014
by
Stefan Behnel
Browse files
Options
Browse Files
Download
Email Patches
Plain Diff
first start at a string tutorial section on accepting strings from Python code
parent
6722061c
Changes
1
Hide whitespace changes
Inline
Side-by-side
Showing
1 changed file
with
98 additions
and
0 deletions
+98
-0
docs/src/tutorial/strings.rst
docs/src/tutorial/strings.rst
+98
-0
No files found.
docs/src/tutorial/strings.rst
View file @
a670d2ac
...
...
@@ -170,6 +170,104 @@ object buffer after converting it to a C string pointer. These
modifications can change the internal buffer address, which will make
the pointer invalid.
Accepting strings from Python code
----------------------------------
The other side, receiving input from Python code, may appear simple
at first sight, as it only deals with objects. However, getting this
right without making the API too narrow or too unsafe may not be
entirely obvious.
In the case that the API only deals with byte strings, i.e. binary
data or encoded text, it is best not to type the input argument as
something like :obj:`bytes`, because that would restrict the allowed
input to exactly that type and exclude both subtypes and other kinds
of byte containers, e.g. :obj:`bytearray` objects or memory views.
Depending on how (and where) the data is being processed, it may be a
good idea to instead receive a 1-dimensional memory view, e.g.
def process_byte_data(unsigned char[:] data):
length = data.shape[0]
first_byte = data[0]
byte_slice = data[1:-1]
...
Cython's memory views are described in more detail in :doc:`memoryviews`,
but the above example already shows most of the relevant functionality
for 1-dimensional byte views. They allow for efficient processing of
arrays and accept anything that can unpack itself into a byte buffer,
without intermediate copying. The processed content can finally be
returned in the memory view itself (or a slice of it), but it is
often better to copy the data back into a :obj:`bytes` or :obj:`bytearray`
object, especially when only a small slice is returned (as the memoryview
would otherwise keep the entire original buffer alive). This can simply
be done as follows::
def process_byte_data(unsigned char[:] data):
# ... process the data
if return_all:
return bytes(data)
else:
# example for returning a slice
return bytes(data[5:35])
If the byte input is actually encoded text, and the further processing
should happen at the Unicode level, then the right thing to do is to
decode the input straight away. This is almost only a problem in Python
2.x, where Python code expects that it can pass a byte string (:obj:`str`)
with encoded text into a text API. Since this usually happens in more
than one place in the module's API, a helper function is almost always the
way to go, since it allows for easy adaptation of the input normalisation
process later.
This kind of input normalisation function will commonly look similar to
the following::
from cpython.version cimport PY_MAJOR_VERSION
cdef unicode _ustring(s):
if type(s) is unicode:
# fast path for most common case(s)
return <unicode>s
elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
# only accept byte strings in Python 2.x, not in Py3
return (<bytes>s).decode('ascii')
elif isinstance(s, unicode):
# an evil cast to <unicode> might work here in some(!) cases,
# depending on what the further processing does. to be safe,
# we can always create a copy instead
return unicode(s)
else:
raise TypeError(...)
And should then be used like this::
def api_func(s):
text = _ustring(s)
...
Similarly, if the further processing happens at the byte level, but Unicode
string input should be accepted, then the following might work, if you are
using memory views::
# define a global name for whatever char type is used in the module
ctypedef unsigned char char_type
cdef char_type[:] _chars(s):
if isinstance(s, unicode):
# encode to the specific encoding used inside of the module
s = (<unicode>s).encode('utf8')
return s
In this case, you might want to additionally ensure that byte string
input really uses the correct encoding, e.g. if you require pure ASCII
input data, you can run over the buffer in a loop and check the highest
bit of each byte. This should then also be done in the input normalisation
function.
Dealing with "const"
--------------------
...
...
Write
Preview
Markdown
is supported
0%
Try again
or
attach a new file
Attach a file
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Cancel
Please
register
or
sign in
to comment