1. 30 Sep, 2024 1 commit
    • Kirill Smelkov's avatar
      *: interface{} -> any (#81) · b17a70e0
      Kirill Smelkov authored
      Since 64d5926e (Drop support for Go < 1.18) and 3bf6c92d (Add PyDict
      mode) ogórek started to require go ≥ 1.18 that added generics. That
      release also added `any` type alias for interface{} to increase
      signal/noise ratio in the programs where interface{} is frequently used.
      
      Any is aliased to interface{} similarly to how rune is aliased to
      int32, and so any and interface{} are fully interchangeable, just any
      reads more shorter and concise:
      
          https://pkg.go.dev/builtin#any
      
      -> Replace interface{} with any throughout the whole codebase for the
         sources to be read more well.
      
      The replacement was done via running the following sed script:
      
          $ sed -i -e 's/interface{}/any/g' *.go
      b17a70e0
  2. 24 Sep, 2024 2 commits
  3. 23 Sep, 2024 3 commits
  4. 19 Sep, 2024 4 commits
    • Kirill Smelkov's avatar
      Add PyDict mode · 3bf6c92d
      Kirill Smelkov authored
      Similarly to StrictUnicode mode (see b28613c2) add new opt-in mode that
      requests to decode Python dicts as ogórek.Dict instead of builtin map.
      As explained in recent patch "Add custom Dict that mirrors Python dict
      behaviour" this is needed to fix decoding issues that can be there due
      to different behaviour of Python dict and builtin Go map:
      
          ---- 8< ----
          Ogórek currently represents unpickled dict via map[any]any, which is
          logical, but also exhibits issues because builtin Go map behaviour is
          different from Python's dict behaviour. For example:
      
          - Python's dict allows tuples to be used in keys, while Go map
            does not (https://github.com/kisielk/og-rek/issues/50),
      
          - Python's dict allows both long and int to be used interchangeable as
            keys, while Go map does not handle *big.Int as key with the same
            semantic (https://github.com/kisielk/og-rek/issues/55)
      
          - Python's dict allows to use numbers interchangeable in keys - all int
            and float, but on Go side int(1) and float64(1.0) are considered by
            builtin map as different keys.
      
          - In Python world bytestring (str from py2) is considered to be related
            to both unicode (str on py3) and bytes, but builtin map considers all
            string, Bytes and ByteString as different keys.
      
          - etc...
      
          All in all there are many differences in behaviour in builtin Python
          dict and Go map that result in generally different semantics when
          decoding pickled data. Those differences can be fixed only if we add
          custom dict implementation that mirrors what Python does.
      
          -> Do that: add custom Dict that implements key -> value mapping with
             mirroring Python behaviour.
      
          For now we are only adding the Dict class itself and its tests.
          Later we will use this new Dict to handle decoding dictionaries from the pickles.
          ---- 8< ----
      
      In this patch we add new Decoder option to activate PyDict mode
      decoding, teach encoder to also support encoding of Dict and adjust
      tests.
      
      The behaviour of new system is explained by the following doc.go
      excerpt:
      
          For dicts there are two modes. In the first, default, mode Python dicts are
          decoded into standard Go map. This mode tries to use builtin Go type, but
          cannot mirror py behaviour fully because e.g. int(1), big.Int(1) and
          float64(1.0) are all treated as different keys by Go, while Python treats
          them as being equal. It also does not support decoding dicts with tuple
          used in keys:
      
               dict      map[any]any                       PyDict=n mode, default
                       ←  ogórek.Dict
      
          With PyDict=y mode, however, Python dicts are decoded as ogórek.Dict which
          mirrors behaviour of Python dict with respect to keys equality, and with
          respect to which types are allowed to be used as keys.
      
               dict      ogórek.Dict                       PyDict=y mode
                       ←  map[any]any
      3bf6c92d
    • Kirill Smelkov's avatar
      *: Switch reflect.DeepEqual to own deepEqual that supports Dict · 8be3fcab
      Kirill Smelkov authored
      reflect.DeepEqual does not work with Dicts because each Dict uses own
      distinct random seed. But in the tests we will need to be able to
      compare arbitrary objects, including Dicts.
      
      -> Fix this via implementing our own deepEqual that extends
         reflect.DeepEqual to also support Dicts.
      
      Handle only top-level Dicts for now as that level of support will be
      enough for our tests.
      8be3fcab
    • Kirill Smelkov's avatar
      Add custom Dict that mirrors Python dict behaviour · 4178bf4a
      Kirill Smelkov authored
      Ogórek currently represents unpickled dict via map[any]any, which is
      logical, but also exhibits issues because builtin Go map behaviour is
      different from Python's dict behaviour. For example:
      
      - Python's dict allows tuples to be used in keys, while Go map
        does not (https://github.com/kisielk/og-rek/issues/50),
      
      - Python's dict allows both long and int to be used interchangeable as
        keys, while Go map does not handle *big.Int as key with the same
        semantic (https://github.com/kisielk/og-rek/issues/55)
      
      - Python's dict allows to use numbers interchangeable in keys - all int
        and float, but on Go side int(1) and float64(1.0) are considered by
        builtin map as different keys.
      
      - In Python world bytestring (str from py2) is considered to be related
        to both unicode (str on py3) and bytes, but builtin map considers all
        string, Bytes and ByteString as different keys.
      
      - etc...
      
      All in all there are many differences in behaviour in builtin Python
      dict and Go map that result in generally different semantics when
      decoding pickled data. Those differences can be fixed only if we add
      custom dict implementation that mirrors what Python does.
      
      -> Do that: add custom Dict that implements key -> value mapping with
         mirroring Python behaviour.
      
      For now we are only adding the Dict class itself and its tests.
      Later we will use this new Dict to handle decoding dictionaries from the pickles.
      
      For the implementation we use github.com/aristanetworks/gomap which
      provides extraction of builtin go map code wrapped into generic type
      Map[Key,Value] that accepts custom equal and hash functions. And it is
      those equal and hash functions via which we define equality behaviour to
      be the same as on Python side.
      
      Please see documentation in the added code and tests for details.
      4178bf4a
    • Kirill Smelkov's avatar
      Drop support for Go < 1.18 · 64d5926e
      Kirill Smelkov authored
      In the next patch we will need to use generics, which are only available
      starting from Go 1.18 . That Go 1.18 was released in spring of 2022 -
      almost 2.5 years ago - and is already EOL now. The older releases became
      EOL even earlier compared to Go 1.18 and are generally no longer
      supported by Go upstream. Dropping support for Go < 1.18 remains support
      for Go 1.18 - 1.23, i.e. 6 releases 3 of which are already in end of
      life state. So dropping support for Go < 1.18 should be, hopefully, ok.
      64d5926e
  5. 18 Sep, 2024 2 commits
    • Kirill Smelkov's avatar
      encoder: Fix data corruption when encoding uint64 · 773b7a86
      Kirill Smelkov authored
      I discovered that uint64 is wrongly encoded. For example
      uint64(0x80000000ffffffff) is encoded as
      
      	"I-9223372032559808513\n."
      
      instead of
      
      	"I9223372041149743103\n."
      
      which results in wrong data to be loaded on Python side:
      
      	In [1]: import pickle
      
      	In [2]: pickle.loads(b"I-9223372032559808513\n.")
      	Out[2]: -9223372032559808513
      
      	In [3]: pickle.loads(b"I9223372041149743103\n.")
      	Out[3]: 9223372041149743103
      
      Similarly uint64(0xffffffffffffffff) is wrongly encoded as
      
      	"I-1\n."
      
      instead of
      
      	"I18446744073709551615\n."
      
      with the same effect of misinforming peer:
      
      	In [4]: pickle.loads(b"I-1\n.")
      	Out[4]: -1
      
      	In [5]: pickle.loads(b"\x80\x02I18446744073709551615\n.")
      	Out[5]: 18446744073709551615
      
      in general any uint64 that is greater than max(int64) is encoded
      wrongly.
      
      -> Fix that by encoding those integer range properly and with INT
         opcode. Unfortunately there is no BININT* variant to handle integers
         out of int32 range so we need to use text encoding for large
         uint64 similarly to how we already do for large int64.
      
      For symmetry we need to adjust the decoder as well to handle those I...
      with numbers out of signed int64 range, because without the adjustment
      it fails as e.g.
      
          --- FAIL: TestDecode/uint(0x80000000ffffffff)/StrictUnicode=n/"I9223372041149743103\n." (0.00s)
              ogorek_test.go:619: strconv.ParseInt: parsing "9223372041149743103": value out of range
              ogorek_test.go:623: decode:
                  have: <nil>
                  want: 9223372041149743103
      
      Since Python handles such input just fine (see In[3]/Out[3] above) we
      should as well. The number cannot be represented as int64 though, so if
      the number coming with INT opcode is out of fixed int64 range, we decode
      it into big.Int .
      773b7a86
    • Kirill Smelkov's avatar
      tests: Fix Xloosy · 85183af0
      Kirill Smelkov authored
      In b28613c2 (Add StrictUnicode mode) I modified TestEntry to have
      .strictUnicodeN and .strictUnicodeY fields and adjusted X to set them
      both to true by default. However I missed to adjust Xloosy, which
      started to return TestEntry with both .strictUnicodeN=.strictUnicodeY=false
      which effectively disabled running test entries defined via Xloosy.
      
      -> Fix that by reusing X in Xloosy implementation.
      
      This brings in unnoticed test failures with structs and strings:
      
          --- FAIL: TestDecode (0.03s)
              --- FAIL: TestDecode/[]ogórek.foo{"Qux",_4}/StrictUnicode=y/"((S\"Foo\"\nS\"Qux\"\nS\"Bar\"\nI4\ndl." (0.00s)
                  ogorek_test.go:594: decode:
                      have: []interface {}{map[interface {}]interface {}{ogórek.ByteString("Bar"):4, ogórek.ByteString("Foo"):ogórek.ByteString("Qux")}}
                      want: []interface {}{map[interface {}]interface {}{"Bar":4, "Foo":"Qux"}}
              --- FAIL: TestDecode/[]ogórek.foo{"Qux",_4}/StrictUnicode=y/"((U\x03FooU\x03QuxU\x03BarK\x04dl." (0.00s)
                  ogorek_test.go:594: decode:
                      have: []interface {}{map[interface {}]interface {}{ogórek.ByteString("Bar"):4, ogórek.ByteString("Foo"):ogórek.ByteString("Qux")}}
                      want: []interface {}{map[interface {}]interface {}{"Bar":4, "Foo":"Qux"}}
          --- FAIL: TestEncode (0.11s)
              --- FAIL: TestEncode/[]ogórek.foo{"Qux",_4}/StrictUnicode=y/proto=0 (0.00s)
                  ogorek_test.go:670: encode:
                      have: "((VFoo\nVQux\nVBar\nI4\ndl."
                      want: "((S\"Foo\"\nS\"Qux\"\nS\"Bar\"\nI4\ndl."
              --- FAIL: TestEncode/[]ogórek.foo{"Qux",_4}/StrictUnicode=y/proto=1 (0.00s)
                  ogorek_test.go:670: encode:
                      have: "((X\x03\x00\x00\x00FooX\x03\x00\x00\x00QuxX\x03\x00\x00\x00BarK\x04dl."
                      want: "((U\x03FooU\x03QuxU\x03BarK\x04dl."
              --- FAIL: TestEncode/[]ogórek.foo{"Qux",_4}/StrictUnicode=y/proto=2 (0.00s)
                  ogorek_test.go:670: encode:
                      have: "\x80\x02((X\x03\x00\x00\x00FooX\x03\x00\x00\x00QuxX\x03\x00\x00\x00BarK\x04dl."
                      want: "\x80\x02((U\x03FooU\x03QuxU\x03BarK\x04dl."
      
      because in that test data are prepared to auto strings mode and started
      to fail with StrictUnicode=y. Fix that by adjusting that test to run in
      StrictUnicode=n mode only.
      85183af0
  6. 19 Aug, 2024 1 commit
  7. 16 Aug, 2024 6 commits
    • Kirill Smelkov's avatar
      ci: Also run benchmarks to make sure they do not become broken · 08605c7a
      Kirill Smelkov authored
      As recent fixes show it is easy to break benchmarks while adjusting
      tests without noticing. Prevent that to happen by running all benchmarks
      just once during each CI run.
      08605c7a
    • Kirill Smelkov's avatar
      Drop support for Go 1.9, Go 1.10 and Go 1.11 · 1d818b7a
      Kirill Smelkov authored
      Under those go versions `go test` does not support `-benchtime Nx`
      syntax, that we are going to use in the next patch, and fails with
      
          invalid value "1x" for flag -test.benchtime: time: unknown unit x in duration 1x
      
      Maybe it is not a big deal and if it were just `-benchtime Nx` we could
      try to invent and do something to try to support those old Go versions.
      However I have upcoming patches that will require Go 1.18 for generics,
      and from that point of view trying to keep support for Go version older
      than Go 1.18 would be not practical.
      1d818b7a
    • Kirill Smelkov's avatar
      tests: Fix BenchmarkDecode · 69e892b1
      Kirill Smelkov authored
      After 2c359fc1 (decoder: Remember protocol version as last seen in a
      PROTO opcode) BenchmarkDecode started to fail with
      
          --- FAIL: BenchmarkDecode
              ogorek_test.go:977: invalid pickle version
      
      That happens because some input pickles now come with `PROTO 0xff`
      prefix and version 0xff is indeed invalid.
      
      -> Fix it by adjusting that prefix to be `PROTO 3` during benchmark as 3
      is the most-widely used protocol version used by both py3 and py2 (via
      zodbpickle).
      69e892b1
    • Kirill Smelkov's avatar
      tests: Fix BenchmarkEncode · 1b57c3e4
      Kirill Smelkov authored
      After 6e5e403e (encoder: Fix GLOBAL emission wrt module/name with \n)
      BenchmarkEncode started to fail with
      
          --- FAIL: BenchmarkEncode
              ogorek_test.go:1003: protocol 0-3: global: module & name must be string without \n
      
      -> Fix it by taking as input only those testdata objects that are marked
      to encode without errors.
      1b57c3e4
    • Kirill Smelkov's avatar
      encoder: Fix int ranges for which BININT1 and BININT2 are emitted · 6ce3a6ee
      Kirill Smelkov authored
      BININT1 emits unsigned integer that fits into 1 byte.
      BININT2 emits unsigned integer that fits into 2 bytes.
      
      So they cover [0, 0xff] and [0, 0xffff] ranges including edges. However
      we were emitting BININTX opcodes only for numbers being strictly inside
      those ranges - i.e. (0, 0xff) and (0, 0xffff) - without edges. As the
      result what could be emitted as "K\x00." was emitted as
      "J\x00\x00\x00\x00." and so on.
      
      Fix it by emitting BININT1 for [0, 0xff] inclusive and BININT for [0,
      0xffff] also inclusive.
      6ce3a6ee
    • Kirill Smelkov's avatar
      decoder: Fix integer overflow in BINUNICODE handling · db0cbaf8
      Kirill Smelkov authored
      BINUNICODE opcode comes with 4-byte little-endian unsigned int length after it.
      However we were decoding it as signed int32 instead of unsigned uint32.
      
      As the result if last byte of this length is >= 0x80 the length
      was decoded as negative:
      
              y := int32(0x81)
              println(y << (3*8))
      
      	Output: -2130706432
      
      which was leading to empty unicode string returned instead of an error.
      
      -> Fix it by decoding BINUNICODE lenght as uint32.
      
      Without the fix added test fails as
      
      	--- FAIL: TestDecodeError (0.00s)
      	    ogorek_test.go:812: "X\xff\xff\xff\xff.": no decode error  ; got "", <nil>
      
      And for that "X\xff\xff\xff\xff." Python also rejects to load this
      pickle:
      
      	In [1]: p = b"X\xff\xff\xff\xff."
      
      	In [2]: pickle.loads(p)
      	---------------------------------------------------------------------------
      	UnpicklingError                           Traceback (most recent call last)
      	Cell In [2], line 1
      	----> 1 pickle.loads(p)
      
      	UnpicklingError: pickle data was truncated
      
      	In [3]: dis(p)
      	---------------------------------------------------------------------------
      	ValueError                                Traceback (most recent call last)
      	Cell In [3], line 1
      	----> 1 dis(p)
      
      	File /usr/lib/python3.11/pickletools.py:2448, in dis(pickle, out, memo, indentlevel, annotate)
      	   2446 errormsg = None
      	   2447 annocol = annotate  # column hint for annotations
      	-> 2448 for opcode, arg, pos in genops(pickle):
      	   2449     if pos is not None:
      	   2450         print("%5d:" % pos, end=' ', file=out)
      
      	File /usr/lib/python3.11/pickletools.py:2291, in _genops(data, yield_end_pos)
      	   2289     arg = None
      	   2290 else:
      	-> 2291     arg = opcode.arg.reader(data)
      	   2292 if yield_end_pos:
      	   2293     yield opcode, arg, pos, getpos()
      
      	File /usr/lib/python3.11/pickletools.py:693, in read_unicodestring4(f)
      	    691 if len(data) == n:
      	    692     return str(data, 'utf-8', 'surrogatepass')
      	--> 693 raise ValueError("expected %d bytes in a unicodestring4, but only %d "
      	    694                  "remain" % (n, len(data)))
      
      	ValueError: expected 4294967295 bytes in a unicodestring4, but only 1 remain
      db0cbaf8
  8. 17 Jul, 2024 1 commit
    • Kirill Smelkov's avatar
      *: Cosmetics · 465792bb
      Kirill Smelkov authored
      Fix noticed typos, clarify things in a couple of places. Only minor
      and documentation changes.
      
      /helped-by @kisielk
      465792bb
  9. 16 Jul, 2024 6 commits
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · 7d520718
      Kirill Smelkov authored
      Rerunning go-fuzz after recent patches resulted in some new fuzz test vectors.
      7d520718
    • Kirill Smelkov's avatar
      Add StrictUnicode mode · b28613c2
      Kirill Smelkov authored
      Up till now ogórek works relatively well for me but experience gained
      with ZODB/go and Wendelin.core highlighted two problems with strings:
      
      1) loading, then re-saving string data is not generally identity, and
      2) there is no way to distinguish binary data saved via py2 str from
         unicode strings saved via either py2 unicode or py3 str.
      
      Let me explain those problems:
      
      1) Loading, then re-saving string data is not generally identity
      ----------------------------------------------------------------
      
      On decoding ogórek currently loads both byte-string (*STRING opcodes)
      and unicode string (*UNICODE opcodes) into the same Go type string. And
      on encoding that string type is encoded as *STRING for protocol <= 2 and
      as *UNICODE for protocol >= 3. It was me to implement that encoding
      logic in 2018 in e7d96969 (encoder: Fix string wrt protocol version)
      where in particular I did
      
          - if protocol >= 3 we have to emit the string as unicode pickle object
            the same way as Python3 does. If we don't do - Python3 won't be
            generally able to load our pickle ...
      
      with the idea that protocol=3 always means that the pickle is intended
      to be for Python3.
      
      But there I missed that zodbpickle can use and generate pickles with
      protocol=3 even on py2 and that ZODB/py2 actually uses this protocol=3
      mode since https://github.com/zopefoundation/ZODB/commit/12ee41c4
      authored in the same 2018.
      
      So, there can be pickles saved under protocol=3 that do contain both
      *STRING and *UNICODE opcodes, and when ogórek sees those it loads that
      *STRING and *UNICODE data as just string loosing the information about
      type of particular variant. And then on encoding the data is saved as
      all UNICODE, or all STRING, thus breaking decode/encode=identity
      property.
      
      This breakage is there even with plain pickle and protocol=2 - when both
      *STRING and *UNICODE are in the database, ogórek loads them both as the
      same Go type string, and then saving back under the same protocol=2 goes
      as all STRING resulting in unicode objects becoming str on resave
      without any intended change.
      
      2) there is no way to distinguish binary data saved via py2 str from unicode strings saved via either py2 unicode or py3 str
      ----------------------------------------------------------------------------------------------------------------------------
      
      Continuing above example of py2 database with both *STRING and *UNICODE
      opcodes present there is currently no way for application to distinguish
      those from each other. In other words there is currently no way for the
      application to distinguish whether it is binary data coming from py2
      protocol <= 2 era, from unicode text.
      
      The latter problem I hit for real: with Wendelin.core we have lots of
      data saved from under Python2 as just str. And the Go part of
      Wendlin.core, upon loading block of data, wants to accept only binary -
      either bytes from py3 or bytestring from py2, but not unicode, because
      it indicates a mistake if e.g. a ZBlk object would come with unicode data
      
      https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/bigfile/file_zodb.py#L267-300
      https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L31-32
      https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L98-107
      
      but there is currently no way to distinguish whether it was unicode or
      bytestring saved into the database becase they both are represented as
      the same Go string type after decoding.
      
      ----------------------------------------
      
      So to solve those problems I thought it over and understood that the
      issues start to appear becase we let *STRING and *UNICODE to become
      mixed into the same entity on loading. This behaviour is there from
      ogórek beginning and the intention, it seems, was to get the data from
      the pickle stream in an easily-accepted form on the Go side. However the
      convenience turned out to come with cost of loosing some correctness in
      the general case as explained above.
      
      So if we are to fix the correctness we need to change that behaviour and
      load *STRING and *UNICODE opcodes into distinct types, so that the
      information about what was what is preserved and it becomes possible to
      distinguish bytestring from unicode strings and resave the data in
      exactly the same form as loaded. Though we can do this only under opt-in
      option with default behaviour staying as it was before to preserve
      backward compatibility.
      
      -> Do it.
      
      Below is excerpt from doc.go and DecoderConfig and EncoderConfig changes that
      describe the new system:
      
          For strings there are two modes. In the first, default, mode both py2/py3
          str and py2 unicode are decoded into string with py2 str being considered
          as UTF-8 encoded. Correspondingly for protocol ≤ 2 Go string is encoded as
          UTF-8 encoded py2 str, and for protocol ≥ 3 as py3 str / py2 unicode.
          ogórek.ByteString can be used to produce bytestring objects after encoding
          even for protocol ≥ 3. This mode tries to match Go string with str type of
          target Python depending on protocol version, but looses information after
          decoding/encoding cycle:
      
              py2/py3 str    string                       StrictUnicode=n mode, default
              py2 unicode  →  string
              py2 str      ←  ogórek.ByteString
      
          However with StrictUnicode=y mode there is 1-1 mapping in between py2
          unicode / py3 str vs Go string, and between py2 str vs ogórek.ByteString.
          In this mode decoding/encoding and encoding/decoding operations are always
          identity with respect to strings:
      
              py2 unicode / py3 str    string             StrictUnicode=y mode
              py2 str                  ogórek.ByteString
      
          For bytes, unconditionally to string mode, there is direct 1-1 mapping in
          between Python and Go types:
      
              bytes          ogórek.Bytes   (~)
              bytearray      []byte
      
          --------
      
          type DecoderConfig struct {
              // StrictUnicode, when true, requests to decode to Go string only
              // Python unicode objects. Python2 bytestrings (py2 str type) are
              // decoded into ByteString in this mode...
              StrictUnicode bool
          }
      
          type EncoderConfig struct {
              // StrictUnicode, when true, requests to always encode Go string
              // objects as Python unicode independently of used pickle protocol...
              StrictUnicode bool
          }
      
      Since strings are now split into two types, string and ByteString, and
      ByteString can either mean text or binary data, new AsString and AsBytes
      helpers are also added to handle string and binary data in uniform way
      supporting both py2 and py3 databases. Corresponding excerpts from
      doc.go and typeconv.go changes with the description of those helpers
      come below:
      
          On Python3 strings are unicode strings and binary data is represented by
          bytes type. However on Python2 strings are bytestrings and could contain
          both text and binary data. In the default mode py2 strings, the same way as
          py2 unicode, are decoded into Go strings. However in StrictUnicode mode py2
          strings are decoded into ByteString - the type specially dedicated to
          represent them on Go side. There are two utilities to help programs handle
          all those bytes/string data in the pickle stream in uniform way:
      
              - the program should use AsString if it expects text   data -
                either unicode string, or byte string.
              - the program should use AsBytes  if it expects binary data -
                either bytes, or byte string.
      
          Using the helpers fits into Python3 strings/bytes model but also allows to
          handle the data generated from under Python2.
      
          --------
      
          // AsString tries to represent unpickled value as string.
          //
          // It succeeds only if the value is either string, or ByteString.
          // It does not succeed if the value is Bytes or any other type.
          //
          // ByteString is treated related to string because ByteString represents str
          // type from py2 which can contain both string and binary data.
          func AsString(x interface{}) (string, error) {
      
          // AsBytes tries to represent unpickled value as Bytes.
          //
          // It succeeds only if the value is either Bytes, or ByteString.
          // It does not succeed if the value is string or any other type.
          //
          // ByteString is treated related to Bytes because ByteString represents str
          // type from py2 which can contain both string and binary data.
          func AsBytes(x interface{}) (Bytes, error) {
      
      ZODB/go and Wendelin.core intend to switch to using StrictUnicode mode
      while leaving ogórek to remain 100% backward-compatible in its default
      mode for other users.
      b28613c2
    • Kirill Smelkov's avatar
      Add AsInt64 helper to help programs handle unpickled data be it int or long · 010fbd2e
      Kirill Smelkov authored
      On Python two different objects with different types can represent
      essentially the same entity. For example 1 (int) and 1L (long) represent
      integer number one via two different types and are decoded by ogórek into Go
      types int64 and big.Int correspondingly. However on the Python side those
      two representations are often used interchangeably and programs are usually
      expected to handle both with the same effect.
      
      For example I hit this situation for real with FileStorage index, ZEO
      protocol, BTrees and Wendelin.core where data with integer fields, for
      the same field, are sometimes represented as int and sometimes as long.
      As the result ZODB/go and Wendelin.core had private XInt64 helper used
      as follows:
      
      https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/pycompat/pycompat.go
      https://lab.nexedi.com/nexedi/wendelin.core/-/blob/07087ec8/wcfs/internal/zdata/zblk.go#L385
      https://lab.nexedi.com/search?utf8=%E2%9C%93&search=XInt64&group_id=&project_id=73&search_code=true&repository_ref=f7776fc1&nav_source=navbar
      
      Besides int/long we are also hitting similar situation with
      bytes / str(py2) / str(py3) which will also need to use corresponding
      helpers to be able to load pickled data from existing databases.
      
      In the next patch we will add StrictUnicode mode to handle that strings,
      and it will naturally come with AsBytes and AsString helpers. Thus it
      also feels appropriate to keep the other useful helpers to handle
      unpickled data inside ogórek itself.
      
      This patch adds AsInt64 helper with top-level description and tests.
      010fbd2e
    • Kirill Smelkov's avatar
      Drop support for Go 1.7 and Go 1.8 · 17fed0ed
      Kirill Smelkov authored
      In the next patch I will need to use big.Int.IsInt64, which was added in
      Go 1.9 in 2017. Go 1.7 and Go 1.8 are very old nowdays - their last
      point releases were done in 2017 (go1.7.6) and 2018 (go1.8.7) - 7 and 6
      years ago correspondingly. Since then those releases are EOL with many
      further Go releases done after them. We even specify go1.12 as the
      requirement in our go.mod, but that is probably because go1.12 is the
      first version to support modules. Anyway dropping support for those very
      old Go1.7 and Go1.8 should be almost certainly ok as I doubt if there
      are current ogórek users still stuck at those old Go versions.
      17fed0ed
    • Kirill Smelkov's avatar
      encoder: Fix protocol 0 UNICODE emission for invalid UTF-8 · 09fec8bd
      Kirill Smelkov authored
      In 9daf6a2a (Fix UNICODE decoding) I fixed protocol 0 UNICODE decoding
      by implementing "raw-unicode-escape" decoder and switching unpickler to
      use that to match 1-to-1 what Python unpickler does.
      
      In 57f875fd (encoder: Fix protocol 0 UNICODE emission) I further fixed
      protocol 0 UNICODE encoding by implementing "raw-unicode-escape" encoder
      and switching pickler to use that, trying to match 1-to-1 what Python
      pickler does.
      
      However there is a difference in between Python unicode and unicode on
      Go side: in Python unicode is immutable array of UCS code points. On Go
      side, unicode is immutable array of bytes, that, similarly to Go string
      are treated as being mostly UTF-8. We did not pick e.g. []rune to
      represent unicode on Go side because []rune is not immutable and so
      cannot be used as map keys.
      
      So Go unicode can be either valid UTF-8 or invalid UTF-8. For valid
      UTF-8 case everything is working as intended: our raw-unicode-escape
      decoder, because it works in terms of UCS, can produce only valid UTF-8,
      while our raw-unicode-escape encoder also matches 1-to-1 what python
      does because for valid UTF-8 utf8.DecodeRuneInString gives good rune and
      the encoder further works in terms of UCS.
      
      However for the case of invalid UTF-8, there is a difference in between
      what Python and Go raw-unicode-escape encoders do: for Python this case
      is simply impossible because there input is []UCS. For the Go case, in
      57f875fd I extended the encoder to do:
      
      	// invalid UTF-8 -> emit byte as is
      	case r == utf8.RuneError:
      		out = append(out, s[0])
      
      which turned out to be not very thoughtful and wrong because
      original raw-unicode-escape also emits UCS < 0x100 as those plain bytes
      and so if we also emit invalid UTF-8 as is then the following two inputs
      would be encoded into the same representation "\x93":
      
      	unicode("\x93")     // invalid UTF-8
      	unicode("\xc2\x93)  // UTF-8 of \u93
      
      which would break `decode/encode = identity` invariant and corrupt the
      data because when loaded back it will be "\xc2\x93" instead of original "\x93".
      
      -> Fix it by rejecting to encode such invalid UTF-8 via protocol 0 UNICODE.
      
      Unfortunately rejection is the only reasonable choice because
      raw-unicode-escape codec does not allow \xAA to be present in the output
      stream and so there is simply no way to represent arbitrary bytes there.
      
      Better to give explicit error instead of corrupting the data.
      
      For protocols ≥ 1 arbitrary unicode - both valid and invalid UTF-8 - can
      still be loaded and saved because *BINUNICODE opcodes come with
      bytestring argument and there is no need to decode/encode those
      bytestrings.
      
      /reviewed-by @kisielk
      /reviewed-on https://github.com/kisielk/og-rek/pull/67
      09fec8bd
    • Kirill Smelkov's avatar
      Make Bytes and unicode to be distinguishable from string in test errors · 1fea98e4
      Kirill Smelkov authored
      For example currently if result of decoding is expected to be unicode,
      but is actually string the following confusing error is generated with
      exactly the same have and want:
      
          --- FAIL: TestDecode/unicode(non-utf8)/"X\x01\x00\x00\x00\x93." (0.00s)
              ogorek_test.go:504: decode:
                  have: "\x93"
                  want: "\x93"
      
      This happens because by default Printf("%#v") does not emit object type
      even if it was derived type from builtin string - either our Bytes or
      unicode, and emits the same string representation irregardless of which
      type it is.
      
      I think out-of-the box behaviour of %#v is unreasonable, but it is there
      for a long time and so instead of trying to argue with Go upstream,
      which after many years of arguing I think is not practical, teach our
      Bytes and unicode types how to represent themselves under %#v with
      indicating their type in the output.
      
      After the fix the error from above becomes
      
          --- FAIL: TestDecode/unicode(non-utf8)/"X\x01\x00\x00\x00\x93." (0.00s)
              ogorek_test.go:504: decode:
                  have: "\x93"
                  want: ogórek.unicode("\x93")
      
      which is more clear because want is different from have.
      
      Adjusted behaviour is not test-only, because user applications might
      want to do the same %#v at application level and there it would be also
      confusing not to see a type.
      
      /reviewed-by @kisielk
      /reviewed-on https://github.com/kisielk/og-rek/pull/67
      1fea98e4
  10. 12 Jul, 2024 5 commits
    • Kamil Kisiel's avatar
      Merge pull request #66 from navytux/y/fuzz-sha1 · 9913a56c
      Kamil Kisiel authored
      fuzz: Export tests pickles to fuzz/corpus with stable filenames
      9913a56c
    • Kirill Smelkov's avatar
      fuzz: Export tests pickles to fuzz/corpus with stable filenames · 1e108391
      Kirill Smelkov authored
      In 9d1344ba (fuzz: Automatically export all tests pickles into
      fuzz/corpus) I added code that automatically exports all test pickles
      to fuzz corpus, so that fuzzing starts from all the tricky cases we
      already put into tests. However exported files were named as
      test-i-j.pickle, where i is sequential number of test entry in global
      test vector. This way if we insert some new entry in the middle, or
      change entries order it causes lots of renaming inside fuzz/corpus/ on
      next `go generate`.
      
      -> Fix that by emitting vectors as test-<sha1-of-pickle>.pickle so that
      file name depends only on pickle content are remains stable
      independently of entry position in the list of tests.
      
      Besides changes to fuzz_test.go this patch brings:
      
      1) rm fuzz/corpus/test-*.pickle
      2) `go generate`
      
      The result inside fuzz/corpus/ is pure renaming + removal of duplicates.
      For example test-15-4.pickle is removed because the same data was in test-14-4.pickle
      and now resides in test-ee4459c0c165c4cbc3672ee5ef5438e00121b229.pickle.
      1e108391
    • Kamil Kisiel's avatar
      Merge pull request #65 from navytux/y/bytearray8-fix · 9ca4a70a
      Kamil Kisiel authored
      decoder: Fix integer overflow in BYTEARRAY8 handling
      9ca4a70a
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · 4f485784
      Kirill Smelkov authored
      Rerunning recent go-fuze resulted in adding some new test vectors.
      4f485784
    • Kirill Smelkov's avatar
      decoder: Fix integer overflow in BYTEARRAY8 handling · 45fb14a6
      Kirill Smelkov authored
      Rerunning fuzzing afresh found the following crash because uint64 len > max(int64)
      was cast to int resulting in negative number:
      
          "\x960000000\xbd"
      
          panic: bytes.Buffer.Grow: negative count
      
          goroutine 1 [running]:
          bytes.(*Buffer).Grow(0xc000012140?, 0xc00010abf8?)
                  /home/kirr/src/tools/go/go1.22/src/bytes/buffer.go:166 +0xb4
          github.com/kisielk/og-rek.(*Decoder).bufLoadBytesData(0xc000072000, 0xbd30303030303030)
                  /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:788 +0xd5
          github.com/kisielk/og-rek.(*Decoder).bufLoadBinData8(0xc000072000)
                  /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:776 +0x105
          github.com/kisielk/og-rek.(*Decoder).loadBytearray8(0xc000072000)
                  /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:1270 +0x34
          github.com/kisielk/og-rek.(*Decoder).Decode(0xc000072000)
                  /home/kirr/src/neo/src/github.com/kisielk/og-rek/ogorek.go:311 +0x1a2c
          github.com/kisielk/og-rek.Fuzz({0x7fd052187000, 0x9, 0x9})
                  /home/kirr/src/neo/src/github.com/kisielk/og-rek/fuzz.go:15 +0xf6
          go-fuzz-dep.Main({0xc00010af38, 0x1, 0x5bc338?})
                  go-fuzz-dep/main.go:36 +0x14c
          main.main()
                  github.com/kisielk/og-rek/go.fuzz.main/main.go:15 +0x35
          exit status 2
      
      -> Fix it by first comparing uint64 len without casting and further casting to
      int only after we cap the len to be in safe range.
      45fb14a6
  11. 07 Jul, 2024 1 commit
  12. 10 Mar, 2021 1 commit
  13. 16 Sep, 2020 2 commits
    • Kirill Smelkov's avatar
      Add module support · 24bb08c2
      Kirill Smelkov authored
      Ogórek can be already used from another module (but at outdated
      version), but let's be explicit about that. Go.mod was created by `go
      mod init github.com/kisielk/og-rek`. There is no external dependencies.
      24bb08c2
    • Kirill Smelkov's avatar
      .travis += go1.12, go1.13, go1.14 and go1.15 · 4cf717d2
      Kirill Smelkov authored
      Go1.14 and Go1.15 are the current stable Go releases.
      Go1.12 and Go1.14 fill in the gap in testing coverage in between -go1.11
      and current go releases.
      
      Maybe we can eventually drop older go release from travis coverage.
      4cf717d2
  14. 28 Sep, 2018 5 commits
    • Kirill Smelkov's avatar
      Package-level documentation · 8b25c4ce
      Kirill Smelkov authored
      Add brief overview on to how to use ogórek, and how Python and Go types
      are mapped with each other. For completeness give overview of pickle
      protocols and overview of persistent references.
      
      (Documentation put into separate doc.go as suggested by Kamil)
      8b25c4ce
    • Kirill Smelkov's avatar
      fuzz/corpus: Update · e5126a53
      Kirill Smelkov authored
      Rerun `go generate` since new tests vectors have been added in recent
      bytes/bytearray patches.
      e5126a53
    • Kirill Smelkov's avatar
      Fix/Add support for []byte (= bytearray on Python side) · 30ebda02
      Kirill Smelkov authored
      Starting from 2013 (from 555efd8f "first draft of dumb pickle encoder")
      wrt []byte ogórek state was:
      
      1. []byte was encoded as string
      2. there was no way to decode a pickle object into []byte
      
      then, as
      
      - []byte encoding was never explicitly tested,
      - nor I could find any usage of such encodings via searching through all Free /
        Open-Source software ogórek uses - I searched via "Uses" of NewEncoder on godoc:
      
        https://sourcegraph.com/github.com/kisielk/og-rek/-/blob/encode.go#L48:6=&tab=references:external
      
      it is likely that []byte encoding support was added just for the sake of
      it and convenience and then never used. It is also likely that the
      original author does not use ogórek encoder anymore:
      
      	https://github.com/kisielk/og-rek/pull/52#issuecomment-423639026
      
      For those reasons I tend to think that it should be relatively safe to
      change how []byte is handled:
      
      - the need to change []byte handling is that currently []byte is a kind of
        exception: we can only encode it and not decode something into it.
        Currently encode/decode roundtrip for []byte gives string, which breaks
        the property of encode/decode being identity for all other basic types.
      
      - on the similar topic, on encoding strings are assumed UTF-8 and are
        encoded as UNICODE opcodes for protocol >= 3. Passing arbitrary bytes
        there seems to be not good.
      
      - on to how change []byte - sadly it cannot be used as Python's bytes
        counterpart. In fact in the previous patch we actually just added
        ogórek.Bytes as a counterpart for Python bytes. We did not used []byte
        for that because - contrary to Python bytes - []byte cannot be used as a
        dict key.
      
      - the most natural counterpart for Go's []byte is thus Python's
        bytearray:
      
      	https://docs.python.org/3/library/stdtypes.html#bytearray-objects
      
        which is "a mutable counterpart to bytes objects"
      
      So add Python's bytearray decoding into []byte, and correspondingly
      change []byte encoding to be encoded as bytearray.
      
      P.S.
      
      This changes encoder semantic wrt []byte. If some ogórek use breaks
      somewhere because of it, we could add an encoder option to restore
      backward compatible behaviour. However since I suspect noone was
      actually encoding []byte into pickles, I prefer we wait for someone to
      speak-up first instead of loading EncoderConfig with confusion options
      that nobody will ever use.
      30ebda02
    • Kirill Smelkov's avatar
      decoder: Remember protocol version as last seen in a PROTO opcode · 2c359fc1
      Kirill Smelkov authored
      In the next patch decoding a pickle will depend on whether current
      protocol is <= 2 or >= 3. For this let's teach PROTO opcode handler to
      recall last seen protocol version.
      
      For testing - prepare tests infrastructure for cases where protocol
      version affects decoding semantic: if a test pickle in tests table comes
      with \x80\xff prefix, on decode tests this prefix will be changed to
      concrete
      
      	\x80 ver
      
      with all versions protperly iterated as specified in TestPickle.protov.
      
      We will also need this functionality in the next patch.
      2c359fc1
    • Kirill Smelkov's avatar
      Add support for Python bytes · 514123d4
      Kirill Smelkov authored
      In Python bytes is immutable and read-only array of bytes. It is
      also hashable and so is different from go []byte in that it can be
      used as a dict key. Thus the closes approximation for Python bytes in Go
      is some type derived from Go's string - it will be different from string
      and at the same time will inherit from string it immutability property
      and being able to be used as map key. So
      
      - add ogórek.Bytes type to represent Python bytes
      - add support to decode BINBYTES* pickle opcodes (these are protocol 3 opcodes)
      - add support to encode ogórek.Bytes via those BINBYTES* opcodes
      - for protocols <= 2, where there is no opcodes to directly represent
        bytes, adopt the same approach as Python - by pickling bytes as
      
      	_codecs.encode(byt.decode('latin1'), 'latin1')
      
        this way unpickling it on Python3 will give bytes, while unpickling it
        on Python2 will give str:
      
      	In [1]: sys.version
      	Out[1]: '3.6.6 (default, Jun 27 2018, 14:44:17) \n[GCC 8.1.0]'
      
      	In [2]: byt = b'\x01\x02\x03'
      
      	In [3]: _codecs.encode(byt.decode('latin1'), 'latin1')
      	Out[3]: b'\x01\x02\x03'
      
        ---
      
      	In [1]: sys.version
      	Out[1]: '2.7.15+ (default, Aug 31 2018, 11:56:52) \n[GCC 8.2.0]'
      
      	In [2]: byt = b'\x01\x02\x03'
      
      	In [3]: _codecs.encode(byt.decode('latin1'), 'latin1')
      	Out[3]: '\x01\x02\x03'
      
      - correspondingly teach decoder to recognize particular calls to
        _codecs.encode as being representation for bytes and decode it
        appropriately.
      
      - since we now have to emit byt.decode('latin1') as UNICODE - add, so
        far internal, `type unicode(string)` that instructs ogórek encoder to
        always emit the string with UNICODE opcodes (regular string is encoded
        to unicode pickle object only for protocol >= 3).
      
      - For []byte encoding preserve the current status - even though
        dispatching in Encoder.encode changes, the end result is the same -
        []byte was and stays currently encoded as just regular string.
      
        This was added in 555efd8f "first draft of dumb pickle encoder", and
        even though that might be not a good choice, changing it is a topic for
        another patch.
      514123d4