Python pickle serialization format: Ruby parsing library

Python Pickle format serializes Python objects to a byte stream, as a sequence of operations to run on the Pickle Virtual Machine.

The format is mostly implementation defined, there is no formal specification. Pickle data types are closely coupled to the Python object model. Python singletons, and most builtin types (e.g. None, int,dict, list) are serialised using dedicated Pickle opcodes. Other builtin types, and all classes (e.g. set, datetime.datetime) are serialised by encoding the name of a constructor callable. They are deserialised by importing that constructor, and calling it. So, unpickling an arbitrary pickle, using the Python's stdlib pickle module can cause arbitrary code execution.

Pickle format has evolved with Python, later protocols add opcodes & types. Later Python releases can pickle to or unpickle from any earlier protocol.

  • Protocol 0: ASCII clean, no explicit version, fields are '\n' terminated.
  • Protocol 1: Binary, no explicit version, first length prefixed types.
  • Protocol 2 (PEP 307): Python 2.3+. Explicit versioning, more length prefixed types.
  • Protocol 3: Python 3.0+. Dedicated opcodes for bytes objects.
  • Protocol 4 (PEP 3154): Python 3.4+. Opcodes for 64 bit strings, framing, set.
  • Protocol 5 (PEP 574): Python 3.8+: Opcodes for bytearray and out of band data

Application

Python

File extension

["pickle", "pkl"]

KS implementation details

License: CC0-1.0

References

This page hosts a formal specification of Python pickle serialization format using Kaitai Struct. This specification can be automatically translated into a variety of programming languages to get a parsing library.

Usage

Runtime library

All parsing code for Ruby generated by Kaitai Struct depends on the Ruby runtime library. You have to install it before you can parse data.

The Ruby runtime library can be installed from RubyGems:

gem install kaitai-struct

Code

Parse a local file and get structure in memory:

data = PythonPickle.from_file("path/to/local/file.pickle")

Or parse structure from a string of bytes:

bytes = "\x00\x01\x02..."
data = PythonPickle.new(Kaitai::Struct::Stream.new(bytes))

After that, one can get various attributes from the structure by invoking getter methods like:

data.ops # => get ops

Ruby source code to parse Python pickle serialization format

python_pickle.rb

# This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild

require 'kaitai/struct/struct'

unless Gem::Version.new(Kaitai::Struct::VERSION) >= Gem::Version.new('0.9')
  raise "Incompatible Kaitai Struct Ruby API: 0.9 or later is required, but you have #{Kaitai::Struct::VERSION}"
end


##
# Python Pickle format serializes Python objects to a byte stream, as a sequence
# of operations to run on the Pickle Virtual Machine.
# 
# The format is mostly implementation defined, there is no formal specification.
# Pickle data types are closely coupled to the Python object model.
# Python singletons, and most builtin types (e.g. `None`, `int`,`dict`, `list`)
# are serialised using dedicated Pickle opcodes.
# Other builtin types, and all classes  (e.g. `set`, `datetime.datetime`) are
# serialised by encoding the name of a constructor callable.
# They are deserialised by importing that constructor, and calling it.
# So, unpickling an arbitrary pickle, using the Python's stdlib pickle module
# can cause arbitrary code execution.
# 
# Pickle format has evolved with Python, later protocols add opcodes & types.
# Later Python releases can pickle to or unpickle from any earlier protocol.
# 
# * Protocol 0: ASCII clean, no explicit version, fields are '\n' terminated.
# * Protocol 1: Binary, no explicit version, first length prefixed types.
# * Protocol 2 ([PEP 307](https://peps.python.org/pep-0307/)): Python 2.3+.
#   Explicit versioning, more length prefixed types.
# * Protocol 3: Python 3.0+. Dedicated opcodes for `bytes` objects.
# * Protocol 4 ([PEP 3154](https://peps.python.org/pep-3154/)): Python 3.4+.
#   Opcodes for 64 bit strings, framing, `set`.
# * Protocol 5 ([PEP 574](https://peps.python.org/pep-0574/)): Python 3.8+:
# Opcodes for `bytearray` and out of band data
# @see https://github.com/python/cpython/blob/v3.8.1/Lib/pickletools.py Source
class PythonPickle < Kaitai::Struct::Struct

  OPCODE = {
    40 => :opcode_mark,
    41 => :opcode_empty_tuple,
    46 => :opcode_stop,
    48 => :opcode_pop,
    49 => :opcode_pop_mark,
    50 => :opcode_dup,
    66 => :opcode_binbytes,
    67 => :opcode_short_binbytes,
    70 => :opcode_float,
    71 => :opcode_binfloat,
    73 => :opcode_int,
    74 => :opcode_binint,
    75 => :opcode_binint1,
    76 => :opcode_long,
    77 => :opcode_binint2,
    78 => :opcode_none,
    80 => :opcode_persid,
    81 => :opcode_binpersid,
    82 => :opcode_reduce,
    83 => :opcode_string,
    84 => :opcode_binstring,
    85 => :opcode_short_binstring,
    86 => :opcode_unicode,
    88 => :opcode_binunicode,
    93 => :opcode_empty_list,
    97 => :opcode_append,
    98 => :opcode_build,
    99 => :opcode_global_opcode,
    100 => :opcode_dict,
    101 => :opcode_appends,
    103 => :opcode_get,
    104 => :opcode_binget,
    105 => :opcode_inst,
    106 => :opcode_long_binget,
    108 => :opcode_list,
    111 => :opcode_obj,
    112 => :opcode_put,
    113 => :opcode_binput,
    114 => :opcode_long_binput,
    115 => :opcode_setitem,
    116 => :opcode_tuple,
    117 => :opcode_setitems,
    125 => :opcode_empty_dict,
    128 => :opcode_proto,
    129 => :opcode_newobj,
    130 => :opcode_ext1,
    131 => :opcode_ext2,
    132 => :opcode_ext4,
    133 => :opcode_tuple1,
    134 => :opcode_tuple2,
    135 => :opcode_tuple3,
    136 => :opcode_newtrue,
    137 => :opcode_newfalse,
    138 => :opcode_long1,
    139 => :opcode_long4,
    140 => :opcode_short_binunicode,
    141 => :opcode_binunicode8,
    142 => :opcode_binbytes8,
    143 => :opcode_empty_set,
    144 => :opcode_additems,
    145 => :opcode_frozenset,
    146 => :opcode_newobj_ex,
    147 => :opcode_stack_global,
    148 => :opcode_memoize,
    149 => :opcode_frame,
    150 => :opcode_bytearray8,
    151 => :opcode_next_buffer,
    152 => :opcode_readonly_buffer,
  }
  I__OPCODE = OPCODE.invert
  def initialize(_io, _parent = nil, _root = self)
    super(_io, _parent, _root)
    _read
  end

  def _read
    @ops = []
    i = 0
    begin
      _ = Op.new(@_io, self, @_root)
      @ops << _
      i += 1
    end until _.code == :opcode_stop
    self
  end

  ##
  # Length prefixed string, between 0 and 2**64-1 bytes long.
  # 
  # Only a 64-bit build of Python would produce a pickle containing strings
  # large enough to need this type. Such a pickle could not be unpickled on
  # a 32-bit build of Python, because the string would be larger than
  # `sys.maxsize`.
  class Unicodestring8 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u8le
      @val = (@_io.read_bytes(len)).force_encoding("utf8")
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Large signed integer, in the range -2**(8*255-1) to 2**(8*255-1)-1,
  # encoded as two's complement.
  class Long1 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u1
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Length prefixed string, between 0 and 2**64-1 bytes long.
  # 
  # Only a 64-bit build of Python would produce a pickle containing strings
  # large enough to need this type. Such a pickle could not be unpickled on
  # a 32-bit build of Python, because the string would be larger than
  # `sys.maxsize`.
  class Bytes8 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u8le
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Length prefixed byte string, between 0 and 255 bytes long.
  class Bytes1 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u1
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Length prefixed string, between 0 and 2**32-1 bytes long
  class Bytes4 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u4le
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Some opcodes take no argument, this empty type is used for them.
  class NoArg < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      self
    end
  end

  ##
  # Unquoted string, does not contain string escapes.
  class StringnlNoescape < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val = (@_io.read_bytes_term(10, false, true, true)).force_encoding("ascii")
      self
    end
    attr_reader :val
  end

  ##
  # Integer, encoded with the ASCII chracters [0-9-], followed by 'L'.
  class DecimalnlLong < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val = (@_io.read_bytes_term(10, false, true, true)).force_encoding("ascii")
      self
    end
    attr_reader :val
  end

  ##
  # Length prefixed string, between 0 and 2**32-1 bytes long
  class Unicodestring4 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u4le
      @val = (@_io.read_bytes(len)).force_encoding("utf8")
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Unquoted string, containing Python Unicode escapes.
  class Unicodestringnl < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val = (@_io.read_bytes_term(10, false, true, true)).force_encoding("ascii")
      self
    end
    attr_reader :val
  end

  ##
  # Large signed integer, in the range -2**(8*2**32-1) to 2**(8*2**32-1)-1,
  # encoded as two's complement.
  class Long4 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u4le
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Length prefixed string, between 0 and 255 bytes long. Encoding is
  # unspecified.
  # 
  # The default Python 2.x string type (`str`) is a sequence of bytes.
  # These are pickled as `string1` or `string4`, when protocol == 2.
  # The bytes are written directly, no explicit encoding is performed.
  # 
  # Python 3.x will not pickle an object as `string1` or `string4`.
  # Instead, opcodes and types with a known encoding are used.
  # When unpickling
  # 
  # - `pickle.Unpickler` objects default to ASCII, which can be overriden
  # - `pickletools.dis` uses latin1, and cannot be overriden
  # @see https://github.com/python/cpython/blob/bb8071a4cae/Lib/pickle.py#L486-L495 Source
  class String1 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u1
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Length prefixed string, between 0 and 2**64-1 bytes long.
  # 
  # The contents are deserilised into a `bytearray` object.
  class Bytearray8 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u8le
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Integer or boolean, encoded with the ASCII characters [0-9-].
  # 
  # The values '00' and '01' encode the Python values `False` and `True`.
  # Normally a value would not contain leading '0' characters.
  class DecimalnlShort < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val = (@_io.read_bytes_term(10, false, true, true)).force_encoding("ascii")
      self
    end
    attr_reader :val
  end

  ##
  # Length prefixed string, between 0 and 255 bytes long
  class Unicodestring1 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_u1
      @val = (@_io.read_bytes(len)).force_encoding("utf8")
      self
    end
    attr_reader :len
    attr_reader :val
  end

  ##
  # Quoted string, possibly containing Python string escapes.
  class Stringnl < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val = (@_io.read_bytes_term(10, false, true, true)).force_encoding("ascii")
      self
    end
    attr_reader :val
  end

  ##
  # Pair of unquoted, unescaped strings.
  class StringnlNoescapePair < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val1 = StringnlNoescape.new(@_io, self, @_root)
      @val2 = StringnlNoescape.new(@_io, self, @_root)
      self
    end
    attr_reader :val1
    attr_reader :val2
  end

  ##
  # Length prefixed string, between 0 and 2**31-1 bytes long. Encoding is
  # unspecified.
  # 
  # Although the len field is signed, any length < 0 will raise an exception
  # during unpickling.
  # 
  # See the documentation for `string1` for further detail about encodings.
  # @see https://github.com/python/cpython/blob/bb8071a4cae/Lib/pickle.py#L486-L495 Source
  class String4 < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @len = @_io.read_s4le
      @val = @_io.read_bytes(len)
      self
    end
    attr_reader :len
    attr_reader :val
  end
  class Op < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @code = Kaitai::Struct::Stream::resolve_enum(PythonPickle::OPCODE, @_io.read_u1)
      case code
      when :opcode_ext4
        @arg = @_io.read_u4le
      when :opcode_tuple1
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_setitem
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_readonly_buffer
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_stop
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_ext2
        @arg = @_io.read_u2le
      when :opcode_empty_tuple
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_newtrue
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_long
        @arg = DecimalnlLong.new(@_io, self, @_root)
      when :opcode_newobj
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_bytearray8
        @arg = Bytearray8.new(@_io, self, @_root)
      when :opcode_put
        @arg = DecimalnlShort.new(@_io, self, @_root)
      when :opcode_stack_global
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_pop_mark
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_append
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_newfalse
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_binpersid
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_build
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_empty_dict
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_tuple2
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_long4
        @arg = Long4.new(@_io, self, @_root)
      when :opcode_next_buffer
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_appends
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_binbytes
        @arg = Bytes4.new(@_io, self, @_root)
      when :opcode_dup
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_list
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_proto
        @arg = @_io.read_u1
      when :opcode_pop
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_frame
        @arg = @_io.read_u8le
      when :opcode_string
        @arg = Stringnl.new(@_io, self, @_root)
      when :opcode_binunicode
        @arg = Unicodestring4.new(@_io, self, @_root)
      when :opcode_float
        @arg = Floatnl.new(@_io, self, @_root)
      when :opcode_reduce
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_global_opcode
        @arg = StringnlNoescapePair.new(@_io, self, @_root)
      when :opcode_binput
        @arg = @_io.read_u1
      when :opcode_memoize
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_persid
        @arg = StringnlNoescape.new(@_io, self, @_root)
      when :opcode_ext1
        @arg = @_io.read_u1
      when :opcode_none
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_short_binunicode
        @arg = Unicodestring1.new(@_io, self, @_root)
      when :opcode_obj
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_binfloat
        @arg = @_io.read_f8be
      when :opcode_newobj_ex
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_empty_list
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_tuple
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_binunicode8
        @arg = Unicodestring8.new(@_io, self, @_root)
      when :opcode_binget
        @arg = @_io.read_u1
      when :opcode_dict
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_binstring
        @arg = String4.new(@_io, self, @_root)
      when :opcode_setitems
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_binint2
        @arg = @_io.read_u2le
      when :opcode_binbytes8
        @arg = Bytes8.new(@_io, self, @_root)
      when :opcode_binint1
        @arg = @_io.read_u1
      when :opcode_inst
        @arg = StringnlNoescapePair.new(@_io, self, @_root)
      when :opcode_long_binget
        @arg = @_io.read_u4le
      when :opcode_long_binput
        @arg = @_io.read_u4le
      when :opcode_int
        @arg = DecimalnlShort.new(@_io, self, @_root)
      when :opcode_binint
        @arg = @_io.read_s4le
      when :opcode_unicode
        @arg = Unicodestringnl.new(@_io, self, @_root)
      when :opcode_long1
        @arg = Long1.new(@_io, self, @_root)
      when :opcode_short_binstring
        @arg = String1.new(@_io, self, @_root)
      when :opcode_mark
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_frozenset
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_tuple3
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_additems
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_get
        @arg = DecimalnlShort.new(@_io, self, @_root)
      when :opcode_empty_set
        @arg = NoArg.new(@_io, self, @_root)
      when :opcode_short_binbytes
        @arg = Bytes1.new(@_io, self, @_root)
      end
      self
    end

    ##
    # Operation code that determines which action should be
    # performed next by the Pickle Virtual Machine. Some opcodes
    # are only available in later versions of the Pickle protocol.
    attr_reader :code

    ##
    # Optional argument for the operation. Data type and length
    # are determined by the value of the opcode.
    attr_reader :arg
  end

  ##
  # Double float, encoded with the ASCII characters [0-9.e+-], '-inf', 'inf',
  # or 'nan'.
  class Floatnl < Kaitai::Struct::Struct
    def initialize(_io, _parent = nil, _root = self)
      super(_io, _parent, _root)
      _read
    end

    def _read
      @val = (@_io.read_bytes_term(10, false, true, true)).force_encoding("ascii")
      self
    end
    attr_reader :val
  end
  attr_reader :ops
end