Google Protocol Buffers (protobuf): format specification

Google Protocol Buffers (AKA protobuf) is a popular data serialization scheme used for communication protocols, data storage, etc. There are implementations are available for almost every popular language. The focus points of this scheme are brevity (data is encoded in a very size-efficient manner) and extensibility (one can add keys to the structure, while keeping it readable in previous version of software).

Protobuf uses semi-self-describing encoding scheme for its messages. It means that it is possible to parse overall structure of the message (skipping over fields one can't understand), but to fully understand the message, one needs a protocol definition file (.proto). To be specific:

"Keys" in key-value pairs provided in the message are identified only with an integer "field tag". .proto file provides info on which symbolic field names these field tags map to.
"Keys" also provide something called "wire type". It's not a data type in its common sense (i.e. you can't, for example, distinguish sint32 vs uint32 vs some enum, or string from bytes), but it's enough information to determine how many bytes to parse. Interpretation of the value should be done according to the type specified in .proto file.
There's no direct information on which fields are optional / required, which fields may be repeated or constitute a map, what restrictions are placed on fields usage in a single message, what are the fields' default values, etc, etc.

KS implementation details

License: MIT

Minimal Kaitai Struct required: 0.7

References

This page hosts a formal specification of Google Protocol Buffers (protobuf) using Kaitai Struct. This specification can be automatically translated into a variety of programming languages to get a parsing library.

Block diagram

Format specification in Kaitai Struct YAML

meta:
  id: google_protobuf
  title: Google Protocol Buffers (protobuf)
  xref:
    justsolve: Protobuf
    wikidata: Q1645574
  license: MIT
  ks-version: 0.7
  imports:
    - /common/vlq_base128_le
doc: |
  Google Protocol Buffers (AKA protobuf) is a popular data
  serialization scheme used for communication protocols, data storage,
  etc. There are implementations are available for almost every
  popular language. The focus points of this scheme are brevity (data
  is encoded in a very size-efficient manner) and extensibility (one
  can add keys to the structure, while keeping it readable in previous
  version of software).

  Protobuf uses semi-self-describing encoding scheme for its
  messages. It means that it is possible to parse overall structure of
  the message (skipping over fields one can't understand), but to
  fully understand the message, one needs a protocol definition file
  (`.proto`). To be specific:

  * "Keys" in key-value pairs provided in the message are identified
    only with an integer "field tag". `.proto` file provides info on
    which symbolic field names these field tags map to.
  * "Keys" also provide something called "wire type". It's not a data
    type in its common sense (i.e. you can't, for example, distinguish
    `sint32` vs `uint32` vs some enum, or `string` from `bytes`), but
    it's enough information to determine how many bytes to
    parse. Interpretation of the value should be done according to the
    type specified in `.proto` file.
  * There's no direct information on which fields are optional /
    required, which fields may be repeated or constitute a map, what
    restrictions are placed on fields usage in a single message, what
    are the fields' default values, etc, etc.
doc-ref: https://protobuf.dev/programming-guides/encoding/
seq:
  - id: pairs
    type: pair
    repeat: eos
    doc: Key-value pairs which constitute a message
types:
  pair:
    doc: Key-value pair
    seq:
      - id: key
        type: vlq_base128_le
        doc: |
          Key is a bit-mapped variable-length integer: lower 3 bits
          are used for "wire type", and everything higher designates
          an integer "field tag".
      - id: value
        doc: |
          Value that corresponds to field identified by
          `field_tag`. Type is determined approximately: there is
          enough information to parse it unambiguously from a stream,
          but further information from `.proto` file is required to
          interpret it properly.
        type:
          switch-on: wire_type
          cases:
            'wire_types::varint': vlq_base128_le
            'wire_types::len_delimited': delimited_bytes
            'wire_types::bit_64': u8le
            'wire_types::bit_32': u4le
    instances:
      wire_type:
        value: 'key.value & 0b111'
        enum: wire_types
        doc: |
          "Wire type" is a part of the "key" that carries enough
          information to parse value from the wire, i.e. read correct
          amount of bytes, but there's not enough information to
          interpret in unambiguously. For example, one can't clearly
          distinguish 64-bit fixed-sized integers from 64-bit floats,
          signed zigzag-encoded varints from regular unsigned varints,
          arbitrary bytes from UTF-8 encoded strings, etc.
      field_tag:
        value: 'key.value >> 3'
        doc: |
          Identifies a field of protocol. One can look up symbolic
          field name in a `.proto` file by this field tag.
    enums:
      wire_types:
        0: varint
        1: bit_64
        2: len_delimited
        3: group_start
        4: group_end
        5: bit_32
  delimited_bytes:
    seq:
      - id: len
        type: vlq_base128_le
      - id: body
        size: len.value