SQLite3 database file: format specification

SQLite3 is a popular serverless SQL engine, implemented as a library to be used within other applications. It keeps its databases as regular disk files.

Every database file is segmented into pages. First page (starting at the very beginning) is special: it contains a file-global header which specifies some data relevant to proper parsing (i.e. format versions, size of page, etc). After the header, normal contents of the first page follow.

Each page would be of some type, and generally, they would be reached via the links starting from the first page. First page type (root_page) is always "btree_page".

KS implementation details

License: CC0-1.0

This page hosts a formal specification of SQLite3 database file using Kaitai Struct. This specification can be automatically translated into a variety of programming languages to get a parsing library.

Block diagram

Format specification in Kaitai Struct YAML

meta:
  id: sqlite3
  title: SQLite3 database file
  endian: be
  imports:
    - /common/vlq_base128_be
  license: CC0-1.0
doc: |
  SQLite3 is a popular serverless SQL engine, implemented as a library
  to be used within other applications. It keeps its databases as
  regular disk files.

  Every database file is segmented into pages. First page (starting at
  the very beginning) is special: it contains a file-global header
  which specifies some data relevant to proper parsing (i.e. format
  versions, size of page, etc). After the header, normal contents of
  the first page follow.

  Each page would be of some type, and generally, they would be
  reached via the links starting from the first page. First page type
  (`root_page`) is always "btree_page".
doc-ref: https://www.sqlite.org/fileformat.html
seq:
  - id: magic
    contents: ["SQLite format 3", 0]
  - id: len_page_mod
    type: u2
    doc: |
      The database page size in bytes. Must be a power of two between
      512 and 32768 inclusive, or the value 1 representing a page size
      of 65536.
  - id: write_version
    type: u1
    enum: versions
  - id: read_version
    type: u1
    enum: versions
  - id: reserved_space
    type: u1
    doc: Bytes of unused "reserved" space at the end of each page. Usually 0.
  - id: max_payload_frac
    type: u1
    doc: Maximum embedded payload fraction. Must be 64.
  - id: min_payload_frac
    type: u1
    doc: Minimum embedded payload fraction. Must be 32.
  - id: leaf_payload_frac
    type: u1
    doc: Leaf payload fraction. Must be 32.
  - id: file_change_counter
    type: u4
  - id: num_pages
    type: u4
    doc: Size of the database file in pages. The "in-header database size".
  - id: first_freelist_trunk_page
    type: u4
    doc: Page number of the first freelist trunk page.
  - id: num_freelist_pages
    type: u4
    doc: Total number of freelist pages.
  - id: schema_cookie
    type: u4
  - id: schema_format
    type: u4
    doc: The schema format number. Supported schema formats are 1, 2, 3, and 4.
  - id: def_page_cache_size
    type: u4
    doc: Default page cache size.
  - id: largest_root_page
    type: u4
    doc: The page number of the largest root b-tree page when in auto-vacuum or incremental-vacuum modes, or zero otherwise.
  - id: text_encoding
    type: u4
    enum: encodings
    doc: The database text encoding. A value of 1 means UTF-8. A value of 2 means UTF-16le. A value of 3 means UTF-16be.
  - id: user_version
    type: u4
    doc: The "user version" as read and set by the user_version pragma.
  - id: is_incremental_vacuum
    type: u4
    doc: True (non-zero) for incremental-vacuum mode. False (zero) otherwise.
  - id: application_id
    type: u4
    doc: The "Application ID" set by PRAGMA application_id.
  - id: reserved
    size: 20
  - id: version_valid_for
    type: u4
  - id: sqlite_version_number
    type: u4
  - id: root_page
    type: btree_page
instances:
  len_page:
    value: 'len_page_mod == 1 ? 0x10000 : len_page_mod'
types:
  btree_page:
    seq:
      - id: page_type
        type: u1
      - id: first_freeblock
        type: u2
      - id: num_cells
        type: u2
      - id: ofs_cells
        type: u2
      - id: num_frag_free_bytes
        type: u1
      - id: right_ptr
        type: u4
        if: page_type == 2 or page_type == 5
      - id: cells
        type: ref_cell
        repeat: expr
        repeat-expr: num_cells
  ref_cell:
    seq:
      - id: ofs_body
        type: u2
    instances:
      body:
        pos: ofs_body
        type:
          switch-on: _parent.page_type
          cases:
            0x0d: cell_table_leaf
            0x05: cell_table_interior
            0x0a: cell_index_leaf
            0x02: cell_index_interior
  cell_table_leaf:
    doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages'
    seq:
      - id: len_payload
        type: vlq_base128_be
      - id: row_id
        type: vlq_base128_be
      - id: payload
        size: len_payload.value
        type: cell_payload
      # TODO: overflow
  cell_table_interior:
    doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages'
    seq:
      - id: left_child_page
        type: u4
      - id: row_id
        type: vlq_base128_be
  cell_index_leaf:
    doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages'
    seq:
      - id: len_payload
        type: vlq_base128_be
      - id: payload
        size: len_payload.value
        type: cell_payload
      # TODO: overflow
  cell_index_interior:
    doc-ref: 'https://www.sqlite.org/fileformat.html#b_tree_pages'
    seq:
      - id: left_child_page
        type: u4
      - id: len_payload
        type: vlq_base128_be
      - id: payload
        size: len_payload.value
        type: cell_payload
  cell_payload:
    doc-ref: 'https://sqlite.org/fileformat2.html#record_format'
    seq:
      - id: len_header_and_len
        type: vlq_base128_be
      - id: column_serials
        size: len_header_and_len.value - 1
        type: serials
      - id: column_contents
        repeat: expr
        repeat-expr: column_serials.entries.size
        type: column_content(column_serials.entries[_index])
  serials:
    seq:
      - id: entries
        type: vlq_base128_be
        repeat: eos
  serial:
    seq:
      - id: code
        type: vlq_base128_be
    instances:
      is_blob:
        value: 'code.value >= 12 and (code.value % 2 == 0)'
      is_string:
        value: 'code.value >= 13 and (code.value % 2 == 1)'
      len_content:
        value: (code.value - 12) / 2
        if: code.value >= 12
  column_content:
    params:
      - id: ser
        type: struct
    seq:
      - id: as_int
        type:
          switch-on: serial_type.code.value
          cases:
            1: u1
            2: u2
            3: b24
            4: u4
            5: b48
            6: u8
        if: serial_type.code.value >= 1 and serial_type.code.value <= 6
      - id: as_float
        type: f8
        if: serial_type.code.value == 7
      - id: as_blob
        size: serial_type.len_content
        if: serial_type.is_blob
      - id: as_str
        type: str
        size: serial_type.len_content
        encoding: UTF-8
#        if: _root.text_encoding == encodings::utf_8 and serial_type.is_string
    instances:
      serial_type:
        value: ser.as
enums:
  versions:
    1: legacy
    2: wal
  encodings:
    1: utf_8
    2: utf_16le
    3: utf_16be