Android Dalvik VM executable (dex): format specification

Android OS application executables are typically stored in their own format, optimized for more efficient execution in the Dalvik virtual machine.

This format is loosely similar to the Java .class file format and generally holds a similar set of data: classes, methods, fields, annotations, etc.

File extension

dex

KS implementation details

License: Apache-2.0

References

This page hosts a formal specification of Android Dalvik VM executable (dex) using Kaitai Struct. This specification can be automatically translated into a variety of programming languages to get a parsing library.
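As a rough illustration of what a parser for this format has to do with the fixed-size header, the Python sketch below decodes header_item by hand using only the standard library and rechecks the two integrity fields. It is not the code Kaitai Struct generates: the helper names parse_header and verify_integrity are invented for this example, and only the field names, sizes, and byte order are taken from the specification on this page.

```python
import hashlib
import struct
import zlib

# Layout of header_item: magic, version_str, checksum, signature, then
# twenty little-endian u4 fields (0x70 bytes in total).
_HEADER = struct.Struct("<4s4sI20s20I")
_U4_FIELDS = (
    "file_size", "header_size", "endian_tag", "link_size", "link_off",
    "map_off", "string_ids_size", "string_ids_off", "type_ids_size",
    "type_ids_off", "proto_ids_size", "proto_ids_off", "field_ids_size",
    "field_ids_off", "method_ids_size", "method_ids_off",
    "class_defs_size", "class_defs_off", "data_size", "data_off",
)


def parse_header(buf):
    """Decode the header at the start of buf into a dict (hypothetical helper)."""
    if len(buf) < _HEADER.size:
        raise ValueError("buffer is shorter than a dex header")
    magic, version, checksum, signature, *rest = _HEADER.unpack_from(buf, 0)
    if magic != b"dex\n":
        raise ValueError("bad magic")
    header = {
        # Like a strz field, stop at the first NUL byte.
        "version_str": version.split(b"\x00", 1)[0].decode("ascii"),
        "checksum": checksum,
        "signature": signature,
    }
    header.update(zip(_U4_FIELDS, rest))
    return header


def verify_integrity(buf):
    """Recompute checksum and signature and compare them with the header."""
    header = parse_header(buf)
    # The Adler-32 checksum covers everything after magic (8 bytes)
    # and the checksum field itself (4 bytes).
    checksum_ok = zlib.adler32(buf[12:]) == header["checksum"]
    # The SHA-1 signature covers everything after magic, checksum,
    # and the 20-byte signature field.
    signature_ok = hashlib.sha1(buf[32:]).digest() == header["signature"]
    return checksum_ok and signature_ok
```

With a real file, the whole content would be read into memory first, for example `verify_integrity(open("classes.dex", "rb").read())`, where the path is only a placeholder.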

Format specification in Kaitai Struct YAML

meta:
  id: dex
  title: Android Dalvik VM executable (dex)
  file-extension: dex
  xref:
    pronom: fmt/694
    wikidata: Q29000585
  tags:
    - android
    - executable
  license: Apache-2.0
  imports:
    - /common/vlq_base128_le
  endian: le
doc: |
  Android OS application executables are typically stored in their own
  format, optimized for more efficient execution in the Dalvik virtual
  machine.

  This format is loosely similar to the Java .class file format and
  generally holds a similar set of data: classes, methods, fields,
  annotations, etc.
doc-ref: https://source.android.com/docs/core/runtime/dex-format
seq:
  - id: header
    type: header_item
instances:
  string_ids:
    pos: header.string_ids_off
    type: string_id_item
    repeat: expr
    repeat-expr: header.string_ids_size
    doc: |
      string identifiers list.

      These are identifiers for all the strings used by this file, either for
      internal naming (e.g., type descriptors) or as constant objects referred to by code.

      This list must be sorted by string contents, using UTF-16 code point values
      (not in a locale-sensitive manner), and it must not contain any duplicate entries.
  type_ids:
    pos: header.type_ids_off
    type: type_id_item
    repeat: expr
    repeat-expr: header.type_ids_size
    doc: |
      type identifiers list.

      These are identifiers for all types (classes, arrays, or primitive types)
      referred to by this file, whether defined in the file or not.

      This list must be sorted by string_id index, and it must not contain any duplicate entries.
  proto_ids:
    pos: header.proto_ids_off
    type: proto_id_item
    repeat: expr
    repeat-expr: header.proto_ids_size
    doc: |
      method prototype identifiers list.

      These are identifiers for all prototypes referred to by this file.

      This list must be sorted in return-type (by type_id index) major order,
      and then by argument list (lexicographic ordering, individual arguments
      ordered by type_id index). The list must not contain any duplicate entries.
  field_ids:
    pos: header.field_ids_off
    type: field_id_item
    repeat: expr
    repeat-expr: header.field_ids_size
    doc: |
      field identifiers list.

      These are identifiers for all fields referred to by this file, whether defined in the file or not.

      This list must be sorted, where the defining type (by type_id index)
      is the major order, field name (by string_id index) is the intermediate
      order, and type (by type_id index) is the minor order.

      The list must not contain any duplicate entries.
  method_ids:
    pos: header.method_ids_off
    type: method_id_item
    repeat: expr
    repeat-expr: header.method_ids_size
    doc: |
      method identifiers list.

      These are identifiers for all methods referred to by this file,
      whether defined in the file or not.

      This list must be sorted, where the defining type (by type_id index)
      is the major order, method name (by string_id index) is the intermediate
      order, and method prototype (by proto_id index) is the minor order.

      The list must not contain any duplicate entries.
  class_defs:
    pos: header.class_defs_off
    type: class_def_item
    repeat: expr
    repeat-expr: header.class_defs_size
    doc: |
      class definitions list.

      The classes must be ordered such that a given class's superclass and
      implemented interfaces appear in the list earlier than the referring class.

      Furthermore, it is invalid for a definition for the same-named class to
      appear more than once in the list.
  #call_site_ids:
  #  pos: header.???
  #  type: call_site_id_item
  #  repeat: expr
  #  repeat-expr: header.???
  #  doc: |
  #    call site identifiers list.
  #
  #    These are identifiers for all call sites referred to by this file,
  #    whether defined in the file or not.
  #
  #    This list must be sorted in ascending order of call_site_off.
  link_data:
    pos: header.link_off
    size: header.link_size
    doc: |
      data used in statically linked files.

      The format of the data in this section is left unspecified by this document.

      This section is empty in unlinked files, and runtime implementations may
      use it as they see fit.
  data:
    pos: header.data_off
    size: header.data_size
    doc: |
      data area, containing all the support data for the tables listed above.

      Different items have different alignment requirements, and padding bytes
      are inserted before each item if necessary to achieve proper alignment.
  map:
    pos: header.map_off
    type: map_list
types:
  header_item:
    seq:
      - id: magic
        contents: "dex\n"
      - id: version_str
        size: 4
        type: strz
        encoding: ascii
      - id: checksum
        type: u4
        doc: |
          adler32 checksum of the rest of the file (everything but magic and this field);
          used to detect file corruption
      - id: signature
        size: 20
        doc: |
          SHA-1 signature (hash) of the rest of the file (everything but magic, checksum,
          and this field); used to uniquely identify files
      - id: file_size
        type: u4
        doc: |
          size of the entire file (including the header), in bytes
      - id: header_size
        type: u4
        # guard: 0x70
        doc: |
          size of the header (this entire section), in bytes. This allows for at
          least a limited amount of backwards/forwards compatibility without
          invalidating the format.
      - id: endian_tag
        type: u4
        enum: endian_constant
      - id: link_size
        type: u4
        doc: |
          size of the link section, or 0 if this file isn't statically linked
      - id: link_off
        type: u4
        doc: |
          offset from the start of the file to the link section, or 0 if link_size == 0.
          The offset, if non-zero, should be to an offset into the link_data section.
          The format of the data pointed at is left unspecified by this document;
          this header field (and the previous) are left as hooks for use by runtime implementations.
      - id: map_off
        type: u4
        doc: |
          offset from the start of the file to the map item.
          The offset, which must be non-zero, should be to an offset into the data
          section, and the data should be in the format specified by "map_list" below.
      - id: string_ids_size
        type: u4
        doc: |
          count of strings in the string identifiers list
      - id: string_ids_off
        type: u4
        doc: |
          offset from the start of the file to the string identifiers list,
          or 0 if string_ids_size == 0 (admittedly a strange edge case).
          The offset, if non-zero, should be to the start of the string_ids section.
      - id: type_ids_size
        type: u4
        doc: |
          count of elements in the type identifiers list, at most 65535
      - id: type_ids_off
        type: u4
        doc: |
          offset from the start of the file to the type identifiers list,
          or 0 if type_ids_size == 0 (admittedly a strange edge case).
          The offset, if non-zero, should be to the start of the type_ids section.
      - id: proto_ids_size
        type: u4
        doc: |
          count of elements in the prototype identifiers list, at most 65535
      - id: proto_ids_off
        type: u4
        doc: |
          offset from the start of the file to the prototype identifiers list,
          or 0 if proto_ids_size == 0 (admittedly a strange edge case).
          The offset, if non-zero, should be to the start of the proto_ids section.
      - id: field_ids_size
        type: u4
        doc: |
          count of elements in the field identifiers list
      - id: field_ids_off
        type: u4
        doc: |
          offset from the start of the file to the field identifiers list,
          or 0 if field_ids_size == 0.
          The offset, if non-zero, should be to the start of the field_ids section.
      - id: method_ids_size
        type: u4
        doc: |
          count of elements in the method identifiers list
      - id: method_ids_off
        type: u4
        doc: |
          offset from the start of the file to the method identifiers list,
          or 0 if method_ids_size == 0.
          The offset, if non-zero, should be to the start of the method_ids section.
      - id: class_defs_size
        type: u4
        doc: |
          count of elements in the class definitions list
      - id: class_defs_off
        type: u4
        doc: |
          offset from the start of the file to the class definitions list,
          or 0 if class_defs_size == 0 (admittedly a strange edge case).
          The offset, if non-zero, should be to the start of the class_defs section.
      - id: data_size
        type: u4
        doc: |
          Size of data section in bytes. Must be an even multiple of sizeof(uint).
      - id: data_off
        type: u4
        doc: |
          offset from the start of the file to the start of the data section.
    enums:
      endian_constant:
        0x12345678: endian_constant
        0x78563412: reverse_endian_constant
  string_id_item:
    -webide-representation: "{value.data} (offs={string_data_off})"
    seq:
      - id: string_data_off
        type: u4
        doc: |
          offset from the start of the file to the string data for this item.
          The offset should be to a location in the data section, and the data
          should be in the format specified by "string_data_item" below.
          There is no alignment requirement for the offset.
    types:
      string_data_item:
        -webide-representation: "{data}"
        seq:
          - id: utf16_size
            type: vlq_base128_le
          - id: data
            size: utf16_size.value
            type: str
            encoding: ascii
    instances:
      value:
        pos: string_data_off
        type: string_data_item
        -webide-parse-mode: eager
  type_id_item:
    -webide-representation: "{type_name}"
    seq:
      - id: descriptor_idx
        type: u4
        doc: |
          index into the string_ids list for the descriptor string of this type.
          The string must conform to the syntax for TypeDescriptor, defined above.
    instances:
      type_name:
        value: _root.string_ids[descriptor_idx].value.data
        -webide-parse-mode: eager
  proto_id_item:
    -webide-representation: "shorty_idx={shorty_idx} return_type_idx={return_type_idx} parameters_off={parameters_off}"
    seq:
      - id: shorty_idx
        type: u4
        doc: |
          index into the string_ids list for the short-form descriptor string of this prototype.
          The string must conform to the syntax for ShortyDescriptor, defined above,
          and must correspond to the return type and parameters of this item.
      - id: return_type_idx
        type: u4
        doc: |
          index into the type_ids list for the return type of this prototype
      - id: parameters_off
        type: u4
        doc: |
          offset from the start of the file to the list of parameter types for this prototype,
          or 0 if this prototype has no parameters.
          This offset, if non-zero, should be in the data section, and the data
          there should be in the format specified by "type_list" below.
          Additionally, there should be no reference to the type void in the list.
    instances:
      shorty_desc:
        value: _root.string_ids[shorty_idx].value.data
        doc: short-form descriptor string of this prototype, as pointed to by shorty_idx
      params_types:
        io: _root._io
        pos: parameters_off
        type: type_list
        if: parameters_off != 0
        doc: list of parameter types for this prototype
      return_type:
        value: _root.type_ids[return_type_idx].type_name
        doc: return type of this prototype
  field_id_item:
    -webide-representation: "class_idx={class_idx} type_idx={type_idx} name_idx={name_idx}"
    seq:
      - id: class_idx
        type: u2
        doc: |
          index into the type_ids list for the definer of this field.
          This must be a class type, and not an array or primitive type.
      - id: type_idx
        type: u2
        doc: |
          index into the type_ids list for the type of this field
      - id: name_idx
        type: u4
        doc: |
          index into the string_ids list for the name of this field.
          The string must conform to the syntax for MemberName, defined above.
    instances:
      class_name:
        value: _root.type_ids[class_idx].type_name
        doc: the definer of this field
      type_name:
        value: _root.type_ids[type_idx].type_name
        doc: the type of this field
      field_name:
        value: _root.string_ids[name_idx].value.data
        doc: the name of this field
  method_id_item:
    -webide-representation: "class_idx={class_idx} proto_idx={proto_idx} name_idx={name_idx}"
    seq:
      - id: class_idx
        type: u2
        doc: |
          index into the type_ids list for the definer of this method.
          This must be a class or array type, and not a primitive type.
      - id: proto_idx
        type: u2
        doc: |
          index into the proto_ids list for the prototype of this method
      - id: name_idx
        type: u4
        doc: |
          index into the string_ids list for the name of this method.
          The string must conform to the syntax for MemberName, defined above.
    instances:
      class_name:
        value: _root.type_ids[class_idx].type_name
        doc: the definer of this method
      proto_desc:
        value: _root.proto_ids[proto_idx].shorty_desc
        doc: the short-form descriptor of the prototype of this method
      method_name:
        value: _root.string_ids[name_idx].value.data
        doc: the name of this method
  class_def_item:
    -webide-representation: "{access_flags} {type_name}"
    seq:
      - id: class_idx
        type: u4
        doc: |
          index into the type_ids list for this class.

          This must be a class type, and not an array or primitive type.
      - id: access_flags
        type: u4
        enum: class_access_flags
        doc: |
          access flags for the class (public, final, etc.).

          See "access_flags Definitions" for details.
      - id: superclass_idx
        type: u4
        doc: |
          index into the type_ids list for the superclass,
          or the constant value NO_INDEX if this class has no superclass
          (i.e., it is a root class such as Object).

          If present, this must be a class type, and not an array or primitive type.
      - id: interfaces_off
        type: u4
        doc: |
          offset from the start of the file to the list of interfaces, or 0 if there are none.

          This offset should be in the data section, and the data there should
          be in the format specified by "type_list" below. Each of the elements
          of the list must be a class type (not an array or primitive type),
          and there must not be any duplicates.
      - id: source_file_idx
        type: u4
        doc: |
          index into the string_ids list for the name of the file containing
          the original source for (at least most of) this class, or the
          special value NO_INDEX to represent a lack of this information.

          The debug_info_item of any given method may override this source file,
          but the expectation is that most classes will only come from one source file.
      - id: annotations_off
        type: u4
        doc: |
          offset from the start of the file to the annotations structure for
          this class, or 0 if there are no annotations on this class.

          This offset, if non-zero, should be in the data section, and the data
          there should be in the format specified by "annotations_directory_item"
          below, with all items referring to this class as the definer.
      - id: class_data_off
        type: u4
        doc: |
          offset from the start of the file to the associated class data for this
          item, or 0 if there is no class data for this class.

          (This may be the case, for example, if this class is a marker interface.)

          The offset, if non-zero, should be in the data section, and the data
          there should be in the format specified by "class_data_item" below,
          with all items referring to this class as the definer.
      - id: static_values_off
        type: u4
        doc: |
          offset from the start of the file to the list of initial values for
          static fields, or 0 if there are none (and all static fields are to be
          initialized with 0 or null).

          This offset should be in the data section, and the data there should
          be in the format specified by "encoded_array_item" below.

          The size of the array must be no larger than the number of static fields
          declared by this class, and the elements correspond to the static fields
          in the same order as declared in the corresponding field_list.

          The type of each array element must match the declared type of its
          corresponding field.

          If there are fewer elements in the array than there are static fields,
          then the leftover fields are initialized with a type-appropriate 0 or null.
    instances:
      type_name:
        value: _root.type_ids[class_idx].type_name
        -webide-parse-mode: eager
      class_data:
        pos: class_data_off
        type: class_data_item
        if: class_data_off != 0
      static_values:
        pos: static_values_off
        type: encoded_array_item
        if: static_values_off != 0
  encoded_array_item:
    seq:
      - id: value
        type: encoded_array
  annotation_element:
    seq:
      - id: name_idx
        type: vlq_base128_le
        doc: |
          element name, represented as an index into the string_ids section.

          The string must conform to the syntax for MemberName, defined above.
      - id: value
        type: encoded_value
        doc: |
          element value
  encoded_annotation:
    seq:
      - id: type_idx
        type: vlq_base128_le
        doc: |
          type of the annotation.

          This must be a class (not array or primitive) type.
      - id: size
        type: vlq_base128_le
        doc: |
          number of name-value mappings in this annotation
      - id: elements
        type: annotation_element
        repeat: expr
        repeat-expr: size.value
        doc: |
          elements of the annotation, represented directly in-line (not as offsets).

          Elements must be sorted in increasing order by string_id index.
  encoded_value:
    -webide-representation: "{value_type}: {value} (arg={value_arg})"
    seq:
      - id: value_arg
        type: b3
      - id: value_type
        type: b5
        enum: value_type_enum
      - id: value
        type:
          switch-on: value_type
          cases:
            # TODO: dynamic sizes
            value_type_enum::byte:          s1
            value_type_enum::short:         s2
            value_type_enum::char:          u2
            value_type_enum::int:           s4
            value_type_enum::long:          s8
            value_type_enum::float:         f4
            value_type_enum::double:        f8
            value_type_enum::method_type:   u4
            value_type_enum::method_handle: u4
            value_type_enum::string:        u4
            value_type_enum::type:          u4
            value_type_enum::field:         u4
            value_type_enum::method:        u4
            value_type_enum::enum:          u4
            value_type_enum::array:         encoded_array
            value_type_enum::annotation:    encoded_annotation
    enums:
      value_type_enum:
        0x00: byte
        0x02: short
        0x03: char
        0x04: int
        0x06: long
        0x10: float
        0x11: double
        0x15: method_type
        0x16: method_handle
        0x17: string
        0x18: type
        0x19: field
        0x1a: method
        0x1b: enum
        0x1c: array
        0x1d: annotation
        0x1e: "null"
        0x1f: boolean
  encoded_array:
    seq:
      - id: size
        type: vlq_base128_le
      - id: values
        type: encoded_value
        repeat: expr
        repeat-expr: size.value
  call_site_id_item:
    seq:
      - id: call_site_off
        type: u4
        doc: |
          offset from the start of the file to call site definition.

          The offset should be in the data section, and the data there should
          be in the format specified by "call_site_item" below.
  encoded_field:
    seq:
      - id: field_idx_diff
        type: vlq_base128_le
        doc: |
          index into the field_ids list for the identity of this field
          (includes the name and descriptor), represented as a difference
          from the index of previous element in the list.

          The index of the first element in a list is represented directly.
      - id: access_flags
        type: vlq_base128_le
        doc: |
          access flags for the field (public, final, etc.).

          See "access_flags Definitions" for details.
  encoded_method:
    seq:
      - id: method_idx_diff
        type: vlq_base128_le
        doc: |
          index into the method_ids list for the identity of this method
          (includes the name and descriptor), represented as a difference
          from the index of previous element in the list.

          The index of the first element in a list is represented directly.
      - id: access_flags
        type: vlq_base128_le
        doc: |
          access flags for the method (public, final, etc.).

          See "access_flags Definitions" for details.
      - id: code_off
        type: vlq_base128_le
        doc: |
          offset from the start of the file to the code structure for this method,
          or 0 if this method is either abstract or native.

          The offset should be to a location in the data section.

          The format of the data is specified by "code_item" below.
  class_data_item:
    seq:
      - id: static_fields_size
        type: vlq_base128_le
        doc: |
          the number of static fields defined in this item
      - id: instance_fields_size
        type: vlq_base128_le
        doc: |
          the number of instance fields defined in this item
      - id: direct_methods_size
        type: vlq_base128_le
        doc: |
          the number of direct methods defined in this item
      - id: virtual_methods_size
        type: vlq_base128_le
        doc: |
          the number of virtual methods defined in this item
      - id: static_fields
        type: encoded_field
        repeat: expr
        repeat-expr: static_fields_size.value
        doc: |
          the defined static fields, represented as a sequence of encoded elements.

          The fields must be sorted by field_idx in increasing order.
      - id: instance_fields
        type: encoded_field
        repeat: expr
        repeat-expr: instance_fields_size.value
        doc: |
          the defined instance fields, represented as a sequence of encoded elements.

          The fields must be sorted by field_idx in increasing order.
      - id: direct_methods
        type: encoded_method
        repeat: expr
        repeat-expr: direct_methods_size.value
        doc: |
          the defined direct (any of static, private, or constructor) methods,
          represented as a sequence of encoded elements.

          The methods must be sorted by method_idx in increasing order.
      - id: virtual_methods
        type: encoded_method
        repeat: expr
        repeat-expr: virtual_methods_size.value
        doc: |
          the defined virtual (none of static, private, or constructor) methods,
          represented as a sequence of encoded elements.

          This list should not include inherited methods unless overridden by
          the class that this item represents.

          The methods must be sorted by method_idx in increasing order.

          The method_idx of a virtual method must not be the same as any direct method.
  map_item:
    -webide-representation: "{type}: offs={offset}, size={size}"
    seq:
      - id: type
        type: u2
        enum: map_item_type
        doc: |
          type of the items; see table below
      - id: unused
        type: u2
        doc: |
          (unused)
      - id: size
        type: u4
        doc: |
          count of the number of items to be found at the indicated offset
      - id: offset
        type: u4
        doc: |
          offset from the start of the file to the items in question
    enums:
      map_item_type:
        0x0000: header_item
        0x0001: string_id_item
        0x0002: type_id_item
        0x0003: proto_id_item
        0x0004: field_id_item
        0x0005: method_id_item
        0x0006: class_def_item
        0x0007: call_site_id_item
        0x0008: method_handle_item
        0x1000: map_list
        0x1001: type_list
        0x1002: annotation_set_ref_list
        0x1003: annotation_set_item
        0x2000: class_data_item
        0x2001: code_item
        0x2002: string_data_item
        0x2003: debug_info_item
        0x2004: annotation_item
        0x2005: encoded_array_item
        0x2006: annotations_directory_item
  map_list:
    seq:
      - id: size
        type: u4
      - id: list
        type: map_item
        repeat: expr
        repeat-expr: size
  type_item:
    seq:
      - id: type_idx
        type: u2
    instances:
      value:
        value: _root.type_ids[type_idx].type_name
  type_list:
    seq:
      - id: size
        type: u4
      - id: list
        type: type_item
        repeat: expr
        repeat-expr: size
enums:
  class_access_flags:
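    # Flags marked with * are only allowed for InnerClass annotations
    # and must never be set in a class_def_item.
    0x0001: public     # public: visible everywhere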
    0x0002: private    # * private: only visible to defining class
    0x0004: protected  # * protected: visible to package and subclasses
    0x0008: static     # * static: is not constructed with an outer this reference
    0x0010: final      # final: not subclassable
    0x0200: interface  # interface: multiply-implementable abstract class
    0x0400: abstract   # abstract: not directly instantiable
    0x1000: synthetic  # not directly defined in source code
    0x2000: annotation # declared as an annotation class
    0x4000: enum       # declared as an enumerated type
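
Many fields in the specification above use the imported vlq_base128_le type, and encoded_field and encoded_method store their indices as differences from the previous element. The Python sketch below shows one way to decode such a value and to recover absolute indices, assuming the values are unsigned LEB128 as in the dex documentation. The function names read_uleb128 and absolute_indices are hypothetical and do not belong to any Kaitai Struct runtime.

```python
def read_uleb128(buf, pos=0):
    """Decode one unsigned LEB128 value from buf starting at pos.

    Returns (value, next_pos). Raises IndexError on truncated input.
    """
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        # The low 7 bits carry data, least significant group first.
        value |= (byte & 0x7F) << shift
        # A clear high bit marks the last byte of the value.
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def absolute_indices(diffs):
    """Turn a sequence of *_idx_diff values into absolute indices."""
    indices = []
    current = 0
    for diff in diffs:
        # The first element is stored directly; later ones are deltas.
        current += diff
        indices.append(current)
    return indices
```

For example, field_idx_diff values of 3, 1, 4 in a static_fields list would refer to field_ids entries 3, 4, and 8.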