Gzip is a popular and standard single-file archiving format. It essentially provides a container that stores the original file name, a timestamp and a few other things (such as an optional comment), basic CRCs, etc., and a file compressed by a chosen compression algorithm.
As of 2019, only one compression method is actually in use, so the body of practically every gzipped file is a raw DEFLATE stream (without a zlib header).
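To make the container layout concrete, here is a minimal Python sketch that reads the header and trailer by hand, using only the standard library. It is not the parser that Kaitai Struct generates; the function name `read_gzip_summary` and the dictionary keys are illustrative choices, and the sketch assumes a file with a single gzip member, so the last 8 bytes are taken to be the trailer.

```python
import struct
import zlib

# Bit masks of the FLG byte, as named in RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def read_gzip_summary(data: bytes) -> dict:
    """Split a single-member gzip file into header fields, body and trailer."""
    if len(data) < 18:  # 10-byte fixed header + 8-byte trailer
        raise ValueError("too short to be a gzip file")
    # ID1 ID2, CM, FLG, MTIME, XFL, OS: 10 bytes, little-endian.
    magic, method, flags, mtime, xfl, os_id = struct.unpack_from("<2sBBIBB", data, 0)
    if magic != b"\x1f\x8b":
        raise ValueError("bad gzip magic")
    pos = 10
    if flags & FEXTRA:  # XLEN, then XLEN bytes of subfields
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    name = comment = None
    if flags & FNAME:  # zero-terminated original file name
        end = data.index(b"\x00", pos)
        name, pos = data[pos:end], end + 1
    if flags & FCOMMENT:  # zero-terminated comment
        end = data.index(b"\x00", pos)
        comment, pos = data[pos:end], end + 1
    if flags & FHCRC:  # 2-byte header checksum, skipped here
        pos += 2
    # Trailer: CRC32 and ISIZE of the uncompressed data.
    crc32, isize = struct.unpack_from("<II", data, len(data) - 8)
    return {
        "method": method,
        "mod_time": mtime,
        "extra_flags": xfl,
        "os": os_id,
        "name": name,
        "comment": comment,
        "body": data[pos:-8],
        "body_crc32": crc32,
        "len_uncompressed": isize,
    }


if __name__ == "__main__":
    import gzip

    info = read_gzip_summary(gzip.compress(b"hello gzip"))
    # The body is a raw DEFLATE stream, hence the negative window size.
    original = zlib.decompress(info["body"], -15)
    print(info["method"], info["len_uncompressed"], original)
```

Passing `-15` as the window size tells `zlib` to expect a raw DEFLATE stream with no zlib header, which matches the body described above.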
This page hosts a formal specification of the .gz file format using Kaitai Struct. This specification can be automatically translated into a variety of programming languages to get a parsing library.
meta:
  id: gzip
  file-extension: gz
  xref:
    forensicswiki: gzip
    justsolve: Gzip
    mime: application/gzip
    pronom: x-fmt/266
    rfc: 1952
    wikidata: Q10287816
  license: CC0-1.0
  endian: le
doc: |
  Gzip is a popular and standard single-file archiving format. It
  essentially provides a container that stores the original file name,
  a timestamp and a few other things (such as an optional comment),
  basic CRCs, etc., and a file compressed by a chosen compression
  algorithm.

  As of 2019, only one compression method is actually in use, so the
  body of practically every gzipped file is a raw DEFLATE stream
  (without a zlib header).
doc-ref: https://www.rfc-editor.org/rfc/rfc1952
seq:
  - id: magic
    -orig-id: ID1, ID2
    contents: [0x1f, 0x8b]
  - id: compression_method
    -orig-id: CM
    type: u1
    enum: compression_methods
    doc: |
      Compression method used to compress file body. In practice, only
      one method is widely used: 8 = deflate.
  - id: flags
    -orig-id: FLG
    type: flags
  - id: mod_time
    -orig-id: MTIME
    type: u4
    doc: Last modification time of the archived file, as a UNIX timestamp.
  - id: extra_flags
    -orig-id: XFL
    type:
      switch-on: compression_method
      cases:
        'compression_methods::deflate': extra_flags_deflate
    doc: Extra flags, specific to the compression method chosen.
  - id: os
    -orig-id: OS
    type: u1
    enum: oses
    doc: OS used to compress this file.
  - id: extras
    type: extras
    if: flags.has_extra
  - id: name
    terminator: 0
    if: flags.has_name
  - id: comment
    terminator: 0
    if: flags.has_comment
  - id: header_crc16
    type: u2
    if: flags.has_header_crc
  - id: body
    size: _io.size - _io.pos - 8
    doc: |
      Compressed body of the archived file. Note that no attempt is
      made to decompress it here.
  - id: body_crc32
    -orig-id: CRC32
    type: u4
    doc: |
      CRC32 checksum of the uncompressed file body
  - id: len_uncompressed
    -orig-id: ISIZE
    type: u4
    doc: |
      Size of original uncompressed data in bytes (truncated to 32
      bits).
enums:
  compression_methods:
    8: deflate
  oses:
    0:
      id: fat
      doc: FAT filesystem (MS-DOS, OS/2, NT/Win32)
    1:
      id: amiga
      doc: Amiga
    2:
      id: vms
      doc: VMS (or OpenVMS)
    3:
      id: unix
      doc: Unix
    4:
      id: vm_cms
      doc: VM/CMS
    5:
      id: atari_tos
      doc: Atari TOS
    6:
      id: hpfs
      doc: HPFS filesystem (OS/2, NT)
    7:
      id: macintosh
      doc: Macintosh
    8:
      id: z_system
      doc: Z-System
    9:
      id: cp_m
      doc: CP/M
    10:
      id: tops_20
      doc: TOPS-20
    11:
      id: ntfs
      doc: NTFS filesystem (NT)
    12:
      id: qdos
      doc: QDOS
    13:
      id: acorn_riscos
      doc: Acorn RISCOS
    255:
      id: unknown
types:
  flags:
    seq:
      - id: reserved1
        type: b3
      - id: has_comment
        -orig-id: FCOMMENT
        type: b1
      - id: has_name
        -orig-id: FNAME
        type: b1
      - id: has_extra
        -orig-id: FEXTRA
        type: b1
        doc: If true, optional extra fields are present in the archive.
      - id: has_header_crc
        -orig-id: FHCRC
        type: b1
        doc: |
          If true, this archive includes a CRC16 checksum for the header.
      - id: is_text
        -orig-id: FTEXT
        type: b1
        doc: |
          If true, the file inside this archive is a text file from the
          compressor's point of view.
  extra_flags_deflate:
    seq:
      - id: compression_strength
        type: u1
        enum: compression_strengths
    enums:
      compression_strengths:
        2: best
        4: fast
  extras:
    seq:
      - id: len_subfields
        -orig-id: XLEN
        type: u2
      - id: subfields
        size: len_subfields
        type: subfields
  subfields:
    doc: |
      Container for many subfields, constrained by size of stream.
    seq:
      - id: entries
        type: subfield
        repeat: eos
  subfield:
    doc: |
      Every subfield follows the typical [TLV scheme](https://en.wikipedia.org/wiki/Type-length-value):

      * `id` serves the role of "T"ype
      * `len_data` serves the role of "L"ength
      * `data` serves the role of "V"alue

      This way it's possible for an arbitrary parser to skip over
      subfields it does not support.
    seq:
      - id: id
        -orig-id: SI1, SI2
        type: u2
        doc: Subfield ID, typically two ASCII letters.
      - id: len_data
        type: u2
      - id: data
        size: len_data
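
The TLV layout of the `subfield` type is what lets a reader skip subfields it does not recognize. The following Python sketch walks the contents of the extra field (the `len_subfields` bytes after XLEN) by hand. The function name `iter_subfields` and the subfield IDs in the usage example are made up for illustration. Unlike the specification above, which reads `id` as a `u2`, this sketch keeps the ID as two raw bytes so that ASCII letters stay readable.

```python
import struct


def iter_subfields(extra: bytes):
    """Yield (id, data) pairs from the payload of a gzip extra field."""
    pos = 0
    while pos < len(extra):
        if pos + 4 > len(extra):
            raise ValueError("truncated subfield header")
        # Two ID bytes (SI1, SI2) followed by a little-endian u2 length.
        subfield_id, len_data = struct.unpack_from("<2sH", extra, pos)
        pos += 4
        if pos + len_data > len(extra):
            raise ValueError("subfield data runs past the end of the extra field")
        yield subfield_id, extra[pos:pos + len_data]
        # The length alone is enough to jump to the next subfield.
        pos += len_data


if __name__ == "__main__":
    # Two hypothetical subfields: "AB" with 3 data bytes, "ZZ" with none.
    extra = b"AB" + struct.pack("<H", 3) + b"xyz" + b"ZZ" + struct.pack("<H", 0)
    for subfield_id, data in iter_subfields(extra):
        print(subfield_id, data)
```

A caller interested in one subfield ID can ignore every other pair the generator yields, without knowing how their data is structured.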