UTF-16 string with BOM: Python parsing library

A simple wrapper for reading a UTF-16 encoded string that starts with a byte order mark (BOM). The BOM indicates the endianness of the UTF-16 encoding, which can be either big-endian (BE) or little-endian (LE).

Use:

  • value to get the string value with BOM stripped, regardless of endianness.
  • is_be and is_le to check the endianness indicated by the BOM.
  • bom to check the raw byte order mark.
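
To make the meaning of these attributes concrete, here is a minimal standard-library sketch of the same behaviour. It is not the generated parser: the names `DecodedUtf16` and `decode_utf16_with_bom` are hypothetical and exist only for this illustration, and the fields simply mirror the attributes listed above.

```python
from typing import NamedTuple

BOM_BE = b"\xfe\xff"  # U+FEFF encoded as big-endian UTF-16
BOM_LE = b"\xff\xfe"  # U+FEFF encoded as little-endian UTF-16


class DecodedUtf16(NamedTuple):
    """Hypothetical result type; fields mirror the attributes listed above."""
    bom: bytes
    is_be: bool
    is_le: bool
    value: str


def decode_utf16_with_bom(raw: bytes) -> DecodedUtf16:
    """Split off the 2-byte BOM and decode the remainder in the matching byte order."""
    bom = raw[:2]
    if bom not in (BOM_BE, BOM_LE):
        # Assumption: a plain ValueError stands in for the parser's validation error.
        raise ValueError(f"not a UTF-16 byte order mark: {bom!r}")
    is_be = bom == BOM_BE
    codec = "utf-16-be" if is_be else "utf-16-le"
    return DecodedUtf16(bom, is_be, not is_be, raw[2:].decode(codec))


result = decode_utf16_with_bom(b"\xff\xfeh\x00i\x00")
print(result.value, result.is_le)  # prints: hi True
```

Exactly one of `is_be` and `is_le` is true for any accepted input, and `value` never contains the BOM itself.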

KS implementation details

License: CC0-1.0
Minimal Kaitai Struct required: 0.9

References

This page hosts a formal specification of UTF-16 string with BOM using Kaitai Struct. This specification can be automatically translated into a variety of programming languages to get a parsing library.

Usage

Runtime library

All Python code generated by Kaitai Struct depends on the Kaitai Struct runtime library for Python. You must add this dependency to your project before you can parse or serialize any data.

The Python runtime library can be installed from PyPI:

python3 -m pip install kaitaistruct

Code

Parse a local file and get the structure in memory:

from utf16_with_bom import Utf16WithBom

data = Utf16WithBom.from_file("path/to/local/file.bin")

Or parse the structure from a bytes object:

from kaitaistruct import KaitaiStream, BytesIO

raw = b"\x00\x01\x02..."
data = Utf16WithBom(KaitaiStream(BytesIO(raw)))
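
The parser rejects input that does not begin with a valid BOM, so real input has to start with one. The sketch below shows one way to build small sample inputs with the standard library; the variable names and the sample text are arbitrary choices made for this illustration.

```python
import codecs

text = "hello"  # arbitrary sample text, chosen for illustration

# A BOM followed by the text encoded in the matching byte order.
raw_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
raw_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(raw_le[:4])  # prints: b'\xff\xfeh\x00'
print(raw_be[:4])  # prints: b'\xfe\xff\x00h'
```

Either `raw_le` or `raw_be` can take the place of `raw` in the snippet above.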

After that, you can read various attributes of the structure, for example:

data.bom # => The byte order mark (BOM) is a special marker at the beginning of the
# string that indicates the endianness of the UTF-16 encoding. The
# character U+FEFF is used as the BOM, and its byte representation differs
# based on endianness:
#
# * For big-endian (BE) UTF-16, it's `[0xFE, 0xFF]`
# * For little-endian (LE) UTF-16, it's `[0xFF, 0xFE]`
#
# This implementation checks for the presence of a valid BOM and strips it
# from the resulting string value.

data.is_be # => True if the byte order mark indicates big-endian UTF-16 encoding.
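
The two BOM byte sequences described above can be checked with the standard library alone. The sketch below does not use the generated parser. It only confirms how U+FEFF is encoded in each byte order, and cross-checks the result against Python's generic `utf-16` codec, which also reads the BOM and removes it from the decoded text.

```python
import codecs

# U+FEFF serialised in each byte order yields the two accepted BOM values.
assert "\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE == b"\xfe\xff"
assert "\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE == b"\xff\xfe"

# The generic "utf-16" codec picks the byte order from the BOM and drops it.
be_text = (codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")).decode("utf-16")
le_text = (codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")).decode("utf-16")

print(be_text == le_text == "hi")  # prints: True
```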

Python source code to parse UTF-16 string with BOM

utf16_with_bom.py

# This is a generated file! Please edit source .ksy file and use kaitai-struct-compiler to rebuild
# type: ignore

import kaitaistruct
from kaitaistruct import KaitaiStruct, KaitaiStream, BytesIO


if getattr(kaitaistruct, 'API_VERSION', (0, 9)) < (0, 11):
    raise Exception("Incompatible Kaitai Struct Python API: 0.11 or later is required, but you have %s" % (kaitaistruct.__version__))

class Utf16WithBom(KaitaiStruct):
    """A simple wrapper which allows to read a UTF-16 encoded string that starts
    with a byte order mark (BOM). The BOM indicates the endianness of the UTF-16
    encoding, which can be either big-endian (BE) or little-endian (LE).
    
    Use:
    
    * `value` to get the string value with BOM stripped, regardless of endianness.
    * `is_be` and `is_le` to check the endianness indicated by the BOM.
    * `bom` to check the raw byte order mark.
    
    .. seealso::
       - https://en.wikipedia.org/wiki/Byte_order_mark
    """
    def __init__(self, _io, _parent=None, _root=None):
        super(Utf16WithBom, self).__init__(_io)
        self._parent = _parent
        self._root = _root or self
        self._read()

    def _read(self):
        self.bom = self._io.read_bytes(2)
        if not ((self.bom == b"\xFE\xFF") or (self.bom == b"\xFF\xFE")):
            raise kaitaistruct.ValidationNotAnyOfError(self.bom, self._io, u"/seq/0")
        if self.is_be:
            pass
            self.str_be = (self._io.read_bytes_full()).decode(u"UTF-16BE")

        if self.is_le:
            pass
            self.str_le = (self._io.read_bytes_full()).decode(u"UTF-16LE")



    def _fetch_instances(self):
        pass
        if self.is_be:
            pass

        if self.is_le:
            pass


    @property
    def is_be(self):
        """True if the byte order mark indicates big-endian UTF-16 encoding."""
        if hasattr(self, '_m_is_be'):
            return self._m_is_be

        self._m_is_be = self.bom == b"\xFE\xFF"
        return getattr(self, '_m_is_be', None)

    @property
    def is_le(self):
        """True if the byte order mark indicates little-endian UTF-16 encoding."""
        if hasattr(self, '_m_is_le'):
            return self._m_is_le

        self._m_is_le = self.bom == b"\xFF\xFE"
        return getattr(self, '_m_is_le', None)

    @property
    def value(self):
        """The string value with BOM stripped, regardless of endianness."""
        if hasattr(self, '_m_value'):
            return self._m_value

        self._m_value = (self.str_be if self.is_be else self.str_le)
        return getattr(self, '_m_value', None)