LArPix raw HDF5 Format

This is an alternative to the larpix.format.hdf5format format that allows much faster conversion to file at the expense of human readability.

To use, pass a list of bytestring messages into the to_rawfile() method:

from larpix.format.rawhdf5format import to_rawfile, from_rawfile

msgs = [b'this is a test message', b'this is a different message']
to_rawfile('raw.h5', msgs)

To access the data in the file, the inverse method from_rawfile() is used:

rd = from_rawfile('raw.h5')
rd['msgs'] # [b'this is a test message', b'this is a different message']

Messages may be received from multiple io_group sources. In this case, a per-message header with the io_group can be specified as a list of integers of the same length as the msgs list and passed into the file at the same time:

msgs = [b'message from 1', b'message from 2']
io_groups = [1, 2]
to_rawfile('raw.h5', msgs=msgs, msg_headers={'io_groups': io_groups})

rd = from_rawfile('raw.h5')
rd['msgs'] # [b'message from 1', b'message from 2']
rd['msg_headers']['io_groups'] # [1, 2]
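
Because msgs and each header list are aligned by index, the returned dict can be regrouped per source with plain Python. The following is a minimal sketch: group_by_io_group is a hypothetical helper (not part of larpix), and rd is a hand-built stand-in for the dict returned by from_rawfile, so the snippet runs without a file on disk.

```python
from collections import defaultdict

def group_by_io_group(rd):
    """Map each io_group value to the list of messages received from it."""
    grouped = defaultdict(list)
    for io_group, msg in zip(rd['msg_headers']['io_groups'], rd['msgs']):
        grouped[io_group].append(msg)
    return dict(grouped)

# Stand-in for the dict returned by from_rawfile('raw.h5') (assumed shape)
rd = {
    'msgs': [b'message from 1', b'message from 2', b'another from 1'],
    'msg_headers': {'io_groups': [1, 2, 1]},
}
by_group = group_by_io_group(rd)
```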

File versioning

Some version validation is included with the file format through the version and io_version file metadata. When creating a new file, a file format version can be provided with the version keyword argument as a string formatted 'major.minor':

to_rawfile('raw_v0_0.h5', version='0.0')

Subsequent writes to the file will only occur if the requested file version and the existing file version are compatible. Incompatibility occurs if the major version numbers differ or if the existing file's minor version number is less than the requested one:

to_rawfile('raw_v0_0.h5', version='0.1') # fails due to minor version incompatibility
to_rawfile('raw_v0_0.h5', version='1.0') # fails due to major version incompatibility

By default, the most recent file version is used.
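
The compatibility rule described above can be written down in a few lines. This is only a sketch of the rule as stated in the text, not the library's own check; parse_version and is_compatible are hypothetical names introduced for illustration.

```python
def parse_version(version):
    """Split a 'major.minor' string into a (major, minor) tuple of ints."""
    major, minor = version.split('.')
    return int(major), int(minor)

def is_compatible(file_version, requested_version):
    """True if the major versions match and the file's minor version
    is at least the requested minor version."""
    file_major, file_minor = parse_version(file_version)
    req_major, req_minor = parse_version(requested_version)
    return file_major == req_major and file_minor >= req_minor

print(is_compatible('0.0', '0.0'))  # True
print(is_compatible('0.0', '0.1'))  # False: file minor version is too old
print(is_compatible('0.0', '1.0'))  # False: major versions differ
```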

On the file read side, a version number can be requested and the file will be parsed assuming a specific version:

from_rawfile('raw_v0_0.h5', version='0.0')
from_rawfile('raw_v0_0.h5', version='0.1') # fails due to minor version incompatibility
from_rawfile('raw_v0_0.h5', version='1.0') # fails due to major version incompatibility

The io_version optional metadata marks the version of the io message format that was used to encode the message bytestrings. If present as a keyword argument when writing to the file, an AssertionError will be raised if the io version is incompatible with the existing one stored in metadata:

to_rawfile('raw_io_v0_0.h5', io_version='0.0')
to_rawfile('raw_io_v0_0.h5', io_version='0.1') # fails due to minor version incompatibility
to_rawfile('raw_io_v0_0.h5', io_version='1.0') # fails due to major version incompatibility

A similar mechanism occurs when requesting an io version when reading from the file:

from_rawfile('raw_io_v0_0.h5', io_version='0.1') # fails due to minor version incompatibility
from_rawfile('raw_io_v0_0.h5', io_version='1.0') # fails due to major version incompatibility
from_rawfile('raw_io_v0_0.h5', io_version='0.0')

I think it is worthwhile to further clarify the difference between the io_version and the file version, as this might be confusing. In particular, you might be asking, “What io versions are compatible with what file versions?” The rawhdf5format is a way of wrapping raw binary data into a format that only requires HDF5 to parse. The file version represents this HDF5 structuring (the hdf5 dataset formats, file metadata, what message header data is available), whereas the io_version represents the formatting of the binary data that the file contains. So the answer to that question is: all file versions are compatible with all io versions.

Converting to other file types

This format was created with a specific application in mind: providing a temporary but fast file format for PACMAN messages. In this case, the file can be converted to the standard larpix.format.hdf5format as follows:

from larpix.format.rawhdf5format import from_rawfile
from larpix.format.pacman_msg_format import parse
from larpix.format.hdf5format import to_file

rd = from_rawfile('raw.h5')
pkts = []
# Parse each raw PACMAN message into packets, tagged with the message's io_group
for io_group, msg in zip(rd['msg_headers']['io_groups'], rd['msgs']):
    pkts.extend(parse(msg, io_group=io_group))
to_file('new_filename.h5', packet_list=pkts)

As always, however, the most efficient means of accessing the data is to operate on the data itself, rather than converting between types.
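
For files too large to hold in memory, the start and end arguments of from_rawfile make it possible to read the file in chunks. Below is a minimal sketch of the index bookkeeping only. It assumes end is exclusive, as in a Python slice, which the text does not state; chunk_bounds is a hypothetical helper, not part of larpix.

```python
def chunk_bounds(total, chunk_size):
    """Yield (start, end) index pairs that cover range(total) in order.

    end is treated as exclusive (an assumption, mirroring Python slices).
    """
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)

bounds = list(chunk_bounds(total=10, chunk_size=4))
print(bounds)  # [(0, 4), (4, 8), (8, 10)]

# With larpix installed, total would come from len_rawfile('raw.h5') and
# each pair would be passed on as from_rawfile('raw.h5', start=start, end=end).
```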

Metadata (v0.0)

The group meta contains file metadata stored as attributes:

  • created: float, unix timestamp since the 1970 epoch in seconds indicating when file was first created
  • modified: float, unix timestamp since the 1970 epoch in seconds indicating when the file was last written to
  • version: str, file version, formatted as 'major.minor'
  • io_version: str, optional version for message bytestring encoding, formatted as 'major.minor'
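
The created and modified attributes are plain unix timestamps, so they can be converted to readable dates with the standard library. A minimal sketch, in which describe_metadata is a hypothetical helper and rd is a hand-built stand-in for the dict returned by from_rawfile (the timestamp values are made up for illustration):

```python
from datetime import datetime, timezone

def describe_metadata(rd):
    """Return the file metadata with timestamps rendered as ISO 8601 UTC strings."""
    created = datetime.fromtimestamp(rd['created'], tz=timezone.utc)
    modified = datetime.fromtimestamp(rd['modified'], tz=timezone.utc)
    return {
        'created': created.isoformat(),
        'modified': modified.isoformat(),
        'version': rd['version'],
        # io_version is optional metadata, so fall back to None if absent
        'io_version': rd.get('io_version'),
    }

# Stand-in metadata (illustrative values, not taken from a real file)
rd = {'created': 0.0, 'modified': 86400.0, 'version': '0.0'}
info = describe_metadata(rd)
print(info['created'])   # 1970-01-01T00:00:00+00:00
print(info['modified'])  # 1970-01-02T00:00:00+00:00
```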

Datasets (v0.0)

The hdf5 format contains two datasets msgs and msg_headers:

  • msgs: shape (N,); variable-length arrays of 1-byte unsigned integers (uint8) encoding each message bytestring

  • msg_headers: shape (N,); numpy structured array with fields:

    • 'io_groups': 1-byte unsigned integer (uint8) representing the io_group associated with each message
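
To make the msgs encoding concrete: a bytestring maps one-to-one onto a sequence of 1-byte unsigned integers, and the mapping is lossless. The sketch below shows that round trip in plain Python; encode_msg and decode_msg are hypothetical names, and the actual file stores these values in HDF5 arrays rather than Python lists.

```python
def encode_msg(msg):
    """Turn a bytestring into a list of integers in the range 0-255."""
    return list(msg)

def decode_msg(values):
    """Rebuild the original bytestring from its integer values."""
    return bytes(values)

msg = b'\x01\xffabc'
encoded = encode_msg(msg)
print(encoded)                     # [1, 255, 97, 98, 99]
print(decode_msg(encoded) == msg)  # True
```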

larpix.format.rawhdf5format.latest_version = '0.0'

Most up-to-date raw larpix hdf5 format version.

larpix.format.rawhdf5format.dataset_dtypes

Description of the datasets and their dtypes used in each version of the raw larpix hdf5 format.

Structured as dataset_dtypes['<version>']['<dataset>'] = <dtype>.

larpix.format.rawhdf5format.to_rawfile(filename, msgs=None, version=None, msg_headers=None, io_version=None)

Write a list of bytestring messages to an hdf5 file. If the file exists, the messages will be appended to the end of the dataset.

Parameters:
  • filename – desired filename for the file to write or update
  • msgs – iterable of variable-length bytestrings to write to the file. If None specified, will only create file and update metadata.
  • version – a string of major.minor version desired. If None specified, will use the latest file format version (if new file) or version in file (if updating an existing file).
  • msg_headers – a dict of iterables to associate with each message header. Iterables must be same length as msgs. If None specified, will use a default value of 0 for each message. Keys are dtype field names specified in dataset_dtypes[version]['msg_headers'].names
  • io_version – optional metadata to associate with file corresponding to the io format version of the bytestring messages. Throws RuntimeError if version incompatibility encountered in an existing file.
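
The msg_headers rules above (one value per message, with 0 as the default) can be illustrated with a short sketch. This is not the library's code: fill_headers is a hypothetical helper, the field name 'io_groups' is assumed from the usage examples, and raising ValueError on a length mismatch is an assumption about how such an error might be reported.

```python
def fill_headers(msgs, msg_headers=None, field_names=('io_groups',)):
    """Return one list of header values per field, aligned with msgs.

    Fields missing from msg_headers default to 0 for every message.
    """
    n_msgs = len(msgs)
    msg_headers = msg_headers or {}
    filled = {}
    for name in field_names:
        if name in msg_headers:
            values = list(msg_headers[name])
            if len(values) != n_msgs:
                # Assumed error type; the text only says lengths must match
                raise ValueError(
                    'header %r has %d values for %d messages'
                    % (name, len(values), n_msgs))
        else:
            values = [0] * n_msgs
        filled[name] = values
    return filled

print(fill_headers([b'a', b'b']))                         # {'io_groups': [0, 0]}
print(fill_headers([b'a', b'b'], {'io_groups': [1, 2]}))  # {'io_groups': [1, 2]}
```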

larpix.format.rawhdf5format.len_rawfile(filename, attempts=1)

Check the total number of messages in a file

Parameters:
  • filename – filename to check
  • attempts – only relevant if the file is being actively written to by another process; specifies the number of refreshes to try if a synchronized state between the datasets is not achieved. A value less than 0 busy-blocks until a synchronized state is achieved. A value greater than 0 tries to achieve synchronization at most attempts times before throwing a RuntimeError. A value of 0 does not attempt to synchronize (not recommended).
Returns:

int number of messages in file
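
The three regimes of the attempts parameter can be sketched as a small retry loop. This only mirrors the behavior described above and is not the library's implementation; wait_for_sync is a hypothetical name, and is_synchronized stands in for whatever refresh-and-compare step the reader performs.

```python
def wait_for_sync(is_synchronized, attempts=1):
    """Retry is_synchronized() according to the documented attempts rules.

    attempts < 0: keep retrying until synchronized (busy block)
    attempts > 0: retry at most `attempts` times, then raise RuntimeError
    attempts == 0: do not try to synchronize at all
    """
    if attempts == 0:
        return False  # no synchronization attempted
    tries = 0
    while attempts < 0 or tries < attempts:
        if is_synchronized():
            return True
        tries += 1
    raise RuntimeError('not synchronized after %d attempts' % attempts)

# Fake check that reports a synchronized state on the third call
calls = {'n': 0}
def fake_check():
    calls['n'] += 1
    return calls['n'] >= 3

print(wait_for_sync(fake_check, attempts=-1))  # True, after three checks
```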

larpix.format.rawhdf5format.from_rawfile(filename, start=None, end=None, version=None, io_version=None, msg_headers_only=False, mask=None, attempts=1)

Read a chunk of bytestring messages from an existing file

Parameters:
  • filename – filename to read bytestrings from
  • start – index for the start position when reading from the file (default = None). If a value less than 0 is specified, the index is relative to the end of the file. If None is specified, data is read from the start of the file. Ignored if a mask is specified.
  • end – index for the end position when reading from the file (default = None). If a value less than 0 is specified, the index is relative to the end of the file. If None is specified, data is read until the end of the file. Ignored if a mask is specified.
  • version – required version compatibility. If None specified, uses the version stored in the file metadata
  • io_version – required io version compatibility. If None specified, does not check the io_version file metadata
  • msg_headers_only – optional flag to only load header information and not message bytestrings ('msgs' value in return dict will be None if msg_headers_only=True)
  • mask – boolean mask alternative to start and end chunk specification to indicate specific file rows to load. Boolean 1D array with length equal to len_rawfile(filename)
  • attempts – only relevant if the file is being actively written to by another process; specifies the number of refreshes to try if a synchronized state between the datasets is not achieved. A value less than 0 busy-blocks until a synchronized state is achieved. A value greater than 0 tries to achieve synchronization at most attempts times before throwing a RuntimeError. A value of 0 does not attempt to synchronize (not recommended).
Returns:

dict with keys for 'created', 'modified', 'version', and 'io_version' metadata, along with 'msgs' (a list of bytestring messages) and 'msg_headers' (a dict with message header field name: list of message header field data, 1 per message)
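
The interplay of start, end, and mask can be mimicked on an ordinary list. The sketch below assumes end is exclusive, as in a Python slice, which the parameter description does not state; select_rows is a hypothetical helper, not part of larpix.

```python
def select_rows(rows, start=None, end=None, mask=None):
    """Pick rows by boolean mask if one is given, otherwise by start/end."""
    if mask is not None:
        if len(mask) != len(rows):
            raise ValueError('mask must have one entry per row')
        # start and end are ignored when a mask is supplied
        return [row for row, keep in zip(rows, mask) if keep]
    # Negative indices count from the end, None means "no limit"
    return rows[start:end]

msgs = [b'm0', b'm1', b'm2', b'm3']
io_groups = [1, 2, 1, 2]

print(select_rows(msgs, start=-2))        # [b'm2', b'm3']
print(select_rows(msgs, start=1, end=3))  # [b'm1', b'm2']

# Keep only the messages from io_group 1; start is ignored here
mask = [group == 1 for group in io_groups]
print(select_rows(msgs, start=1, mask=mask))  # [b'm0', b'm2']
```

With the real reader, a similar mask could plausibly be built from a first pass with msg_headers_only=True and then passed to a second from_rawfile call through the mask argument.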