LArPix raw HDF5 Format¶
This is an alternative to the ``larpix.format.hdf5format`` format that allows for much faster conversion to file at the expense of human readability.
To use, pass a list of bytestring messages into the to_rawfile() method:
from larpix.format.rawhdf5format import to_rawfile, from_rawfile
msgs = [b'this is a test message', b'this is a different message']
to_rawfile('raw.h5', msgs)
To access the data in the file, the inverse method from_rawfile() is used:
rd = from_rawfile('raw.h5')
rd['msgs'] # [b'this is a test message', b'this is a different message']
Messages may be received from multiple io_group sources. In this case, a
per-message header with the io_group can be specified as a list of integers of
the same length as the msgs list and passed into the file at the same time:
msgs = [b'message from 1', b'message from 2']
io_groups = [1, 2]
to_rawfile('raw.h5', msgs=msgs, msg_headers={'io_groups': io_groups})
rd = from_rawfile('raw.h5')
rd['msgs'] # [b'message from 1', b'message from 2']
rd['msg_headers']['io_groups'] # [1, 2]
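Because the header lists are aligned index-for-index with rd['msgs'], messages can be regrouped by source after reading. The following is a minimal sketch: group_by_io_group is a hypothetical helper, not part of larpix, and it operates on a plain dict shaped like the from_rawfile() return value shown above.

```python
from collections import defaultdict


def group_by_io_group(rd):
    """Group message bytestrings by their io_group header value.

    `rd` is assumed to look like the dict returned by from_rawfile():
    rd['msgs'] is a list of bytestrings and rd['msg_headers']['io_groups']
    is a list of integers of the same length.
    """
    grouped = defaultdict(list)
    for io_group, msg in zip(rd['msg_headers']['io_groups'], rd['msgs']):
        grouped[int(io_group)].append(msg)
    return dict(grouped)


# Stand-in for the result of from_rawfile('raw.h5'); the values are made up
rd = {
    'msgs': [b'message from 1', b'message from 2', b'another from 1'],
    'msg_headers': {'io_groups': [1, 2, 1]},
}
by_group = group_by_io_group(rd)
```

Message order within each group follows the order in the file.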
File versioning¶
Some version validation is included with the file format through
the version and io_version file metadata. When creating a new file, a
file format version can be provided with the version keyword argument as
a string formatted 'major.minor':
to_rawfile('raw_v0_0.h5', version='0.0')
Subsequent writes to the file will only occur if the requested file version and the existing file version are compatible. Incompatibility occurs if the major version numbers differ or if the file's minor version number is less than the requested minor version number:
to_rawfile('raw_v0_0.h5', version='0.1') # fails due to minor version incompatibility
to_rawfile('raw_v0_0.h5', version='1.0') # fails due to major version incompatibility
By default, the most recent file version is used.
On the file read side, a version number can be requested and the file will be parsed assuming a specific version:
from_rawfile('raw_v0_0.h5', version='0.0')
from_rawfile('raw_v0_0.h5', version='0.1') # fails due to minor version incompatibility
from_rawfile('raw_v0_0.h5', version='1.0') # fails due to major version incompatibility
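The compatibility rule described above can be written out as a small check. This is only a sketch of the rule as stated in the text, not the library's actual implementation; parse_version and is_compatible are hypothetical names.

```python
def parse_version(version):
    """Split a 'major.minor' string into a (major, minor) tuple of ints."""
    major, minor = version.split('.')
    return int(major), int(minor)


def is_compatible(file_version, requested_version):
    """Return True if `requested_version` can be used with a file of `file_version`.

    Rule as described in the text: the major versions must match, and the
    file's minor version must not be less than the requested minor version.
    """
    file_major, file_minor = parse_version(file_version)
    req_major, req_minor = parse_version(requested_version)
    return file_major == req_major and file_minor >= req_minor


# Same cases as the examples above, for a file written with version '0.0'
same_version = is_compatible('0.0', '0.0')
newer_minor = is_compatible('0.0', '0.1')
newer_major = is_compatible('0.0', '1.0')
```

Under this rule a file with a higher minor version than the one requested remains readable, for example is_compatible('0.1', '0.0').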
The io_version optional metadata marks the version of the io message format
that was used to encode the message bytestrings. If present as a keyword
argument when writing to the file, an AssertionError will be raised if
the io version is incompatible with the existing one stored in metadata:
to_rawfile('raw_io_v0_0.h5', io_version='0.0')
to_rawfile('raw_io_v0_0.h5', io_version='0.1') # fails due to minor version incompatibility
to_rawfile('raw_io_v0_0.h5', io_version='1.0') # fails due to major version incompatibility
A similar mechanism occurs when requesting an io version when reading from the file:
from_rawfile('raw_io_v0_0.h5', io_version='0.1') # fails due to minor version incompatibility
from_rawfile('raw_io_v0_0.h5', io_version='1.0') # fails due to major version incompatibility
from_rawfile('raw_io_v0_0.h5', io_version='0.0')
I think it is worthwhile to further clarify the io_version and the file
version, as this might be confusing. In particular, you might be asking,
“What io versions are compatible with what file versions?” The rawhdf5format
is a way of wrapping raw binary data into a format that only requires HDF5 to
parse. The file version represents this HDF5 structuring (the hdf5 dataset
formats, file metadata, what message header data is available), whereas the
io_version represents the formatting of the binary data that the file
contains. So the answer to that question is: all file versions are compatible
with all io versions.
Converting to other file types¶
This format was created with a specific application in mind: providing a
temporary but fast file format for PACMAN messages. In that case, the file
can be converted to the standard larpix.format.hdf5format as follows:
from larpix.format.rawhdf5format import from_rawfile
from larpix.format.pacman_msg_format import parse
from larpix.format.hdf5format import to_file

rd = from_rawfile('raw.h5')
pkts = list()
# parse each raw PACMAN message into packets, tagging them with their io_group
for io_group, msg in zip(rd['msg_headers']['io_groups'], rd['msgs']):
    pkts.extend(parse(msg, io_group=io_group))
to_file('new_filename.h5', packet_list=pkts)
As always, the most efficient means of accessing the data is to operate on the data itself, rather than converting between types.
Metadata (v0.0)¶
The group meta contains file metadata stored as attributes:
created: float, unix timestamp since the 1970 epoch in seconds indicating when the file was first created
modified: float, unix timestamp since the 1970 epoch in seconds indicating when the file was last written to
version: str, file version, formatted as 'major.minor'
io_version: str, optional version for message bytestring encoding, formatted as 'major.minor'
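Since created and modified are plain unix timestamps, they can be turned into readable dates with the standard library. A short sketch, with made-up timestamp values standing in for the metadata read from a file; to_utc_datetime is a hypothetical helper.

```python
from datetime import datetime, timezone


def to_utc_datetime(timestamp):
    """Convert a unix timestamp (seconds since the 1970 epoch) to a UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


# Made-up values standing in for the 'created' and 'modified' metadata
meta = {'created': 1600000000.0, 'modified': 1600000060.5}
created = to_utc_datetime(meta['created'])
modified = to_utc_datetime(meta['modified'])
elapsed_seconds = (modified - created).total_seconds()
```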
Datasets (v0.0)¶
The hdf5 format contains two datasets msgs and msg_headers:
msgs: shape (N,); variable-length uint1 arrays encoding each message bytestring
msg_headers: shape (N,); numpy structured array with fields:
    'io_group': uint1 representing the io_group associated with each message
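To illustrate what storing each message as a variable-length array of unsigned integers means, here is a sketch that uses only the standard library array module rather than HDF5 and numpy. It assumes that uint1 denotes a one-byte unsigned integer; encode_msg and decode_msg are hypothetical helpers and not part of larpix.

```python
from array import array


def encode_msg(msg):
    """Represent a bytestring as an array of unsigned one-byte integers."""
    return array('B', msg)  # assumption: 'uint1' means one unsigned byte per element


def decode_msg(encoded):
    """Recover the original bytestring from its unsigned-byte array."""
    return encoded.tobytes()


msgs = [b'this is a test message', b'short']
encoded = [encode_msg(m) for m in msgs]   # arrays of different lengths
decoded = [decode_msg(e) for e in encoded]
```

Each row holds exactly as many elements as its message has bytes, which is why the rows have variable length.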
larpix.format.rawhdf5format.latest_version = '0.0'¶
Most up-to-date raw larpix hdf5 format version.
larpix.format.rawhdf5format.dataset_dtypes¶
Description of the datasets and their dtypes used in each version of the raw larpix hdf5 format.
Structured as dataset_dtypes['<version>']['<dataset>'] = <dtype>.
larpix.format.rawhdf5format.to_rawfile(filename, msgs=None, version=None, msg_headers=None, io_version=None)¶
Write a list of bytestring messages to an hdf5 file. If the file exists, the messages will be appended to the end of the dataset.
Parameters:
- filename – desired filename for the file to write or update
- msgs – iterable of variable-length bytestrings to write to the file. If None specified, will only create file and update metadata.
- version – a string of major.minor version desired. If None specified, will use the latest file format version (if new file) or version in file (if updating an existing file).
- msg_headers – a dict of iterables to associate with each message header. Iterables must be same length as msgs. If None specified, will use a default value of 0 for each message. Keys are dtype field names specified in dataset_dtypes[version]['msg_headers'].names
- io_version – optional metadata to associate with file corresponding to the io format version of the bytestring messages. Throws RuntimeError if version incompatibility encountered in an existing file.
larpix.format.rawhdf5format.len_rawfile(filename, attempts=1)¶
Check the total number of messages in a file
Parameters:
- filename – filename to check
- attempts – a parameter only relevant if file is being actively written to by another process, specifies number of refreshes to try if a synchronized state between the datasets is not achieved. A value less than 0 busy blocks until a synchronized state is achieved. A value greater than 0 tries to achieve synchronization a max of attempts before throwing a RuntimeError. And a value of 0 does not attempt to synchronize (not recommended).
Returns: int number of messages in file
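The three regimes of the attempts parameter can be pictured as a retry loop. The sketch below is an illustration of the documented behavior only, not the library's implementation; wait_for_sync and FakeFile are hypothetical names, and the callables stand in for whatever the library does to compare and refresh the datasets.

```python
def wait_for_sync(is_synced, refresh, attempts=1):
    """Hypothetical retry loop mirroring the documented `attempts` behavior.

    is_synced: callable returning True when the datasets agree
    refresh:   callable that re-reads the file state
    Returns the number of refreshes performed.
    """
    if attempts == 0:
        return 0  # do not attempt to synchronize
    refreshes = 0
    while not is_synced():
        # attempts < 0 never reaches this limit, so the loop busy blocks
        if attempts > 0 and refreshes >= attempts:
            raise RuntimeError(
                'datasets not synchronized after %d attempts' % attempts)
        refresh()
        refreshes += 1
    return refreshes


class FakeFile:
    """Stand-in for a file being written: in sync after `needed` refreshes."""

    def __init__(self, needed):
        self.needed = needed
        self.seen = 0

    def is_synced(self):
        return self.seen >= self.needed

    def refresh(self):
        self.seen += 1


# attempts < 0: keep refreshing until the datasets agree
f = FakeFile(needed=2)
n_blocking = wait_for_sync(f.is_synced, f.refresh, attempts=-1)

# attempts > 0: give up with a RuntimeError once the limit is reached
g = FakeFile(needed=2)
try:
    wait_for_sync(g.is_synced, g.refresh, attempts=1)
    timed_out = False
except RuntimeError:
    timed_out = True
```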
larpix.format.rawhdf5format.from_rawfile(filename, start=None, end=None, version=None, io_version=None, msg_headers_only=False, mask=None, attempts=1)¶
Read a chunk of bytestring messages from an existing file
Parameters:
- filename – filename to read bytestrings from
- start – index for the start position when reading from the file (default = None). If a value less than 0 is specified, index is relative to the end of the file. If None is specified, data is read from the start of the file. If a mask is specified, does nothing.
- end – index for the end position when reading from the file (default = None). If a value less than 0 is specified, index is relative to the end of the file. If None is specified, data is read until the end of the file. If a mask is specified, does nothing.
- version – required version compatibility. If None specified, uses the version stored in the file metadata
- io_version – required io version compatibility. If None specified, does not check the io_version file metadata
- msg_headers_only – optional flag to only load header information and not message bytestrings ('msgs' value in return dict will be None if msg_headers_only=True)
- mask – boolean mask alternative to start and end chunk specification to indicate specific file rows to load. Boolean 1D array with length equal to len_rawfile(filename)
- attempts – a parameter only relevant if file is being actively written to by another process, specifies number of refreshes to try if a synchronized state between the datasets is not achieved. A value less than 0 busy blocks until a synchronized state is achieved. A value greater than 0 tries to achieve synchronization a max of attempts before throwing a RuntimeError. And a value of 0 does not attempt to synchronize (not recommended).
Returns: dict with keys for 'created', 'modified', 'version', and 'io_version' metadata, along with 'msgs' (a list of bytestring messages) and 'msg_headers' (a dict with message header field name: list of message header field data, 1 per message)
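The start/end pair and the mask are two ways of selecting rows. The sketch below builds both kinds of selection in plain Python; chunk_bounds and io_group_mask are hypothetical helpers, and the sketch assumes that end behaves like an exclusive slice bound, which the text above does not state.

```python
def chunk_bounds(total, chunk_size):
    """Yield (start, end) index pairs covering `total` rows in order.

    Assumption: `end` is exclusive, as in a Python slice.
    """
    for start in range(0, total, chunk_size):
        yield start, min(start + chunk_size, total)


def io_group_mask(io_groups, wanted):
    """Boolean mask, one entry per row, selecting rows whose io_group equals `wanted`."""
    return [int(g) == wanted for g in io_groups]


bounds = list(chunk_bounds(total=10, chunk_size=4))
mask = io_group_mask([1, 2, 1, 3], wanted=1)

# With larpix installed, these could be combined with the functions documented
# above, for example (untested here; a numpy boolean array may be required
# instead of a list for `mask`):
#   total = len_rawfile('raw.h5')
#   for start, end in chunk_bounds(total, 1000):
#       rd = from_rawfile('raw.h5', start=start, end=end)
#   hdr = from_rawfile('raw.h5', msg_headers_only=True)
#   rd = from_rawfile('raw.h5', mask=io_group_mask(hdr['msg_headers']['io_groups'], 1))
```

Loading the headers first with msg_headers_only=True keeps the mask construction cheap, because the message bytestrings are not read until the second call.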