LArPix raw HDF5 Format
This is an alternative to the ``larpix.format.hdf5format`` format that allows for much faster conversion to file at the expense of human readability.
To use, pass a list of bytestring messages into the to_rawfile() method:
from larpix.format.rawhdf5format import to_rawfile, from_rawfile
msgs = [b'this is a test message', b'this is a different message']
to_rawfile('raw.h5', msgs)
To access the data in the file, the inverse method from_rawfile() is used:
rd = from_rawfile('raw.h5')
rd['msgs'] # [b'this is a test message', b'this is a different message']
Messages may be received from multiple io_group sources. In this case, a per-message header with io_group can be specified as a list of integers of the same length as the msgs list and passed into the file at the same time:
msgs = [b'message from 1', b'message from 2']
io_groups = [1, 2]
to_rawfile('raw.h5', msgs=msgs, msg_headers={'io_groups': io_groups})
rd = from_rawfile('raw.h5')
rd['msgs'] # [b'message from 1', b'message from 2']
rd['msg_headers']['io_groups'] # [1, 2]
File versioning
Some version validation is included with the file format through the version and io_version file metadata. When creating a new file, a file format version can be provided with the version keyword argument as a string formatted 'major.minor':
to_rawfile('raw_v0_0.h5', version='0.0')
Subsequent writes to the file will only occur if the requested file version and the existing file version are compatible. Incompatibility occurs if the major version numbers differ, or if the minor version number of the existing file is less than that of the requested file version:
to_rawfile('raw_v0_0.h5', version='0.1') # fails due to minor version incompatibility
to_rawfile('raw_v0_0.h5', version='1.0') # fails due to major version incompatibility
By default, the most recent file version is used.
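The compatibility rule above can be sketched as a small standalone check. The names parse_version and is_compatible are hypothetical and only illustrate the rule as it is described here; the actual check performed inside to_rawfile may differ in detail:

```python
def parse_version(version):
    """Split a 'major.minor' string into a (major, minor) tuple of ints."""
    major, minor = version.split('.')
    return int(major), int(minor)


def is_compatible(existing, requested):
    """Return True if a file at version `existing` satisfies `requested`.

    Hypothetical helper illustrating the documented rule; it is not part
    of larpix.format.rawhdf5format.
    """
    existing_major, existing_minor = parse_version(existing)
    requested_major, requested_minor = parse_version(requested)
    # Major versions must match exactly.
    if existing_major != requested_major:
        return False
    # The existing minor version must not be older than the requested one.
    return existing_minor >= requested_minor


print(is_compatible('0.0', '0.0'))  # True
print(is_compatible('0.0', '0.1'))  # False
print(is_compatible('0.0', '1.0'))  # False
```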
On the file read side, a version number can be requested and the file will be parsed assuming a specific version:
from_rawfile('raw_v0_0.h5', version='0.0')
from_rawfile('raw_v0_0.h5', version='0.1') # fails due to minor version incompatibility
from_rawfile('raw_v0_0.h5', version='1.0') # fails due to major version incompatibility
The optional io_version metadata marks the version of the io message format that was used to encode the message bytestrings. If present as a keyword argument when writing to the file, an AssertionError will be raised if the io version is incompatible with the existing one stored in metadata:
to_rawfile('raw_io_v0_0.h5', io_version='0.0')
to_rawfile('raw_io_v0_0.h5', io_version='0.1') # fails due to minor version incompatibility
to_rawfile('raw_io_v0_0.h5', io_version='1.0') # fails due to major version incompatibility
A similar mechanism occurs when requesting an io version when reading from the file:
from_rawfile('raw_io_v0_0.h5', io_version='0.1') # fails due to minor version incompatibility
from_rawfile('raw_io_v0_0.h5', io_version='1.0') # fails due to major version incompatibility
from_rawfile('raw_io_v0_0.h5', io_version='0.0')
I think it is worthwhile to further clarify the io_version and the file version, as this might be confusing. In particular, you might be asking, “What io versions are compatible with what file versions?” The rawhdf5format is a way of wrapping raw binary data into a format that only requires HDF5 to parse. The file version represents this HDF5 structuring (the hdf5 dataset formats, file metadata, what message header data is available), whereas the io_version represents the formatting of the binary data that the file contains. So the answer to that question is: all file versions are compatible with all io versions.
Converting to other file types
This format was created with a specific application in mind: to provide a temporary but fast file format for PACMAN messages. When used in this case, the file can be converted to the standard larpix.format.hdf5format as follows:
from larpix.format.rawhdf5format import from_rawfile
from larpix.format.pacman_msg_format import parse
from larpix.format.hdf5format import to_file
rd = from_rawfile('raw.h5')
pkts = list()
# parse each raw message into packets, tagged with that message's io_group
for io_group, msg in zip(rd['msg_headers']['io_groups'], rd['msgs']):
    pkts.extend(parse(msg, io_group=io_group))
to_file('new_filename.h5', packet_list=pkts)
As always, however, the most efficient means of accessing the data is to operate on the data itself, rather than converting between types.
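For large files, one option is to convert in chunks using the start and end arguments of from_rawfile rather than loading every message at once. The sketch below only computes the index pairs. The helper chunk_bounds is hypothetical, and it assumes that end is exclusive in the same way as a Python slice, which is not stated here:

```python
def chunk_bounds(total, chunk_size):
    """Yield (start, end) index pairs covering range(total) in order."""
    if chunk_size <= 0:
        raise ValueError('chunk_size must be positive')
    for start in range(0, total, chunk_size):
        # The final chunk is clipped to the total number of messages.
        yield start, min(start + chunk_size, total)


# Example with a hypothetical file holding 10 messages, read 4 at a time.
print(list(chunk_bounds(10, 4)))  # [(0, 4), (4, 8), (8, 10)]

# With larpix installed, each pair could be passed along as
#   rd = from_rawfile('raw.h5', start=start, end=end)
# with the total taken from len_rawfile('raw.h5').
```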
Metadata (v0.0)
The group meta contains file metadata stored as attributes:
- created: float, unix timestamp since the 1970 epoch in seconds indicating when the file was first created
- modified: float, unix timestamp since the 1970 epoch in seconds indicating when the file was last written to
- version: str, file version, formatted as 'major.minor'
- io_version: str, optional version for message bytestring encoding, formatted as 'major.minor'
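As a brief illustration of working with these attributes, the sketch below converts the two timestamps to timezone-aware datetimes and handles a missing io_version. The meta dict and its values are hypothetical stand-ins for what would be read from a real file:

```python
from datetime import datetime, timezone

# Hypothetical metadata values, shaped like the attributes described above.
meta = {'created': 1600000000.0, 'modified': 1600000360.5, 'version': '0.0'}

# Unix timestamps are seconds since the 1970 epoch, so interpret them as UTC.
created = datetime.fromtimestamp(meta['created'], tz=timezone.utc)
modified = datetime.fromtimestamp(meta['modified'], tz=timezone.utc)

# Seconds between file creation and the most recent write.
elapsed = meta['modified'] - meta['created']

# io_version is optional, so fall back to None when it is absent.
io_version = meta.get('io_version')

print(created.isoformat(), modified.isoformat())
print(elapsed, io_version)
```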
Datasets (v0.0)
The hdf5 format contains two datasets, msgs and msg_headers:
- msgs: shape (N,); variable-length uint1 arrays encoding each message bytestring
- msg_headers: shape (N,); numpy structured array with fields:
  - 'io_group': uint1 representing the io_group associated with each message
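To make the msgs encoding concrete, the sketch below shows a bytestring round-tripping through an array of unsigned 1-byte integers. It uses the standard library array module purely for illustration; the file format itself stores these values through HDF5, and that layer is not reproduced here:

```python
from array import array

msg = b'this is a test message'

# Encode: one unsigned 1-byte integer per byte of the message.
encoded = array('B', msg)
print(len(encoded), encoded[0])  # 22 116

# Decode: convert the integer array back to the original bytestring.
decoded = encoded.tobytes()
assert decoded == msg
```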
larpix.format.rawhdf5format.latest_version = '0.0'
Most up-to-date raw larpix hdf5 format version.

larpix.format.rawhdf5format.dataset_dtypes
Description of the datasets and their dtypes used in each version of the raw larpix hdf5 format. Structured as dataset_dtypes['<version>']['<dataset>'] = <dtype>.

larpix.format.rawhdf5format.to_rawfile(filename, msgs=None, version=None, msg_headers=None, io_version=None)
Write a list of bytestring messages to an hdf5 file. If the file exists, the messages will be appended to the end of the dataset.
Parameters:
- filename – desired filename for the file to write or update
- msgs – iterable of variable-length bytestrings to write to the file. If None specified, will only create file and update metadata.
- version – a string of major.minor version desired. If None specified, will use the latest file format version (if new file) or version in file (if updating an existing file).
- msg_headers – a dict of iterables to associate with each message header. Iterables must be same length as msgs. If None specified, will use a default value of 0 for each message. Keys are dtype field names specified in dataset_dtypes[version]['msg_headers'].names
- io_version – optional metadata to associate with file corresponding to the io format version of the bytestring messages. Throws RuntimeError if version incompatibility encountered in an existing file.
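
The msg_headers behaviour described above (equal lengths, default value of 0) can be sketched in plain Python. The helper fill_headers and its arguments are hypothetical, and defaulting each missing field individually is an assumption; this is not the code used by to_rawfile:

```python
def fill_headers(n_msgs, field_names, msg_headers=None):
    """Return a dict with one list of length n_msgs per header field.

    Hypothetical helper; `field_names` stands in for
    dataset_dtypes[version]['msg_headers'].names.
    """
    msg_headers = msg_headers or {}
    filled = {}
    for name in field_names:
        if name in msg_headers:
            values = list(msg_headers[name])
            # Each iterable must supply exactly one value per message.
            if len(values) != n_msgs:
                raise ValueError('%r must have %d entries' % (name, n_msgs))
        else:
            # No value supplied: default to 0 for every message.
            values = [0] * n_msgs
        filled[name] = values
    return filled


print(fill_headers(2, ['io_groups']))  # {'io_groups': [0, 0]}
print(fill_headers(2, ['io_groups'], {'io_groups': [1, 2]}))  # {'io_groups': [1, 2]}
```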

larpix.format.rawhdf5format.len_rawfile(filename, attempts=1)
Check the total number of messages in a file
Parameters:
- filename – filename to check
- attempts – a parameter only relevant if file is being actively written to by another process, specifies number of refreshes to try if a synchronized state between the datasets is not achieved. A value less than 0 busy blocks until a synchronized state is achieved. A value greater than 0 tries to achieve synchronization a max of attempts before throwing a RuntimeError. And a value of 0 does not attempt to synchronize (not recommended).
Returns: int number of messages in file
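
The three regimes of the attempts parameter can be illustrated with a generic retry loop. This is a minimal sketch under stated assumptions: synchronize, is_synchronized and refresh are hypothetical names, and the way larpix actually refreshes and compares the datasets is not shown:

```python
def synchronize(is_synchronized, refresh, attempts=1):
    """Call `refresh` until `is_synchronized()` is true, following `attempts`.

    Hypothetical helper: both callables are placeholders, not functions
    provided by larpix.
    """
    if attempts == 0:
        # Do not attempt to synchronize at all.
        return
    tries = 0
    while not is_synchronized():
        # A positive value caps the number of refreshes; a negative value
        # never reaches this branch and so blocks until synchronized.
        if attempts > 0 and tries >= attempts:
            raise RuntimeError('not synchronized after %d attempts' % attempts)
        refresh()
        tries += 1


# Toy state standing in for two datasets that are briefly out of step.
state = {'msgs': 3, 'msg_headers': 1}


def is_synchronized():
    return state['msgs'] == state['msg_headers']


def refresh():
    # Pretend the writer completes one more header per refresh.
    state['msg_headers'] += 1


synchronize(is_synchronized, refresh, attempts=-1)
print(state)  # {'msgs': 3, 'msg_headers': 3}
```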

larpix.format.rawhdf5format.from_rawfile(filename, start=None, end=None, version=None, io_version=None, msg_headers_only=False, mask=None, attempts=1)
Read a chunk of bytestring messages from an existing file
Parameters:
- filename – filename to read bytestrings from
- start – index for the start position when reading from the file (default = None). If a value less than 0 is specified, index is relative to the end of the file. If None is specified, data is read from the start of the file. If a mask is specified, does nothing.
- end – index for the end position when reading from the file (default = None). If a value less than 0 is specified, index is relative to the end of the file. If None is specified, data is read until the end of the file. If a mask is specified, does nothing.
- version – required version compatibility. If None specified, uses the version stored in the file metadata
- io_version – required io version compatibility. If None specified, does not check the io_version file metadata
- msg_headers_only – optional flag to only load header information and not message bytestrings ('msgs' value in return dict will be None if msg_headers_only=True)
- mask – boolean mask alternative to start and end chunk specification to indicate specific file rows to load. Boolean 1D array with length equal to len_rawfile(filename)
- attempts – a parameter only relevant if file is being actively written to by another process, specifies number of refreshes to try if a synchronized state between the datasets is not achieved. A value less than 0 busy blocks until a synchronized state is achieved. A value greater than 0 tries to achieve synchronization a max of attempts before throwing a RuntimeError. And a value of 0 does not attempt to synchronize (not recommended).
Returns: dict with keys for 'created', 'modified', 'version', and 'io_version' metadata, along with 'msgs' (a list of bytestring messages) and 'msg_headers' (a dict with message header field name: list of message header field data, 1 per message)
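
How start, end and mask interact can be sketched without touching a file. The helper select_rows is hypothetical and assumes that start and end resolve like a Python slice (negative values count from the end, end is exclusive); the description above confirms only the negative-index and None behaviour:

```python
def select_rows(n, start=None, end=None, mask=None):
    """Return the row indices that would be read from a file of n messages."""
    if mask is not None:
        if len(mask) != n:
            raise ValueError('mask must have one entry per message')
        # start and end are ignored when a mask is given.
        return [i for i, keep in enumerate(mask) if keep]
    # slice.indices resolves None and negative values against length n.
    return list(range(*slice(start, end).indices(n)))


print(select_rows(5))                  # [0, 1, 2, 3, 4]
print(select_rows(5, start=-2))        # [3, 4]
print(select_rows(5, start=1, end=3))  # [1, 2]
print(select_rows(5, start=1, mask=[True, False, False, True, False]))  # [0, 3]
```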