IO Schema¶
The output of Chimbuko’s AD module is serialized in JSON format. This includes the streaming output sent between the parameter server and the visualization module, and the contents of the provenance database. In this section we provide the schema for these data.
Provenance Database Schema¶
Main database¶
Below we describe the JSON schema for the anomalies and normalexecs collections of the main database component of the provenance database.
Function event schema¶
This section describes the JSON schema for the anomalies and normalexecs collections. The fields of the JSON object are bolded, and a brief description follows the colon (:).
Function execution “events” in Chimbuko are labeled by a string, unique for each process, of the following form: “$RANK:$IO_STEP:$IDX” (e.g. “0:12:225”), where RANK, IO_STEP and IDX are the MPI rank, the IO step and an integer index, respectively, and $VAL indicates the numerical value of the variable VAL. We will refer to such a string as an “event label” below.
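As an illustration of the event label format, the following minimal Python sketch splits a label into its three components and rebuilds it. The names EventLabel, parse_event_label and format_event_label are hypothetical helpers introduced here and are not part of Chimbuko; the sketch assumes all three components are non-negative integers written in decimal.

```python
from typing import NamedTuple


class EventLabel(NamedTuple):
    """Components of an event label "$RANK:$IO_STEP:$IDX" (hypothetical helper type)."""
    rank: int     # MPI rank
    io_step: int  # IO step
    idx: int      # integer index


def parse_event_label(label: str) -> EventLabel:
    """Split an event label such as "0:12:225" into its integer components."""
    parts = label.split(":")
    if len(parts) != 3:
        raise ValueError(
            f"expected 3 colon-separated fields, got {len(parts)}: {label!r}"
        )
    rank, io_step, idx = (int(p) for p in parts)
    return EventLabel(rank, io_step, idx)


def format_event_label(rank: int, io_step: int, idx: int) -> str:
    """Build the event label string from its components."""
    return f"{rank}:{io_step}:{idx}"
```

For example, parse_event_label("0:12:225") yields rank 0, IO step 12 and index 225, and passing those values to format_event_label gives back the original string.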
For the SSTD (original) algorithm, the algo_params field has the following format:
The schema for the gpu_location field is as follows:
and for the gpu_parent field:
Note that Tau considers a GPU device/context/stream much in the same way as a CPU thread, and assigns it a unique index. This index is the “thread index” for GPU events.
Metadata schema¶
Metadata are stored in the metadata collection in the following JSON schema:
Note that the tid (thread index) for metadata is usually 0, except for metadata associated with a GPU context/device/stream, for which the index is the virtual thread index assigned by Tau to that context/device/stream.
Global database¶
Below we describe the JSON schema for the func_stats and counter_stats collections of the global database component of the provenance database.
Function profile statistics schema¶
func_stats contains aggregated profile information for all functions. The JSON schema is as follows:
Counter statistics schema¶
The counter_stats collection has the following schema:
Parameter Server Streaming Output¶
For every IO frame, the AD instances send three pieces of information to the parameter server (pserver):
For every function execution in the IO frame, the inclusive and exclusive runtimes and the number of anomalies for this function. These are aggregated over all IO frames and ranks on the parameter server and represent the function profile.
The total number of anomalies detected in the IO frame.
Statistics on the values of each counter over the IO step (e.g. for a memory usage counter this would be the mean, std. dev., etc. of the memory usage over the IO frame). These are aggregated over all IO frames and ranks on the parameter server.
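To make the idea of aggregating statistics over IO frames and ranks concrete, the sketch below shows one common way to combine per-frame summaries without retaining the individual samples: each summary holds a count, a mean and a sum of squared deviations, and two summaries are merged pairwise. This is an illustrative assumption and not necessarily how the parameter server is implemented; the RunningStats class and its methods are hypothetical names, and the standard deviation shown is the population (not sample) value.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Summary of a set of values (hypothetical helper, not Chimbuko's class)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def push(self, x: float) -> None:
        """Add one value using Welford's online update."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two summaries as if all their values had been pushed into one."""
        n = self.count + other.count
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(n, mean, m2)

    @property
    def stddev(self) -> float:
        """Population standard deviation of the summarized values."""
        return math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0


def summarize(values) -> RunningStats:
    """Summarize the values seen in one IO frame on one rank."""
    stats = RunningStats()
    for v in values:
        stats.push(v)
    return stats
```

Merging summarize(frame_a) with summarize(frame_b) gives the same count, mean and standard deviation (up to floating-point rounding) as summarizing the concatenation of both frames, which is why the aggregate can be maintained incrementally as data from each frame and rank arrives.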
The parameter server optionally sends data to an external webserver as JSON-formatted packets over HTTP using libcurl at some fixed frequency (independent of the frequency of IO steps in the trace data collection). This communication is handled by the PSstatSender class. The data packet is a JSON object comprising two payloads: anomaly_stats and counter_stats. Note that counters are integer-valued quantities that are typically hardware counters but also include information such as MPI communication packet sizes. The data packet JSON object has the following schema:
Note that the anomaly_stats entry will only be present if data has been received from the AD instances since the last send, and the counter_stats array will only appear if counters have ever been collected.
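Because either payload may be absent from a given packet, a receiver should not assume that both keys exist. The following is a minimal sketch of how a consumer might handle this in Python; the function name split_packet is hypothetical, and the sketch relies only on the two top-level keys named above, making no assumptions about the contents of either payload.

```python
import json
from typing import Any, Optional, Tuple


def split_packet(raw: str) -> Tuple[Optional[Any], Optional[Any]]:
    """Return (anomaly_stats, counter_stats) from a data packet.

    Either element is None when the corresponding key is absent.
    """
    packet = json.loads(raw)
    if not isinstance(packet, dict):
        raise ValueError("expected a JSON object at the top level")
    # anomaly_stats is absent if no data has arrived from the AD instances
    # since the last send.
    anomaly_stats = packet.get("anomaly_stats")
    # counter_stats is absent if no counters have ever been collected.
    counter_stats = packet.get("counter_stats")
    return anomaly_stats, counter_stats
```

A caller can then test each returned value against None before processing it, for example skipping the anomaly update entirely when the first element is None.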
The schema for the ‘anomaly_stats’ object is as follows: