Overview¶

The anomaly detection (ADM) module consists of three components: on-node anomaly detection (AD), parameter server (PS) and provenance database (ProvDB).

Anomaly detection module architecture — Anomaly detection (AD) module: on-node AD module and paramter server (PS).¶

As described by the diagram above, an instrumented application communicates trace information to an instance of the AD, whose role it is to decide whether a function execution was anomalous. The decision is based upon globally aggregated function statistics that are collected on the PS and kept in sync with the AD instances. The PS also fulfils the role of collecting global statistics (number of anomalies, various counters) to forward to the external visualization module.

Detailed information about each anomaly is collected by the AD instances and forwarded to the ProvDB, which can be queried both online and offline to obtain more information.

On-node AD Module¶

The on-node anomaly detection (AD) module (per application process) takes streamed trace data provided by the TAU-instrumented binary. Each AD instance parses the streamed trace data and maintains a function call stack along with any communication events (if available) and counters. The anomaly detection algorithm compares the function execution to those it has seen previously, from which a decision is made as to whether the execution is anomalous. Detailed “provenance” information is collected for every anomaly as well as a limited number of “normal” executions (kept for comparison purposes) and stored in a database. Any remaining trace data is periodically discarded. By focusing primarily on anomalous events, Chimbuko significantly reduces the trace data volume while keeping those events that are important for understanding performance issues.

Below we provide a brief overview of the steps performed by the AD in processing the data.

Parser¶

Currently, the trace data is streamed from a TAU-instrumented binary via ADIOS2. The streaming is performed in a “stepped” fashion, whereby trace data is collected over some short time period and then communicated to the AD instances, after which the next step begins. We will refer to these steps as “I/O steps” or “I/O frames” throughout this document. We provide a class, ADParser to connect to an ADIOS2 writer side and fetch necessary data for the performance analysis. The parser performs some rudimentary validation on the data, and, depending on the context might augment the data with additional tags; for example, when running a workflow with multiple distinct binaries, the parser appends a ‘program index’ to the data to allow the programs to be differentiated within Chimbuko.

Pre-processing¶

In the pre-processing step, the AD takes the data for the present I/O step and generates a call stack tree for each program index, rank and thread (See class ADEvent). Communication events, GPU kernel launches and counters that occurred during the function execution are associated with the calls and the inclusive (including child calls) and exclusive (excluding child calls) runtimes are computed.

Anomaly detection¶

With the preprocessed data the AD instance applies its anomaly detection algorithm to filter the data. In order to provide the most complete and robust information to the algorithm, function execution statistics computed locally are aggregated globally with the data on the Parameter Server (See ADOutlier) prior to performing the anomaly detection.

Anomaly detection algorithm¶

At present we use the following algorithm to detect anomalous function executions: An anomaly function call is a function call that has a longer (or shorter) execution time than a upper (or a lower) threshold. By default the exclusive execution time (excluding child calls) is used but the inclusive time can be used with the appropriate command line option to the AD. The thresholds are defined as follows:

\[\begin{split}threshold_{upper} = \mu_{i} + \alpha * \sigma_{i} \\ threshold_{lower} = \mu_{i} - \alpha * \sigma_{i}\end{split}\]

where \(\mu_{i}\) and \(\sigma_{i}\) are the mean and standard deviation of the execution time of a function \(i\), respectively, and \(\alpha\) is a control parameter (smaller values increase the number of anomalies detected while potentially increasing the number of false-positives).

(See ADOutlier and RunStats).

Advanced anomaly analysis¶

Histogram Based Outlier Score (HBOS) is a deterministic and non-parametric statistical anomaly detection algorithm. It is implemented as part of Chimbuko’s anomaly analysis module. HBOS is an unsupervised anomaly detection algorithm which scores data in linear time. It supports dynamic bin widths which ensures long-tail distributions of function executions are captured and global anomalies are detected better. HBOS normalizes the histogram and calculates the anomaly scores by taking inverse of estimated densities of function executions. The score is a multiplication of the inverse of the estimated densities given by the following Equation

\[HBOS_{i} = \log_{2} (1 / density_{i} + \alpha)\]

where \(i\) is the index of a particular a function execution and \(density_{i}\) is function execution probability. The offset \(\alpha\) is chosen such that the scores lie in the range \(0\to 100\). HBOS works in \(O(nlogn)\) using dynamic bin-width or in linear time \(O(n)\) using fixed bin width. After scoring, the top 1% of scores are filtered as anomalous function executions. This filter value can be set at runtime to adjust the density of detected anomalies.
Another algorithm is added into Chimbuko’s advanced anomaly analysis called the COPula based Outlier Detection (COPOD), which is a deterministic, parameter-free anomaly detection algorithm. It computes empirical copulas for each sample in the dataset. A copula defines the dependence structure between random variables. For each sample in the dataset, COPOD algorithm computes left-tail empirical copula from left-tail empirical cumulative distribution function, right-tail copula from right-tail empirical cumulative distribution function, and a skewness-corrected empirical copula using a skewness coefficient calculated from left-tail and right-tail empirical cumulative distribution functions. These three computed values are interpreted as left-tail, right-tail, and skewness-corrected probabilities, respectively. Lowest probability value results in largest negative-log value, which is the score assigned to the sample in the dataset. Samples with the highest scores in the dataset are tagged as anomalous.

(See ADOutlier, HbosParam and CopodParam).

Provenance data collection¶

Once anomalous and non-anomalous functions are tagged, Chimbuko walks the call stack and generates detailed provenance information of the anomalous executions and a select number of normal executions (for comparison). (See ADAnomalyProvenance) These data are sent to the centralized “provenance database” for later analysis.

Stream local vizualisation data¶

The visualization module displays various real-time statistics such as the number of anomalies per rank. This information is collected by the AD (cf. ADLocalFuncStatistics and ADLocalCounterStatistics) and is aggregated on the parameter server, from which it is sent periodically (via curl) to the visualization module. The visualization module is capable also of interacting with the provenance database to obtain detailed information on specific anomalies per user request.

Post-processing¶

Once the data have been processed the call stack for the present I/O step is discarded and Chimbuko moves onto processing the next step. In this way the amount of trace data maintained is dramatically reduced to just the provenance data and any statistics that we maintain.

Parameter Server¶

The parameter server (PS) provides two services:

Maintain global anomaly algorithm parameters to provide consistent and robust anomaly detection power over the on-node AD modules.
Keep a global view of workflow-level performance trace analysis results which is streamed to the visualization server and also stored in the provenance database.

Design¶

Simple parameter server architecture — Parameter server architecture¶

(C)lients (i.e. on-node AD modules) send requests with their locally-computed anomaly detection algorithm parameters to be aggregated with the global parameters and the updated parameters returned to the client. Network communication is performed using the ZeroMQ library and using Cereal for data serialization.

For handling requests from a large number of connected clients the parameter server uses a ROUTER-DEALER pattern whereby requests are sent to a Frontend router and distributed over thread (W)orkers via the Backend router in round-robin fashion.

For the task of updating parameters, scalability is ensured by having each worker write to a separate parameter object in the In-Mem DB. These are periodically aggregated by a background thread and the global model updated, typically once per second such that the AD instances receive the most up-to-date model as possible. Global statistics on anomalies and counters are compiled in a similar way using the data sent from the AD instances (cf. global_anomaly_stats and global_counter_stats) and also stored in this database.

A dedicated (S)treaming thread (cf. PSstatSender) is maintained that periodically sends the latest global statistics to the visualization server.

Anomaly ranking metrics¶

Two metrics are developed that are assigned to each outlier that allow the user to focus on the subset of anomalies that are most important: the anomaly score reflects how unlikely an anomaly is, and the anomaly severity reflects how important the anomaly is to the runtime of the application. PS includes these values in the provenance information and allow for the convenient sorting and filtering of the anomalies in post-analysis. We have tested to present the individual choice of these metrics in the online visualization module.

Provenance Database¶

The role of the provenance database is four-fold:

To store detailed information about anomalous function executions and also, for comparison, samples of normal executions. Stored in the anomaly database.
To store a wealth of metadata regarding the characteristics of the devices upon which the application is running. Stored in the anomaly database.
To store globally-aggregated function profile data and counter averages that reflect the overall running of the application. Stored in the global database.
To store the final AD model for each function. Stored in the global database.

The databases are implemented using Sonata which implements a remote database built on top of UnQLite, a server-less JSON document store database. Sonata is capable of furnishing many clients and allows for arbitrarily complex document retrieval via jx9 queries.

In order to improve parallelizability the anomaly database, which stores the provenance data and metadata collected from the AD instances, is sharded (split over multiple independent instances), with the AD instances assigned to a shared in a round-robin fashion based on their rank index. The database shared are organized into three collections:

anomalies : the anomalous function executions
normalexecs : the samples of normal function executions
metadata : the metadata describing the machine/devices

The global database exists as a single shard, and is written to at the very end of the run by the parameter server, which maintains the globally-aggregated statistics. This database comprises two collections:

func_stats : function profile information including statistics (mean, std. dev., etc) on the inclusive and exclusive runtimes as well as the number and frequency of anomalues.
counter_stats : averages of various counters collected over the run.
ad_model : the final AD model for each function.

The schemata for the contents of both database components are described here: Provenance Database Schema