IntelLabs / HDFIT.ScriptsHPC

This repository contains an HPC (High Performance Computing) reliability benchmark, carrying out fault injection experiments on a variety of HPC applications, targeting BLAS (Basic Linear Algebra Subroutines) GEMM (GEneral Matrix Multiply) operations.

DISCONTINUATION OF PROJECT

This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.

HDFIT.ScriptsHPC

This repository is part of the Hardware Design Fault Injection Toolkit (HDFIT). HDFIT enables end-to-end fault injection experiments and also comprises HDFIT.NetlistFaultInjector and HDFIT.SystolicArray.

HDFIT HPC Toolchain

This repository contains the main components of the HDFIT HPC reliability benchmark, which carries out fault injection experiments on a variety of HPC applications. The experiments target either BLAS GEMM operations (using the proof-of-concept systolic array design implemented in HDFIT.SystolicArray) or generic floating-point unit (FPU) operations.

Directory Structure

The repository is structured in the following directories:

  • apps: contains code to clone the set of HPC applications supported for BLAS GEMM fault injection and to apply the patches that enable HDFIT in them. This directory also contains application configurations that can be used to run experiments. Once compiled, the applications can be executed from this location.
  • apps_llvm: similar to the above, but focuses on LLVM-based instrumentation for a specific set of HPC applications, targeting generic FPU operations. Application configurations are included; the source code does not need to be patched.
  • test: contains scripts to configure and run HDFIT fault injection experiments on the supported HPC applications. The scripts can be used both serially and on distributed HPC clusters for large-scale runs.
  • plot: contains a series of Python scripts that can be used to process the CSV files produced by HDFIT experiments, in order to generate useful plots and metrics.

For additional details about the components of the HDFIT HPC reliability benchmark, please refer to the README documents in each directory.

External Dependencies

The main external dependencies of the HPC reliability benchmark are the custom HDFIT OpenBLAS library supporting fault injection, as well as the custom HDFIT version of the LLTFI framework. Before compiling the HPC applications and running experiments, users need to point to the build directories of both, by setting the OPENBLAS_ROOT and LLTFI_ROOT variables in the config.mk file. Setting LLTFI_ROOT is required only for compiling and using the apps_llvm part of the reliability benchmark - for additional details please refer to the apps_llvm README document.
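Before starting a long build, it can help to confirm that both variables are actually set. The following is a minimal Python sketch of such a check, not part of HDFIT. It assumes that config.mk uses plain one-line make assignments (for example `OPENBLAS_ROOT = /path/to/build`) and that it is run from the directory containing config.mk; the helper names `read_make_variables` and `missing_roots` are hypothetical.

```python
import re
from pathlib import Path

# Matches simple make assignments such as "OPENBLAS_ROOT = /path" (also := and ?=).
# Assumption: config.mk defines these variables with plain one-line assignments.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*[:?]?=\s*(.*?)\s*$")


def read_make_variables(text):
    """Return a dict of variable name -> value from simple make assignments."""
    variables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing make comments
        match = ASSIGNMENT.match(line)
        if match:
            variables[match.group(1)] = match.group(2)
    return variables


def missing_roots(variables, required=("OPENBLAS_ROOT",)):
    """List required variables that are unset or do not name a directory."""
    return [
        name
        for name in required
        if not variables.get(name) or not Path(variables[name]).is_dir()
    ]


if __name__ == "__main__":
    config = Path("config.mk")
    if config.is_file():
        values = read_make_variables(config.read_text())
        # LLTFI_ROOT only matters when building the apps_llvm part.
        print("Unset or invalid:", missing_roots(values, ("OPENBLAS_ROOT", "LLTFI_ROOT")))
    else:
        print("config.mk not found in the current directory")
```

The check only verifies that each variable names an existing directory; it does not confirm that the directory holds a valid HDFIT OpenBLAS or LLTFI build.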

There are other dependencies required to compile the HPC applications and use the Python plotting scripts. Compiling the applications requires make, cmake, autoconf, pkgconf, MPI and a functional gcc and gfortran toolchain. The plotting scripts require Python 3 with the numpy, matplotlib and seaborn packages. The netCDF4 Python package is needed only for experiments with the MiniWeather HPC application. Compiling and using the applications in the apps_llvm directory additionally requires a functional LLVM toolchain (version 15 or above). More details can be found in the README documents in the apps, apps_llvm and plot directories.
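The dependency list above can be checked up front with a short script. The following Python sketch is an illustration only and is not shipped with HDFIT. It assumes that MPI is detectable through an `mpicc` compiler wrapper on PATH, which may not hold for every MPI installation, and it does not check the LLVM toolchain or its version. The function names are hypothetical.

```python
import importlib.util
import shutil

# Build tools named in the text. "mpicc" is an assumed stand-in for "MPI";
# adjust it to match the local MPI installation.
BUILD_TOOLS = ("make", "cmake", "autoconf", "pkgconf", "mpicc", "gcc", "gfortran")
PLOT_PACKAGES = ("numpy", "matplotlib", "seaborn")
OPTIONAL_PACKAGES = ("netCDF4",)  # only needed for MiniWeather experiments


def missing_tools(tools):
    """Return the executables from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_packages(packages):
    """Return the Python packages from `packages` that cannot be located."""
    return [pkg for pkg in packages if importlib.util.find_spec(pkg) is None]


if __name__ == "__main__":
    print("Missing build tools:", missing_tools(BUILD_TOOLS))
    print("Missing plot packages:", missing_packages(PLOT_PACKAGES))
    print("Missing optional packages:", missing_packages(OPTIONAL_PACKAGES))
```

An empty list in each line of output means that every listed item was found; it does not guarantee that the installed versions are compatible.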

Getting Started

The basic process to run and analyze HDFIT fault injection tests comprises the following steps. The example uses the GEMM-focused part of the reliability benchmark (apps directory); the same steps apply to LLVM-based FPU fault injection (apps_llvm directory). First, the HPC applications need to be compiled:

cd apps && make all

Then, a test can be run. As an example, we consider the CP2K application with the C2H4 input, which performs 5,000 fault injection runs by default:

cd cp2k && ../../test/HDFIT_runner.sh CP2K-test-C2H4.env

This will eventually produce an out.C2H4 directory containing the experiment's results and a CSV summary. Note that the output of each application run is not printed to the shell, but is directed to separate log files (e.g., out.C2H4/fi-transient/run10.log). The CSV summary file can then be fed into the HDFIT plotting scripts, for example to produce an SDE error curve:

cd out.C2H4 && python3 ../../../plot/HDFIT_plot_error_curve.py HDFIT-CP2K-C2H4-29.08.2022-transient.csv

This will produce an image file containing the desired plot and display several statistical metrics. Further analysis can be conducted by using the output files resulting from each application run under fault injection.
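As a starting point for such further analysis, the Python sketch below locates the per-run log files and reports the shape of the CSV summary. It is an illustration under stated assumptions, not part of the HDFIT scripts. It assumes that logs are named run<N>.log, following the run10.log example, and that the CSV summary is comma-separated with a header row. It makes no assumption about the column names. Both function names are hypothetical.

```python
import csv
import re
from pathlib import Path

# Assumption: per-run logs are named run<N>.log, as in the run10.log example.
RUN_LOG = re.compile(r"^run(\d+)\.log$")


def run_logs(directory):
    """Return (run_number, path) pairs for run<N>.log files, ordered by N."""
    found = []
    for path in Path(directory).iterdir():
        match = RUN_LOG.match(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    # Sort numerically so that run2.log comes before run10.log.
    return sorted(found, key=lambda item: item[0])


def summarize_csv(path):
    """Return (column_names, data_row_count) for a CSV file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        data_rows = sum(1 for _ in reader)
    return header, data_rows
```

For the CP2K example, `run_logs("out.C2H4/fi-transient")` called from the apps/cp2k directory would return the logs in run order, and `summarize_csv` can be pointed at the CSV summary to see which columns are available before writing custom analysis code.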

License Terms

All original code that is part of the HDFIT HPC reliability benchmark is released under the terms of the GNU Lesser General Public License (LGPL) version 3 or (at your option) any later version. This includes all files in the plot and test directories of this repository.

The patch files for the individual HPC applications, as well as the associated input configurations, are instead released under the terms of the respective original licenses. This includes all files under the apps/resources and apps_llvm/resources directories. A copy of each application's license is included.
