Fault Tolerance in OpenMP Target Applications

Dec 6, 2022

CCES INTERNATIONAL WORKSHOP 2022

Pedro Henrique Di Francia Rosso

As High Performance Computing (HPC) systems see wider use, many approaches have been proposed to make HPC tools easier for end users. Because these systems have a large number of nodes, the probability of failures also increases, so Fault Tolerance (FT) must be addressed. This work builds on the OpenMP Target Library, a specialized OpenMP library for offloading computation to accelerators, called devices, such as GPUs. Recent works have proposed extending the device concept to include the nodes of a cluster, so that instead of offloading computation to accelerators within a single node, computation can be offloaded to multiple nodes using only OpenMP directives. We propose an FT model that preserves the ease of use of the OpenMP Target Library for end users. It is initially focused on and tested with one of the cluster proposals, with the intent of extending it into a generic model for any accelerator implemented in the library. Using algorithms to detect failures and to manage checkpoint/restart in distributed systems, and taking the system state into account (e.g., task execution, data distribution), our model handles failures automatically through checkpoint/restart with graceful degradation (continuing execution on the remaining nodes) while remaining almost fully transparent to end users.

