Dec 6, 2022
CCES INTERNATIONAL WORKSHOP 2022
Pedro Henrique Di Francia Rosso
As High Performance Computing (HPC) systems see wider use, many approaches have been proposed to make HPC tools easier for end users. Because these systems have a large number of nodes, the probability of failure also increases, so Fault Tolerance (FT) must be addressed. This work builds on the OpenMP Target Library, a specialized OpenMP library for offloading computation to accelerators, called devices, such as GPUs. Recent works have proposed extending the device concept to include the nodes of a cluster, so that instead of offloading computation to accelerators within a single node, multiple nodes can be used through OpenMP directives alone. We propose an FT model that preserves the ease of use that the OpenMP Target Library offers end users. The model is initially focused on and tested with one of the cluster proposals, but is intended to be expanded and made generic for any accelerator implemented in the library. Using algorithms that detect failures and manage checkpoint/restart in distributed systems, while taking the system state into account (e.g., task execution and data distribution), our model automatically handles failures through checkpoint/restart with graceful degradation (continuing execution on the remaining nodes) while remaining almost fully transparent to end users of the tool.
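The abstract does not spell out the algorithms involved, so the following is only a toy, single-process sketch of the general idea of checkpoint/restart with graceful degradation. It is not the OpenMP Target Library implementation. The function `run_with_checkpoints`, the `fail_at` failure-injection table, and the node names are all hypothetical, introduced purely for illustration: tasks are handed out round-robin, completed results are checkpointed periodically, and when a node fails the state rolls back to the last checkpoint and execution continues on the remaining nodes.

```python
def run_with_checkpoints(tasks, nodes, fail_at, checkpoint_every=2):
    """Toy simulation of checkpoint/restart with graceful degradation.

    tasks: list of zero-argument callables (hypothetical units of work).
    nodes: list of node names (hypothetical).
    fail_at: dict mapping a node name to the number of tasks it completes
             before failing; stands in for a real failure detector.
    Returns (results ordered by task index, list of surviving nodes).
    """
    alive = list(nodes)
    done = {}                        # task index -> result (volatile state)
    checkpoint = {}                  # last saved copy of `done`
    ran = {n: 0 for n in nodes}      # tasks completed per node
    turn = 0

    while len(done) < len(tasks):
        if not alive:
            raise RuntimeError("all nodes failed; cannot continue")

        # Pick the lowest-numbered task that is not yet completed.
        task_id = next(i for i in range(len(tasks)) if i not in done)
        node = alive[turn % len(alive)]
        turn += 1

        if fail_at.get(node) == ran[node]:
            # Failure detected: drop the node and roll back to the checkpoint.
            alive.remove(node)
            done = dict(checkpoint)
            continue

        done[task_id] = tasks[task_id]()
        ran[node] += 1

        # Save a checkpoint every `checkpoint_every` completed tasks.
        if len(done) % checkpoint_every == 0:
            checkpoint = dict(done)

    return [done[i] for i in range(len(tasks))], alive


if __name__ == "__main__":
    work = [lambda i=i: i * i for i in range(6)]
    results, survivors = run_with_checkpoints(
        work, ["n0", "n1", "n2"], fail_at={"n1": 1}
    )
    print(results, survivors)
```

In this sketch the caller only supplies the work and the node list. Failure handling happens inside the runner, which loosely mirrors the transparency goal described above. A real distributed implementation would also need to restore data distribution and in-flight task state, which this simulation does not attempt.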