Compiling and Optimizing OpenMP 4.X Programs to OpenCL and SPIR

Given their massively parallel computing capabilities, heterogeneous architectures composed of CPUs and accelerators have been increasingly used to speed up scientific and engineering applications. Nevertheless, programming such architectures is a challenging task for most non-expert programmers, as typical accelerator programming languages (e.g. CUDA and OpenCL) demand a thorough understanding of the underlying hardware to enable an effective application speed-up. To achieve that, programmers are usually required to significantly change and adapt program structures and algorithms, thus impacting both performance and productivity. A simpler alternative is to use high-level directive-based programming models like OpenACC and OpenMP. These models allow programmers to insert both directives and runtime calls into existing source code, thus providing hints to the compiler and runtime to perform certain transformations and optimizations on the annotated code regions. In this paper, we present AClang, an open-source LLVM/Clang compiler framework (http://www.aclang.org) that implements the recently released OpenMP 4.X Accelerator Programming Model. AClang automatically converts OpenMP 4.X annotated program regions into OpenCL/SPIR kernels, while providing a set of polyhedral-based optimizations like tiling and vectorization. OpenCL kernels resulting from AClang can be executed on any OpenCL/SPIR-compatible acceleration device: not only GPUs, but also FPGA accelerators like those found in the Intel HARP architecture. To the best of our knowledge, and at the time this paper was written, this is the first LLVM/Clang implementation of the OpenMP 4.X Accelerator Model that provides a source-to-target OpenCL conversion.
Experiments using AClang on the Polybench benchmark reveal speed-ups of up to 30x on an Exynos 8890 octa-core CPU with an ARM Mali-T880 MP12 GPU, up to 62x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU, and up to 112x on a 2.1 GHz 32-core Intel Xeon processor equipped with a Tesla K40c GPU.
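
To make the notion of an "OpenMP 4.X annotated program region" concrete, the following is a minimal sketch of a loop marked for offloading with standard OpenMP 4.x directives. The function name `vec_add` and its parameters are hypothetical and chosen only for illustration; the abstract does not show any code, and the exact set of clauses AClang accepts is not stated here, so only the standard `target`, `map`, and `parallel for` constructs are used.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise vector addition annotated for accelerator offloading.
 * Hypothetical example: names and signature are not taken from the paper.
 * If the compiler has no OpenMP support, the pragmas are ignored and the
 * loop simply runs sequentially on the host. */
void vec_add(const float *a, const float *b, float *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n >= 0);

    /* map(to: ...) copies the inputs to the device before the region runs;
     * map(from: ...) copies the result back to the host afterwards. */
    #pragma omp target map(to: a[0:n], b[0:n]) map(from: c[0:n])
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

A source-to-OpenCL compiler would plausibly turn the loop body into a kernel in which each work-item handles one value of `i`, with the `map` clauses driving the host-to-device buffer transfers. That mapping is an assumption about the general approach, not a description of AClang's actual generated code.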

Full Article URL:

https://www.researchgate.net/publication/319138649_Compiling_and_Optimizing_Open…
