Although heterogeneous computing has enabled impressive program speed-ups, knowledge about the architecture of the target device is still critical to reap full hardware benefits. Programming such architectures is complex and is usually done by means of specialized languages (e.g. CUDA, OpenCL). The cost of moving and keeping host/device data coherent may easily eliminate any performance gains achieved by acceleration. Although this problem has been extensively studied for multicore architectures and was recently tackled in discrete GPUs through CUDA8, no generic solution exists for integrated CPU/GPUs architectures like those found in mobile devices (e.g. ARM Mali). This paper proposes Data Coherence Analysis (DCA), a set of two data-flow analyses that determine how variables are used by host/device at each program point. It also introduces Data Coherence Optimization (DCO), a code optimization technique that uses DCA information to: (a) allocate OpenCL shared buffers between host and devices; and (b) insert appropriate OpenCL function calls into program points so as to minimize the number of data coherence operations. DCO was implemented in AClang LLVM (www.aclang.org) a compiler capable of translating OpenMP 4.X annotated loops to OpenCL kernels, thus hiding the complexity of directly programming in OpenCL. Experimental results using DCA and DCO in AClang to compile programs from the Parboil, Polybench and Rodinia benchmarks reveal performance speed-ups of up to 5.25x on an Exynos 8890 Octacore CPU with ARM Mali-T880 MP12 GPU and up to 2.03x on a 2.4 GHz dual-core Intel Core i5 processor equipped with an Intel Iris GPU unit.
Data Coherence Analysis and Optimization for Heterogeneous Computing Sousa, Rafael Cardoso Fernandes; Pereira, Marcio Machado; Pereira, Fernando Magno Quintao; Araujo, Guido; Computer Architecture and High Performance Computing (SBAC-PAD), 2017 29th International Symposium on , 9-16, 2017