The cloud has become a popular environment for running Deep Learning (DL) applications. Public cloud providers charge for the time resources are actually used, at an hourly price that depends on the configuration of the chosen cloud instance. Instances are usually provided as virtual machines (VMs) that give access to a certain hardware configuration and may also come with a pre-configured software environment. More advanced, and theoretically faster, VMs are usually more expensive, but they do not necessarily provide the best performance for every application. Therefore, to choose the best instance (or VM type), users must consider the relative performance, and the resulting cost, of different VMs when running their specific target application. Taking this into account, we propose a model to estimate the relative performance and cost of training DL applications on different VM instances. The model is built upon observations derived from the performance profiles of three different DL applications executed on 12 different public cloud instances. We argue that this model is a valuable tool for cloud users looking for optimal VM types to train their DL applications in the cloud.
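The trade-off described above, where a faster VM is not necessarily the cheaper one, can be illustrated with a minimal sketch. It is not the model proposed in the paper: it only assumes that cost equals hourly price multiplied by training time, and the VM names, prices, and training times are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VMProfile:
    """One candidate VM type. All values used below are hypothetical."""
    name: str
    price_per_hour: float  # hourly price in USD (made up for illustration)
    train_hours: float     # training time in hours (made up for illustration)


def training_cost(vm: VMProfile) -> float:
    """Cost of one training run, assuming billing proportional to time used."""
    return vm.price_per_hour * vm.train_hours


def relative_to(baseline: VMProfile, vm: VMProfile) -> tuple[float, float]:
    """Return (speedup, cost ratio) of `vm` relative to `baseline`."""
    speedup = baseline.train_hours / vm.train_hours
    cost_ratio = training_cost(vm) / training_cost(baseline)
    return speedup, cost_ratio


def cheapest(vms: list[VMProfile]) -> VMProfile:
    """VM type with the lowest total training cost."""
    return min(vms, key=training_cost)


def fastest(vms: list[VMProfile]) -> VMProfile:
    """VM type with the shortest training time."""
    return min(vms, key=lambda vm: vm.train_hours)


# Hypothetical candidates: not real instance types, prices, or measurements.
candidates = [
    VMProfile("vm_small", price_per_hour=1.0, train_hours=10.0),
    VMProfile("vm_medium", price_per_hour=2.0, train_hours=4.0),
    VMProfile("vm_large", price_per_hour=8.0, train_hours=2.0),
]

if __name__ == "__main__":
    baseline = candidates[0]
    for vm in candidates:
        speedup, cost_ratio = relative_to(baseline, vm)
        print(f"{vm.name}: speedup={speedup:.2f}x, cost ratio={cost_ratio:.2f}, "
              f"cost=${training_cost(vm):.2f}")
    print("fastest:", fastest(candidates).name)
    print("cheapest:", cheapest(candidates).name)
```

With these made-up numbers the fastest VM type and the cheapest one differ, which is the situation that makes a per-application estimate of relative performance and cost useful.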
R. K. Tesser, A. Marques and E. Borin, “Selecting efficient VM types to train deep learning models on Amazon SageMaker,” 2021 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), 2021, pp. 20-27. https://doi.org/10.1109/SBAC-PADW53941.2021.00014