TrIMS

Last updated on Nov 18, 2019

Deep neural networks (DNNs) have become core computation components within low-latency Function as a Service (FaaS) prediction pipelines. Cloud computing, as the de facto backbone of modern computing infrastructure, must handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amounts of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMS is a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of abstractions, application APIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We evaluate our solution by interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24× speedup in latency for image classification models, up to 210× speedup for large models, and up to 8× system throughput improvement.
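The idea of a model store spanning the GPU, CPU, local storage, and cloud storage hierarchy can be illustrated with a toy lookup that searches the fastest tier first and promotes a model on a hit. This is only a sketch under stated assumptions, not the TrIMS API: the class `TieredModelStore` and its methods `publish` and `load` are hypothetical names, the tiers are plain in-memory dictionaries, and isolation, eviction, and real GPU memory management are left out entirely.

```python
# Tier names follow the hierarchy named in the abstract, fastest first.
TIERS = ["gpu", "cpu", "local_storage", "cloud_storage"]


class TieredModelStore:
    """Toy model store (hypothetical): each tier maps a model name to weights."""

    def __init__(self):
        self.tiers = {name: {} for name in TIERS}

    def publish(self, model_name, weights, tier="cloud_storage"):
        # Place a model in a single tier, e.g. an upload to cloud storage.
        self.tiers[tier][model_name] = weights

    def load(self, model_name):
        """Return (weights, tier_hit) and promote the model to faster tiers."""
        for index, tier in enumerate(TIERS):
            if model_name in self.tiers[tier]:
                weights = self.tiers[tier][model_name]
                # Promote a reference into every faster tier so that later
                # requests are served from the top of the hierarchy.
                for faster in TIERS[:index]:
                    self.tiers[faster][model_name] = weights
                return weights, tier
        raise KeyError(f"model {model_name!r} not found in any tier")


store = TieredModelStore()
store.publish("resnet50", b"\x00" * 16)  # placeholder bytes, not real weights

first_weights, first_hit = store.load("resnet50")
second_weights, second_hit = store.load("resnet50")
print(first_hit, second_hit)            # cloud_storage gpu
print(first_weights is second_weights)  # True: one shared object, no copy
```

The second request is served from the fastest tier and receives the same immutable object as the first, which mirrors, in miniature, the point that constant model data can be shared between consumers rather than reloaded for each one.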

Cheng Li (李程)

PhD candidate in Computer Science

My research lies in the field of GPU-accelerated applications, with an emphasis on Deep Learning.

Talks

GTC 2019 - TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference

Mar 22, 2019 3:30 PM San Jose, CA
