Publications

The Design and Implementation of a Scalable DL Benchmarking Platform

The current Deep Learning (DL) landscape is fast-paced and is rife with non-uniform models, hardware/software (HW/SW) stacks, but lacks …

DLSpec: A Deep Learning Task Exchange Specification

Deep Learning (DL) innovations are being introduced at a rapid pace. However, the current lack of standard specification of DL tasks …

Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

As Deep Learning (DL) models have been increasingly used in latency-sensitive applications, there has been a growing interest in …

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application …

DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

The past few years have seen a surge of applying Deep Learning (DL) models for a wide array of tasks such as image classification, …

MLModelScope: Evaluate and Introspect Cognitive Pipelines

The current landscape of cognitive pipelines exercises many Machine Learning (ML) and Deep Learning (DL) building blocks. These ML and …

TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments

Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction …

Accelerating Reduction and Scan Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units …

Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects

Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory …

Accelerating Reduction Using Tensor Core Units

Driven by deep learning, there has been a surge of specialized processors for matrix multiplication, referred to as Tensor Core Units …

SCOPE: C3SR Systems Characterization and Benchmarking Framework

This report presents the design of the Scope infrastructure for extensible and portable benchmarking. Improvements in high-performance …

Matrix Factorization on GPUs with Memory Optimization and Approximate Computing

Matrix factorization (MF) discovers latent features from observations, which has shown great promise in the fields of collaborative …

RAI: A Scalable Project Submission System for Parallel Programming Courses

A major component of many advanced programming courses is an open-ended “end-of-term project” assignment. Delivering and evaluating …

KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism

Dynamic parallelism on GPUs simplifies the programming of many classes of applications that generate parallelizable work not known …

DjiNN and Tonic: DNN as a Service and Its Implications for Future Warehouse Scale Computers

As applications such as Apple Siri, Google Now, Microsoft Cortana, and Amazon Echo continue to gain traction, web-service companies are …

Stochastic Circuits for Real-Time Image-Processing Applications

Real-time image-processing applications impose severe design constraints in terms of area and power. Examples of interest include …