Wen-mei Hwu

The Design and Implementation of a Scalable DL Benchmarking Platform
DLSpec: A Deep Learning Task Exchange Specification
Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs
XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs
MLModelScope: Evaluate and Introspect Cognitive Pipelines
Accelerating Reduction and Scan Using Tensor Core Units
TrIMS: Transparent and Isolated Model Sharing for Low Latency Deep Learning Inference in Function as a Service Environments
Evaluating Characteristics of CUDA Communication Primitives on High-Bandwidth Interconnects
Accelerating Reduction Using Tensor Core Units
SCOPE: C3SR Systems Characterization and Benchmarking Framework
KLAP: Kernel Launch Aggregation and Promotion for Optimizing Dynamic Parallelism