
Keren Zhou

Latest

Profiling and Debugging GPU-accelerated AI Applications
Proton: Introduction and Development
Dev Tools: Proton/Interpreter
Triton Update
FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks
Update on Triton's Interpreter
Proton: A Profiler for Triton
Centimani: Enabling Fast AI Accelerator Selection for DNN Training with a Novel Performance Predictor
FASTEN: Fast GPU-accelerated Segmented Matrix Multiplication for Heterogenous Graph Neural Networks
PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode Transformation and Graph Compilation
Technical Review on PyTorch 2.0 and Triton
Hardware-Aware Compression with Random Operation Access Specific Tile (ROAST) Hashing
Towards Agile Development of Efficient Deep Learning Operators (Hardware Insights)
Towards Agile Development of Efficient Deep Learning Operators (Call for Contributions)
DrGPUM: Guiding Memory Optimization for GPU-Accelerated Applications
Semi-supervised learning for shale image segmentation with fast normalized cut loss
Towards Agile Development of Efficient Deep Learning Operators (Pre-MLIR)
Practical Performance Optimization for Deep Learning Applications
ValueExpert: Exploring Value Patterns in GPU-accelerated Applications
Accelerating High-order Stencils on GPUs
An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications
Low Overhead and Context Sensitive Profiling of GPU-Accelerated Applications
Paw-Net: Stacking Ensemble Deep Learning for Segmenting Scanning Electron Microscopy Images of Fine-grained Shale Samples
ValueExpert: Exploring Value Patterns in GPU-Accelerated Applications
Performance Measurement, Analysis, and Optimization of GPU-accelerated Applications
Analyzing GPU-accelerated Applications Using HPCToolkit
GPA: A GPU Performance Advisor Based on Instruction Sampling
GPA: A GPU Performance Advisor Based on Instruction Sampling
Measurement and Analysis of GPU-accelerated Applications with HPCToolkit
Measurement and Analysis of GPU-Accelerated OpenCL Computations on Intel GPUs
Outcomes of OpenMP Hackathon: OpenMP Application Experiences with the Offloading Model
GVProf: A Value Profiler for GPU-Based Clusters
Tools for Top-down Performance Analysis of GPU-Accelerated Applications
A Tool for Top-down Performance Analysis of GPU-accelerated Applications
A Tool for Top-down Performance Analysis of GPU-Accelerated Applications
GVPROF: A Value Profiler for GPU-Based Clusters
Tools for Top-down Performance Analysis of GPU-Accelerated Applications
Optimizing GPU-accelerated Applications with HPCToolkit
A Tool for Performance Analysis of GPU-accelerated Applications
A Tool for Performance Analysis of GPU-Accelerated Applications
Quadboost: A Scalable Concurrent Quadtree
A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability
Deep Learning on Modern Architectures
A Performance Analysis Framework for Exploiting GPU Microarchitectural Capability
Understanding the GPU Microarchitecture to Achieve Bare-Metal Performance Tuning
Convolution Methods
BF-MapReduce: A Bloom Filter Based Efficient Lightweight Search
Multi-Classes Feature Engineering with Sliding Window for Purchase Prediction in Mobile Commerce

© 2025 Keren Zhou. This work is licensed under CC BY NC ND 4.0
