We have proposed a paper on GPU performance analysis. Building on this work, we plan to extend our framework to a wider range of applications, multiple kernels, and several architectures. Designing user-friendly interfaces is another primary goal.
Kepler GEMM and Convolution
Aug.2016 - Feb.2017
I wrote GEMM and convolution functions in assembly for the Kepler GPU, achieving roughly 20% better performance than cuBLAS and 40% better than cuDNN. We wrote a paper on the GEMM design, to which I contributed the NT and NN implementations and integrated the functions into blitz. We also published another paper introducing the methodology we used for modeling GPU performance and analyzing program bottlenecks.
High Performance Neural Network
Nov.2015 - Dec.2016
I improved the performance of neural networks on several architectures. On CPU and MIC, I boosted performance with vectorization and blocking techniques; on GPU, I wrote assembly code to increase instruction bandwidth and data reuse. The system is up to six times faster than Caffe and two times faster than MKL2017 on a dual-socket E5-2670 v2 system.
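The blocking technique mentioned above can be sketched as follows. This is an illustrative cache-blocked matrix multiply, not the actual blitz kernel; the block size `BS` and function name are assumptions for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical block size; real kernels tune this per cache hierarchy.
constexpr std::size_t BS = 64;

// Cache-blocked C += A * B for square row-major matrices of size n x n.
// The blocked loop order keeps working sets in cache, and the contiguous
// innermost j loop is the one a compiler can auto-vectorize.
void gemm_blocked(const std::vector<float>& A, const std::vector<float>& B,
                  std::vector<float>& C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        float a = A[i * n + k];  // reused across the j loop
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Production kernels add register blocking and explicit SIMD intrinsics on top of this loop structure, but the cache-blocking idea is the same.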
I surveyed concurrent data structures and evaluated their performance. I published two papers: one presents a general method for developing concurrent structures; the other describes a P2P indexing system built on a concurrent skiplist.
I also developed the first lock-free quadtree, which achieves substantial speedups over traditional fine-grained locking versions. I wrote two technical reports presenting the design [2, 3].
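The core lock-free idea can be sketched with compare-and-swap (CAS). This is a minimal single-linked-list insertion, not the quadtree design from the reports; the `Node` layout and `insert` function are illustrative assumptions, and deletion/reclamation are omitted.

```cpp
#include <atomic>

// Illustrative node for a sorted lock-free linked list.
struct Node {
    int key;
    std::atomic<Node*> next;
    explicit Node(int k) : key(k), next(nullptr) {}
};

// Lock-free insertion: find the link to splice into, then publish the
// new node with a single CAS. If the CAS fails, another thread changed
// the link concurrently, so we retraverse and retry. A quadtree applies
// the same pattern to child pointers instead of `next` links.
void insert(std::atomic<Node*>& head, int key) {
    Node* node = new Node(key);
    while (true) {
        Node* prev = nullptr;
        Node* cur = head.load();
        while (cur && cur->key < key) {
            prev = cur;
            cur = prev->next.load();
        }
        node->next.store(cur);
        std::atomic<Node*>& link = prev ? prev->next : head;
        if (link.compare_exchange_weak(cur, node))
            return;  // node is now visible to all threads
        // CAS failed: restart the traversal from the head.
    }
}
```

The contrast with a fine-grained locking version is that no thread ever blocks: a failed CAS simply means another insertion won the race, and the loser retries on the updated structure.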