Song Han


Song Han is an assistant professor in the Electrical Engineering and Computer Science Department of the Massachusetts Institute of Technology (MIT). Dr. Han received his Ph.D. degree in Electrical Engineering from Stanford University, advised by Prof. Bill Dally. Dr. Han co-founded DeePhi Tech in 2016, a startup offering efficient solutions for deep learning computing (deep compression and hardware acceleration).

Dr. Han's research focuses on energy-efficient deep learning, at the intersection between machine learning and computer architecture. He proposed Deep Compression, which can compress deep neural networks by an order of magnitude without losing prediction accuracy. He designed EIE: Efficient Inference Engine, a hardware accelerator that can perform inference directly on the compressed sparse model, which saves memory bandwidth and results in significant speedup and energy saving. His work has been featured by TheNextPlatform, TechEmergence, Embedded Vision and O’Reilly. His research efforts in model compression and hardware acceleration received the Best Paper Award at ICLR’16 and the Best Paper Award at FPGA’17, and these technologies also led to his startup DeePhi Tech. Before joining Stanford, Song graduated from Tsinghua University. [Google Scholar]

I joined MIT EECS as an assistant professor (MIT news). I am looking for PhD students interested in deep learning and computer architecture. Below are the missions of HAN's Lab:

H: High performance, High energy efficiency Hardware

A: Architectures and Accelerators for Artificial Intelligence

N: Novel algorithms for Neural Networks and deep learning

S: Small models, Scalable Systems, and Specialized Silicon

In the post-ImageNet era, computer vision and machine learning researchers are solving more complicated AI problems using larger data sets, driving the demand for more computation.
However, we are in the post-Moore's Law world where the amount of computation per unit cost and power is no longer increasing at its historic rate. This mismatch between supply and demand for computation highlights the need for co-designing efficient machine learning algorithms and domain-specific hardware architectures.


  • July 2018: “AMC: AutoML for Model Compression and Acceleration on Mobile Devices” accepted by ECCV’18. This paper uses AI to do model compression, rather than relying on human heuristics. AMC can automate the model compression process, achieve a better compression ratio, and also be more sample efficient. It takes less time and does better than rule-based heuristics.

  • June 2018: Song presents invited paper “Bandwidth Efficient Deep Learning” at Design Automation Conference (DAC’18). The paper talks about techniques to save memory bandwidth, networking bandwidth, and engineer bandwidth for efficient deep learning.

  • May 2018: “Path-Level Network Transformation for Efficient Architecture Search” accepted by International Conference on Machine Learning (ICML’18).

  • Feb 26, 2018: Song presented “Bandwidth Efficient Deep Learning: Challenges and Trade-offs” at FPGA’18 panel session.

  • Jan 29, 2018: Deep Gradient Compression is accepted by ICLR’18. This technique can reduce the communication bandwidth by 500x and improve the scalability of large-scale distributed training. [slides].

  • Dec 6, 2017: Yi and Song presented “Fast-speed Intelligent Video Analytics” at NIPS 2017 demo session, Long Beach.

  • June 1, 2017: Song passed PhD defense. [video].

  • Feb 1 2017: Song presented “Efficient Methods and Hardware for Deep Learning” at Efficient Neural Network Summit, Cadence, San Jose.

  • Oct 28 2016: Song received Best Poster Award at 2016 Stanford Cloud Workshop for his poster entitled “Deep Compression, EIE and DSD: Deep Learning Model Compression, Acceleration, and Regularization”.

  • Sep 26 2016: Welcome Huizi and Chenzhuo, who are joining the CVA lab.

  • May 4 2016: Song received Best Paper Award in International Conference on Learning Representations (ICLR), San Juan, Puerto Rico.

Research Interest

I'm interested in application-driven, domain-specific computer architecture research. The end of Dennard scaling makes power the key constraint. I'm interested in achieving higher efficiency by tailoring the architecture to characteristics of the application domain. My current research centers around co-designing efficient algorithms and hardware systems for machine learning, to free AI from power-hungry hardware beasts, democratize AI to cheap mobile devices, and reduce the cost of running deep learning in data centers. I enjoy the research intersections across machine learning algorithms, computer architecture and VLSI design.

Research Projects


Pruning & Sparse NN: Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. Conventional networks fix the architecture before training starts; as a result, training cannot improve the architecture. To address these limitations, we describe a method to reduce the storage and computation required by neural networks by an order of magnitude by learning only the important connections. This reduced the number of parameters of AlexNet by a factor of 9× and that of VGGNet by 13×, without affecting their accuracy.

S. Han, J. Pool, J. Tran, W. J. Dally, “Learning both Weights and Connections for Efficient Neural Networks”, NIPS’15.
[pdf] [poster]
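
As an illustration of "learning only the important connections", here is a minimal sketch of magnitude-based pruning followed by a masked retraining step. It is not the code from the paper: the function names `prune_by_magnitude` and `masked_sgd_step`, the use of a single global magnitude threshold, and the plain-list weight layout are all assumptions made for this example.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a 2-D weight list.

    Returns (pruned_weights, mask); mask[i][j] is 1 for a kept connection.
    Ties at the threshold are all pruned, so slightly more may be removed.
    """
    magnitudes = sorted(abs(w) for row in weights for w in row)
    k = int(len(magnitudes) * sparsity)      # how many weights to remove
    threshold = magnitudes[k - 1] if k > 0 else -1.0
    mask = [[1 if abs(w) > threshold else 0 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask


def masked_sgd_step(weights, grads, mask, lr):
    """One retraining step that keeps pruned connections at zero."""
    return [[w - lr * g * m for w, g, m in zip(wr, gr, mr)]
            for wr, gr, mr in zip(weights, grads, mask)]


# Hypothetical 2x3 layer, pruned to 50% sparsity.
layer = [[0.9, -0.05, 0.3], [0.01, -0.7, 0.2]]
pruned, mask = prune_by_magnitude(layer, 0.5)
print(mask)    # which connections survive
print(pruned)
```

The mask is what lets retraining recover accuracy: surviving weights keep learning while removed connections stay removed.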


Deep Compression: Large deep neural network models improve prediction accuracy but result in large demand for memory access, which is 100× more power hungry than ALU operations. “Deep Compression” introduces a three-stage pipeline: pruning, trained quantization and Huffman coding, which work together to reduce the storage requirement of deep neural networks. On the ImageNet dataset, AlexNet was compressed by 35×, from 240MB to 6.9MB, and VGGNet was compressed by 49×, from 552MB to 11.3MB, without affecting their accuracy. This algorithm helps put deep learning into mobile apps.

S. Han, H. Mao, W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, ICLR’16. Best Paper Award.
[pdf] [model] [poster] [slides] [video]
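
The last two stages of the pipeline can be sketched roughly as follows: cluster the surviving weights into a small codebook (weight sharing), then Huffman-code the cluster indices. This is a simplified illustration under stated assumptions, not the released implementation: `kmeans_1d` and `huffman_code_lengths` are hypothetical helper names, the centroids are not fine-tuned by training as "trained quantization" implies, and the sparse-index encoding is left out.

```python
import heapq
from collections import Counter


def kmeans_1d(values, k, iters=20):
    """Cluster scalar weights into k shared values (requires k >= 2)."""
    lo, hi = min(values), max(values)
    # Assumption: centroids start evenly spaced over the weight range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iters):
        assign = [nearest(v) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:                       # leave empty clusters unchanged
                centroids[c] = sum(members) / len(members)
    return centroids, [nearest(v) for v in values]


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code."""
    counts = Counter(symbols)
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap entries: (count, unique tie-breaker, symbols in this subtree).
    heap = [(n, i, [s]) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, s1 = heapq.heappop(heap)
        n2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:                     # merged symbols get one bit deeper
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tie, s1 + s2))
        tie += 1
    return lengths


weights = [-1.0, -0.9, 0.0, 0.1, 1.0, 1.1]    # made-up surviving weights
codebook, indices = kmeans_1d(weights, k=3)
lengths = huffman_code_lengths(indices)
total_bits = sum(lengths[i] for i in indices)
print(codebook, indices, total_bits)
```

Each weight is then stored as a short index into the codebook, and frequent indices get shorter codes, which is where the extra savings over fixed-width indices come from.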


Efficient Inference Engine (EIE): Executing DNNs on inexpensive, low-power embedded platforms requires executing compressed, sparse DNNs. EIE is the first hardware accelerator for these highly efficient networks. EIE exploits weight sparsity and weight sharing, and can skip zero activations from ReLU. Evaluated on nine DNN benchmarks, EIE is 189× and 13× faster, and 24,000× and 3,000× more energy efficient, than a CPU and GPU respectively. EIE uses both distributed storage and distributed computation to parallelize a sparsified layer across multiple PEs, which achieves load balance and good scalability. EIE is covered by TheNextPlatform, HackerNews, TechEmergence and Embedded Vision.

S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. Horowitz, W. J. Dally, “EIE: Efficient Inference Engine on Compressed Deep Neural Network”, ISCA’16.
[pdf] [slides] [video]
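
To make the three sources of savings concrete, here is a small software sketch of a sparse matrix-vector product that stores only non-zero weights, looks each weight up in a shared codebook, and skips columns whose input activation is zero. It only models the arithmetic, not the hardware: the column-compressed layout, the convention that codebook index 0 means "pruned", and the names `to_csc` and `sparse_matvec` are assumptions for this example, and the partitioning of rows across PEs is omitted.

```python
def to_csc(index_matrix):
    """Compress a dense matrix of codebook indices (0 = pruned) by column.

    Returns (codes, row_idx, col_ptr): column j occupies the slice
    col_ptr[j]:col_ptr[j + 1] of codes and row_idx.
    """
    rows, cols = len(index_matrix), len(index_matrix[0])
    codes, row_idx, col_ptr = [], [], [0]
    for j in range(cols):
        for i in range(rows):
            if index_matrix[i][j] != 0:
                codes.append(index_matrix[i][j])
                row_idx.append(i)
        col_ptr.append(len(codes))
    return codes, row_idx, col_ptr


def sparse_matvec(codes, row_idx, col_ptr, codebook, activations, n_rows):
    """Compute W @ a using only non-zero weights and non-zero activations."""
    out = [0.0] * n_rows
    for j, a in enumerate(activations):
        if a == 0:                    # zero activation: skip the whole column
            continue
        for p in range(col_ptr[j], col_ptr[j + 1]):
            out[row_idx[p]] += codebook[codes[p]] * a   # shared-weight lookup
    return out


codebook = [0.0, 0.5, -1.0, 2.0]          # index 0 is reserved for pruned weights
index_matrix = [[0, 2, 0],
                [1, 0, 3]]                # i.e. W = [[0, -1, 0], [0.5, 0, 2]]
codes, row_idx, col_ptr = to_csc(index_matrix)
print(sparse_matvec(codes, row_idx, col_ptr, codebook, [2.0, 0.0, 1.0], 2))
```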


Efficient Speech Recognition Engine (ESE): ESE takes the approach of EIE one step further to address not only feedforward neural networks but also recurrent neural networks (RNN and LSTM). The recurrent nature produces complicated data dependency, which is more challenging than feedforward neural nets. To deal with this problem, we designed a data flow that can effectively schedule the complex LSTM operations using multiple EIE cores. ESE also presents an effective model compression algorithm for LSTM with hardware efficiency considerations, compressing the LSTM by 20× without hurting accuracy. Implemented on a Xilinx XCKU060 FPGA running at 200MHz, ESE has a processing power of 282 GOPS working directly on a compressed sparse LSTM network, corresponding to 2.52 TOPS on an uncompressed dense network.

S. Han, J. Kang, H. Mao, Y. Li, D. Xie, H. Luo, Y. Wang, H. Yang, W. J. Dally, “ESE: Efficient Speech Recognition Engine for Compressed LSTM”, FPGA’17. Best Paper Award.
[pdf] [slides]


Dense-Sparse-Dense Training (DSD): A critical issue for training large neural networks is to prevent overfitting while at the same time providing enough model capacity. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks to achieve higher accuracy. DSD training can improve the prediction accuracy of a wide range of neural networks: CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. The DSD training flow produces the same model architecture and doesn't incur any inference time overhead.

S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, W. J. Dally, “DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow”, ICLR’17.
[DSD model zoo] [pdf] [slides]
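
The three phases can be sketched on a toy problem as below. This is only a schematic of the flow under assumptions of my choosing: `dsd_train` is a hypothetical name, the model is a flat weight list trained by plain gradient descent on a caller-supplied `grad_fn`, pruning is by magnitude, pruned weights restart from zero in the final dense phase, and every phase uses the same learning rate and step count.

```python
def dsd_train(w, grad_fn, sparsity, steps, lr):
    """Run dense -> sparse -> dense training on a flat weight list.

    Returns (w_sparse, w_final, mask). w_final has the same shape as the
    original model, so inference cost is unchanged.
    """
    def train(weights, mask):
        for _ in range(steps):
            grads = grad_fn(weights)
            weights = [wi - lr * gi * mi
                       for wi, gi, mi in zip(weights, grads, mask)]
        return weights

    dense_mask = [1] * len(w)
    w = train(w, dense_mask)                          # 1) initial dense phase

    k = int(len(w) * sparsity)                        # 2) prune smallest weights
    smallest = set(sorted(range(len(w)), key=lambda i: abs(w[i]))[:k])
    mask = [0 if i in smallest else 1 for i in range(len(w))]
    w_sparse = train([wi if mi else 0.0 for wi, mi in zip(w, mask)], mask)

    w_final = train(w_sparse, dense_mask)             # 3) re-dense phase
    return w_sparse, w_final, mask


# Toy quadratic loss 0.5 * sum((w - target)^2), whose gradient is w - target.
target = [1.0, 0.1, -2.0, 0.05]
grad_fn = lambda w: [wi - ti for wi, ti in zip(w, target)]
w_sparse, w_final, mask = dsd_train([0.0] * 4, grad_fn, 0.5, 50, 0.5)
print(mask, w_sparse, w_final)
```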


Trained Ternary Quantization (TTQ): The deployment of large neural network models can be difficult for mobile devices with limited power budgets. To solve this problem, we propose Trained Ternary Quantization (TTQ), a method that can reduce the precision of weights in neural networks to ternary values. This method has very little accuracy degradation and can even improve the accuracy of some models. We highlight our trained quantization method that can learn both ternary values and ternary assignment. During inference, our models are nearly 16× smaller than full-precision models.

C. Zhu, S. Han, H. Mao, W. J. Dally, “Trained Ternary Quantization”, ICLR’17.
[code] [pdf]
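
A rough sketch of the forward quantization step: each weight is mapped to one of three values, a positive scale, zero, or a negative scale. The training that learns the two scales is not shown, and several details here are assumptions for illustration: the name `ternarize`, the parameters `w_pos` and `w_neg`, and a threshold set to a fraction `t` of the largest weight magnitude. The size estimate assumes 32-bit floating-point weights replaced by 2-bit codes.

```python
def ternarize(weights, w_pos, w_neg, t=0.05):
    """Map each weight to +w_pos, 0.0 or -w_neg using a magnitude threshold."""
    delta = t * max(abs(w) for w in weights)   # assumed threshold rule
    quantized = []
    for w in weights:
        if w > delta:
            quantized.append(w_pos)
        elif w < -delta:
            quantized.append(-w_neg)
        else:
            quantized.append(0.0)
    return quantized


layer = [0.8, -0.02, -0.5, 0.03, 1.0]          # made-up full-precision weights
print(ternarize(layer, w_pos=0.7, w_neg=0.6))

# Three states fit in 2 bits, versus 32 bits for a float weight.
print(32 / 2)   # upper bound on the size reduction, before storing the scales
```

Storing the per-layer scales adds a small overhead, which is one reason the reduction is "nearly" rather than exactly 16×.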


SqueezeNet: Smaller CNN models are easier to deploy on mobile devices. SqueezeNet is a small CNN architecture that achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Together with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510× smaller than AlexNet), which can fully fit in on-chip SRAM, making it easier to deploy on embedded devices.

F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. J. Dally, K. Keutzer, “SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and < 0.5MB Model Size”, arXiv 16.
[model1] [model2] [pdf]


Pruning Winograd Convolution: Winograd’s minimal filtering algorithm and network pruning both reduce the operations in CNNs. Unfortunately, these two methods cannot be directly combined. We propose two modifications to Winograd-based CNNs to enable these methods to exploit sparsity. First, we prune the weights in the “Winograd domain” to exploit static weight sparsity. Second, we move the ReLU operation into the “Winograd domain” to improve the sparsity of the transformed activations. On CIFAR-10, our method reduces the number of multiplications in the VGG-nagadomi model by 10.2× with no loss of accuracy.

X. Liu, J. Pool, S. Han, W. J. Dally, “Efficient Sparse-winograd Convolutional Neural Networks”, ICLR’18.
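
To show where sparsity can help, here is a one-dimensional sketch of Winograd's F(2,3) algorithm, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. The multiplications happen element-wise between the transformed filter and the transformed input, so a zero in either one (a pruned weight or a ReLU-zeroed activation in the Winograd domain) removes a multiplication. The paper works with 2-D convolutions; the 1-D case, the function names, and the multiplication counter are simplifications assumed for this example.

```python
def filter_transform(g):
    """Transform a 3-tap filter into the 4-element Winograd domain."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def input_transform(d):
    """Transform a 4-element input tile into the Winograd domain."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def winograd_output(u, v):
    """Element-wise multiply in the Winograd domain, skipping zeros.

    Returns ([y0, y1], number_of_multiplications_performed).
    """
    products, mults = [], 0
    for ui, vi in zip(u, v):
        if ui == 0 or vi == 0:        # sparse weight or sparse activation
            products.append(0.0)
        else:
            products.append(ui * vi)
            mults += 1
    m1, m2, m3, m4 = products
    return [m1 + m2 + m3, m2 - m3 - m4], mults


d = [1.0, 2.0, 3.0, 4.0]
g = [1.0, 2.0, 3.0]
y, mults = winograd_output(filter_transform(g), input_transform(d))
direct = [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]
print(y, direct, mults)   # the two results should agree
```

Pruning the filter before the transform does not help here: the transform mixes the taps, so a sparse `g` generally yields a dense transformed filter, which is why the pruning is applied after the transform instead.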


Deep Gradient Compression: Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. DGC achieves a gradient compression ratio from 270× to 600× without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet.

Y. Lin, S. Han, H. Mao, Y. Wang, W. J. Dally, “Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training”, ICLR’18.
[pdf] [slides].
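
The core idea of sending only a small fraction of the gradient can be sketched as follows: each worker transmits its largest-magnitude entries and keeps the rest in a local residual that is added to the next gradient, so small updates are delayed rather than lost. This is a bare-bones illustration, not the full method: `sparsify_gradient` and `keep_ratio` are names assumed for this example, and the additional techniques that preserve accuracy in the paper are not modeled.

```python
def sparsify_gradient(grad, residual, keep_ratio):
    """Pick the largest-magnitude accumulated gradients to transmit.

    Returns (sent, new_residual): `sent` maps index -> value to communicate,
    and `new_residual` holds everything that was held back locally.
    """
    accumulated = [r + g for r, g in zip(residual, grad)]
    k = max(1, int(len(accumulated) * keep_ratio))   # always send at least one
    top = sorted(range(len(accumulated)),
                 key=lambda i: abs(accumulated[i]), reverse=True)[:k]
    sent = {i: accumulated[i] for i in top}
    new_residual = [0.0 if i in sent else a
                    for i, a in enumerate(accumulated)]
    return sent, new_residual


residual = [0.0, 0.0, 0.0, 0.0]
for grad in ([0.25, -2.0, 0.5, 0.125], [0.25, 0.0, 0.5, 0.125]):
    sent, residual = sparsify_gradient(grad, residual, keep_ratio=0.25)
    print(sent, residual)
```

In the second step the entry at index 2 is sent even though its individual gradients were small, because the residual let it accumulate past the others.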


  • Trained Ternary Quantization
    Chenzhuo Zhu, Song Han, Huizi Mao, William J. Dally
    International Conference on Learning Representations (ICLR), April 2017.

  • ESE: Efficient Speech Recognition Engine for Sparse LSTM on FPGA
    Song Han, Junlong Kang, Huizi Mao, Yubin Li, Dongliang Xie, Hong Luo, Yu Wang, Huazhong Yang, William J. Dally
    NIPS workshop on Efficient Methods for Deep Neural Networks (EMDNN), Dec 2016, Best Paper Honorable Mention.
    International Symposium on Field-Programmable Gate Arrays (FPGA), Feb 2017, Best Paper Award.

Invited Talks

  • Efficient Methods and Hardware for Deep Learning

    • Faculty interview, MIT, Princeton, UC Berkeley, UT Austin, etc., March 2017

    • PhD thesis defense, Stanford, June 2017

    • Invited talk, Tsinghua University, Peking University, Shanghai Jiaotong University, Fudan University, Zhejiang University, University of Science and Technology of China, Chinese Academy of Sciences, September 2017

  • Deep Compression: A Deep Neural Network Compression Pipeline

    • Conference talk at ICLR, Puerto Rico, May 2016.

    • GPU Technology Conference (GTC), San Jose, March 2016.

    • Google, Mountain View, March 2015.

    • Stanford Computer System Colloquium, January 2016.

    • Baidu, Beijing, December 2015.

    • Huawei, Shanghai, December 2015.

    • Horizon Robotics, Beijing, December 2015.

  • EIE: Efficient Inference Engine on Compressed Deep Neural Network

    • Conference talk at ISCA, Korea, June 2016.

    • Movidius, San Mateo, April 2016.

    • HP Labs, Palo Alto, February 2016.

    • Apple, Cupertino, December 2015.

    • Huawei, Shanghai, December 2015.

    • HiScene, Shanghai, December 2015.

    • Stanford SystemX Fall Conference, Stanford, November 2015.


  • Ph.D. Stanford University, Sep. 2012 to Sep. 2017

  • M.S. Stanford University, Sep. 2012 to Jun. 2014

  • B.S. Tsinghua University, Aug. 2008 to Jul. 2012


  • TPC member for the 23rd IEEE Symposium on High Performance Computer Architecture

  • Reviewer for Journal of Machine Learning Research (JMLR)

  • Reviewer for IEEE Transactions on Neural Networks and Learning Systems (TNNLS)

  • Reviewer for Computer Vision and Image Understanding (CVIU)

  • Reviewer for IEEE Journal of Solid-State Circuits (JSSC)

  • Reviewer for IEEE Micro

  • Reviewer for IEEE Transactions on Computer-Aided Design of Integrated Circuits & Systems (TCAD)

  • Reviewer for ACM Journal on Emerging Technologies in Computing Systems (JETC)

  • Reviewer for IEEE Embedded Systems Letters (ESL)

  • Reviewer for 30th Annual Conference on Neural Information Processing Systems (NIPS)

  • Reviewer for 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)


  • Email: FirstnameLastname [at] mit [dot] edu

  • Office: 50 Vassar Street, Building 38-344, Cambridge, MA