Neural Networks on Silicon
My name is Fengbin Tu. I'm currently pursuing my Ph.D. degree at the Institute of Microelectronics, Tsinghua University, Beijing, China. For more information about me and my research, you can visit my homepage. One of my research interests is architecture design for deep learning. This is an exciting field where fresh ideas come out every day, so I'm collecting works on related topics. Welcome to join us!
Table of Contents

 2014: ASPLOS, MICRO
 2015: ISCA, ASPLOS, FPGA, DAC
 2016: ISSCC, ISCA, MICRO, HPCA, DAC, FPGA, ICCAD, DATE, ASPDAC, VLSI, FPL
 2017: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, FPGA, ICCAD, DATE, VLSI, FCCM, HotChips
 2018: ISSCC, ISCA, MICRO, HPCA, ASPLOS, DAC, FPGA, ICCAD, DATE, ASPDAC, VLSI, HotChips
 2019: ISSCC, ASPDAC
My Contributions
I'm working on energy-efficient architecture design for deep learning. Some featured works are presented here. I hope my new work will come out soon.
[Jun. 2018] A retention-aware neural acceleration (RANA) framework has been designed, which strengthens DNN accelerators with refresh-optimized eDRAM to save total system energy. RANA includes three techniques at the training, scheduling, and architecture levels, respectively.
 RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM. (ISCA'18)
 Training Level: A retention-aware training method is proposed to improve eDRAM's tolerable retention time with no accuracy loss. Bit-level retention errors are injected during training, so the network's tolerance to retention failures is improved. A higher tolerable failure rate leads to a longer tolerable retention time, so more refresh operations can be removed.
 Scheduling Level: A system energy consumption model is built that accounts for computing energy, on-chip buffer access energy, refresh energy, and off-chip memory access energy. RANA schedules networks in a hybrid computation pattern based on this model: each layer is assigned the computation pattern that costs the lowest energy.
 Architecture Level: RANA independently disables refresh for eDRAM banks based on the lifetime of the data they store, saving more refresh energy. A programmable eDRAM controller is proposed to enable these fine-grained refresh controls.
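The training-level idea can be sketched as bit-level error injection into stored values. This is a minimal illustration, assuming 16-bit fixed-point words and a uniform per-bit failure probability; the function name, rates, and error model are mine, not the paper's:

```python
import random

def inject_retention_errors(words, bit_width=16, fail_rate=1e-6, seed=0):
    # Flip each stored bit independently with probability `fail_rate`,
    # a simplified stand-in for eDRAM retention failures.
    rng = random.Random(seed)
    corrupted = []
    for w in words:
        for b in range(bit_width):
            if rng.random() < fail_rate:
                w ^= 1 << b  # bit-level retention error
        corrupted.append(w)
    return corrupted

weights = [0x1234, 0x00FF, 0xABCD]   # quantized values held in eDRAM
noisy = inject_retention_errors(weights, fail_rate=0.05, seed=42)
```

In retention-aware training, such corruption would be applied to tensors held in eDRAM during each training iteration, so the network learns to tolerate a higher failure rate.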
[Apr. 2017] A deep convolutional neural network architecture (DNA) has been designed with 1~2 orders of magnitude higher energy efficiency than state-of-the-art works. I'm trying to further improve the architecture for ultra-low-power computing.
 Deep Convolutional Neural Network Architecture with Reconfigurable Computation Patterns. (TVLSI popular paper)
 This is the first work to assign Input/Output/Weight Reuse to different layers of a CNN, which optimizes system-level energy consumption based on the CONV parameters of each layer.
 A 4-level CONV engine is designed to support different tiling parameters for higher resource utilization and performance.
 A layer-based scheduling framework is proposed to optimize both system-level energy efficiency and performance.
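The layer-based scheduling idea can be illustrated with a toy energy model. Every number below (unit costs, access counts, layer names) is invented for illustration; only the structure, picking the cheapest reuse pattern per layer, mirrors the described framework:

```python
# Toy system energy model; unit costs are illustrative only.
E_BUF, E_DRAM = 1.0, 200.0  # pJ per on-chip buffer / off-chip DRAM access

def layer_energy(accesses):
    buf, dram = accesses
    return buf * E_BUF + dram * E_DRAM

def schedule(layers):
    # Assign each layer the reuse pattern with the lowest modeled energy.
    return {name: min(patterns, key=lambda p: layer_energy(patterns[p]))
            for name, patterns in layers.items()}

# (buffer accesses, DRAM accesses) per candidate reuse pattern, per layer.
layers = {
    "conv1": {"input": (1e6, 5e3), "output": (2e6, 1e3), "weight": (1.5e6, 4e3)},
    "conv5": {"input": (8e5, 9e3), "output": (6e5, 2e3), "weight": (7e5, 8e3)},
}
plan = schedule(layers)
```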
Conference Papers
This is a collection of conference papers that interest me. The emphasis is on, but not limited to, neural networks on silicon. Papers of significance are marked in bold. My comments are marked in italic.
2014 ASPLOS
 DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. (CAS, Inria)
2014 MICRO
 DaDianNao: A Machine-Learning Supercomputer. (CAS, Inria, Inner Mongolia University)
2015 ISCA
 ShiDianNao: Shifting Vision Processing Closer to the Sensor. (CAS, EPFL, Inria)
2015 ASPLOS
 PuDianNao: A Polyvalent Machine Learning Accelerator. (CAS, USTC, Inria)
2015 FPGA
 Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks. (Peking University, UCLA)
2015 DAC
 Reno: A Highly-Efficient Reconfigurable Neuromorphic Computing Accelerator Design. (University of Pittsburgh, Tsinghua University, San Francisco State University, Air Force Research Laboratory, University of Massachusetts)
 Scalable Effort Classifiers for Energy Efficient Machine Learning. (Purdue University, Microsoft Research)
 Design Methodology for Operating in Near-Threshold Computing (NTC) Region. (AMD)
 Opportunistic Turbo Execution in NTC: Exploiting the Paradigm Shift in Performance Bottlenecks. (Utah State University)
2016 DAC
 DeepBurning: Automatic Generation of FPGA-based Learning Accelerators for the Neural Network Family. (Chinese Academy of Sciences)
 Hardware generator: Basic building blocks for neural networks, and an address generation unit (RTL).
 Compiler: Dynamic control flow (configurations for different models), and data layout in memory.
 Simply report their framework and describe some of its stages.
 C-Brain: A Deep Learning Accelerator that Tames the Diversity of CNNs through Adaptive Data-Level Parallelization. (Chinese Academy of Sciences)
 Simplifying Deep Neural Networks for Neuromorphic Architectures. (Incheon National University)
 Dynamic Energy-Accuracy Tradeoff Using Stochastic Computing in Deep Neural Networks. (Samsung, Seoul National University, Ulsan National Institute of Science and Technology)
 Optimal Design of JPEG Hardware under the Approximate Computing Paradigm. (University of Minnesota, TAMU)
 Perform-ML: Performance Optimized Machine Learning by Platform and Content Aware Customization. (Rice University, UCSD)
 Low-Power Approximate Convolution Computing Unit with Domain-Wall Motion Based “Spin-Memristor” for Image Processing Applications. (Purdue University)
 Cross-Layer Approximations for Neuromorphic Computing: From Devices to Circuits and Systems. (Purdue University)
 Switched by Input: Power Efficient Structure for RRAM-based Convolutional Neural Network. (Tsinghua University)
 A 2.2 GHz SRAM with High Temperature Variation Immunity for Deep Learning Application under 28nm. (UCLA, Bell Labs)
2016 ISSCC
 A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems. (KAIST)
 Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. (MIT, NVIDIA)
 A 126.1mW Real-Time Natural UI/UX Processor with Embedded Deep Learning Core for Low-Power Smart Glasses Systems. (KAIST)
 A 502GOPS and 0.984mW Dual-Mode ADAS SoC with RNN-FIS Engine for Intention Prediction in Automotive Black-Box System. (KAIST)
 A 0.55V 1.1mW Artificial-Intelligence Processor with PVT Compensation for Micro Robots. (KAIST)
 A 4Gpixel/s 8/10b H.265/HEVC Video Decoder Chip for 8K Ultra HD Applications. (Waseda University)
2016 ISCA
 Cnvlutin: Ineffectual-Neuron-Free Deep Convolutional Neural Network Computing. (University of Toronto, University of British Columbia)
 EIE: Efficient Inference Engine on Compressed Deep Neural Network. (Stanford University, Tsinghua University)
 Minerva: Enabling Low-Power, High-Accuracy Deep Neural Network Accelerators. (Harvard University)
 Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks. (MIT, NVIDIA)
 Present an energy analysis framework.
 Propose an energy-efficient dataflow called Row Stationary, which considers three levels of reuse.
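As a rough illustration of the Row Stationary idea, here is a minimal sketch in which each PE keeps one filter row stationary and partial-sum rows from vertically adjacent PEs are accumulated; this abstracts away Eyeriss's actual PE array, register files, and NoC:

```python
def pe_row(filter_row, ifmap_row):
    # One PE keeps a filter row stationary in its local register file and
    # slides it across an ifmap row, emitting one row of partial sums.
    k = len(filter_row)
    return [sum(filter_row[i] * ifmap_row[j + i] for i in range(k))
            for j in range(len(ifmap_row) - k + 1)]

def conv2d_row_stationary(filt, ifmap):
    # A KxK filter occupies K PEs per output row; their psum rows are
    # accumulated vertically across the (abstracted) PE array.
    K = len(filt)
    out = []
    for r in range(len(ifmap) - K + 1):
        psum_rows = [pe_row(filt[i], ifmap[r + i]) for i in range(K)]
        out.append([sum(col) for col in zip(*psum_rows)])
    return out
```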
 Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D Memory. (Georgia Institute of Technology, SRI International)
 Propose an architecture integrated in 3D DRAM, with a mesh-like NoC in the logic layer.
 Describe the data movements in the NoC in detail.
 ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars. (University of Utah, HP Labs)
 An advance over ISAAC has been published in "Newton: Gravitating Towards the Physical Limits of Crossbar Acceleration" (IEEE Micro).
 A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. (UCSB, HP Labs, NVIDIA, Tsinghua University)
 RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision. (Rice University)
 Cambricon: An Instruction Set Architecture for Neural Networks. (Chinese Academy of Sciences, UCSB)
2016 DATE
 The Neuro Vector Engine: Flexibility to Improve Convolutional Network Efficiency for Wearable Vision. (Eindhoven University of Technology, Soochow University, TU Berlin)
 Propose a SIMD accelerator for CNNs.
 Efficient FPGA Acceleration of Convolutional Neural Networks Using Logical-3D Compute Array. (UNIST, Seoul National University)
 The compute tile is organized along 3 dimensions: Tm, Tr, Tc.
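A (Tm, Tr, Tc) organization, output channels by output rows by output columns, can be sketched as a loop nest that enumerates output tiles. The helper below is illustrative only and reports tile bounds instead of doing the MACs:

```python
def tiled_output_loops(M, R, C, Tm, Tr, Tc):
    # Enumerate output tiles for a compute array organized along
    # (output channel, row, column) = (Tm, Tr, Tc).
    tiles = []
    for m0 in range(0, M, Tm):          # output channel tiles
        for r0 in range(0, R, Tr):      # output row tiles
            for c0 in range(0, C, Tc):  # output column tiles
                tiles.append(((m0, min(m0 + Tm, M)),
                              (r0, min(r0 + Tr, R)),
                              (c0, min(c0 + Tc, C))))
    return tiles

# Made-up layer shape: 8 output channels, 6x6 output map, 4x3x3 tiles.
tiles = tiled_output_loops(8, 6, 6, Tm=4, Tr=3, Tc=3)
```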
 NEURODSP: A Multi-Purpose Energy-Optimized Accelerator for Neural Networks. (CEA LIST)
 MNSIM: Simulation Platform for Memristor-Based Neuromorphic Computing System. (Tsinghua University, UCSB, Arizona State University)
 Accelerated Artificial Neural Networks on FPGA for Fault Detection in Automotive Systems. (Nanyang Technological University, University of Warwick)
 Significance Driven Hybrid 8T-6T SRAM for Energy-Efficient Synaptic Storage in Artificial Neural Networks. (Purdue University)
2016 FPGA
 Going Deeper with Embedded FPGA Platform for Convolutional Neural Network. [Slides][Demo] (Tsinghua University, MSRA)
 The first work I have seen that runs the entire CNN flow, including both CONV and FC layers.
 Point out that CONV layers are computation-centric, while FC layers are memory-centric.
 The FPGA runs VGG16-SVD without reconfiguring its resources, but the convolver can only support k=3.
 Dynamic-precision data quantization is creative, but not implemented on hardware.
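Dynamic-precision quantization can be sketched as picking, per layer, the fractional bit length that minimizes quantization error under a fixed word length. The functions below are my simplification (8-bit words, symmetric saturation), not the paper's exact scheme:

```python
def quantize(x, frac_bits, word=8):
    # Fixed-point quantization: `word` total bits, `frac_bits` of them
    # fractional, with symmetric saturation.
    scale = 1 << frac_bits
    lo, hi = -(1 << (word - 1)), (1 << (word - 1)) - 1
    return max(lo, min(hi, round(x * scale))) / scale

def best_frac_bits(values, word=8):
    # Per-layer dynamic precision: choose the fractional length that
    # minimizes total quantization error over this layer's values.
    def err(f):
        return sum(abs(v - quantize(v, f, word)) for v in values)
    return min(range(word), key=err)
```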
 Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks. [Slides] (Arizona State Univ, ARM)
 Spatially allocate FPGA's resources to CONV/POOL/NORM/FC layers.
2016 ASPDAC
 Design Space Exploration of FPGA-Based Deep Convolutional Neural Networks. (UC Davis)
 LRADNN: High-Throughput and Energy-Efficient Deep Neural Network Accelerator using Low Rank Approximation. (Hong Kong University of Science and Technology, Shanghai Jiao Tong University)
 Efficient Embedded Learning for IoT Devices. (Purdue University)
 ACR: Enabling Computation Reuse for Approximate Computing. (Chinese Academy of Sciences)
2016 VLSI
 A 0.3‐2.6 TOPS/W Precision‐Scalable Processor for Real‐Time Large‐Scale ConvNets. (KU Leuven)
 Use dynamic precision for different CONV layers, and scale down the MAC array's supply voltage at lower precision.
 Prevent memory fetches and MAC operations based on the ReLU sparsity.
 A 1.40mm2 141mW 898GOPS Sparse Neuromorphic Processor in 40nm CMOS. (University of Michigan)
 A 58.6mW Real-Time Programmable Object Detector with Multi-Scale Multi-Object Support Using Deformable Parts Model on 1920x1080 Video at 30fps. (MIT)
 A Machinelearning Classifier Implemented in a Standard 6T SRAM Array. (Princeton)
2016 ICCAD
 Efficient Memory Compression in Deep Neural Networks Using Coarse-Grain Sparsification for Speech Applications. (Arizona State University)
 Memsqueezer: Re-architecting the On-chip Memory Subsystem of Deep Learning Accelerator for Embedded Devices. (Chinese Academy of Sciences)
 Caffeine: Towards Uniformed Representation and Acceleration for Deep Convolutional Neural Networks. (Peking University, UCLA, Falcon)
 Propose a uniformed convolutional matrix-multiplication representation for accelerating CONV and FC layers on FPGA.
 Propose a weight-major convolutional mapping method for FC layers, which has good data reuse, DRAM access burst length and effective bandwidth.
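The uniform matrix-multiplication view of CONV can be sketched with a single-channel im2col lowering. This only illustrates the representation, not Caffeine's FPGA mapping or the weight-major layout:

```python
def im2col(ifmap, k):
    # Unfold kxk input patches into matrix rows so that CONV becomes a
    # plain matrix multiply (single input channel, stride 1).
    H, W = len(ifmap), len(ifmap[0])
    return [[ifmap[r + i][c + j] for i in range(k) for j in range(k)]
            for r in range(H - k + 1) for c in range(W - k + 1)]

def conv_as_matmul(ifmap, filt):
    # Flatten the filter once, then reuse the same multiply-accumulate
    # kernel an FC layer would use.
    k = len(filt)
    w = [filt[i][j] for i in range(k) for j in range(k)]
    return [sum(a * b for a, b in zip(row, w)) for row in im2col(ifmap, k)]
```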
 BoostNoC: Power Efficient Network-on-Chip Architecture for Near Threshold Computing. (Utah State University)
 Design of Power-Efficient Approximate Multipliers for Approximate Artificial Neural Network. (Brno University of Technology)
 Neural Networks Designing Neural Networks: Multi-Objective Hyper-Parameter Optimization. (McGill University)
2016 MICRO
 From High-Level Deep Neural Models to FPGAs. (Georgia Institute of Technology, Intel)
 Develop a macro dataflow ISA for DNN accelerators.
 Develop handoptimized template designs that are scalable and highly customizable.
 Provide a Template Resource Optimization search algorithm to co-optimize the accelerator architecture and scheduling.
 vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design. (NVIDIA)
 Stripes: Bit-Serial Deep Neural Network Computing. (University of Toronto, University of British Columbia)
 Introduce serial computation and reduced-precision computation to neural network accelerator designs, enabling accuracy vs. performance tradeoffs.
 Design a bit-serial computing unit that scales performance linearly with precision reduction.
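The bit-serial idea can be sketched in a few lines: activations are consumed one bit per cycle, so the cycle count, and hence performance, scales with the chosen precision p. This is a functional model only, with made-up operand values:

```python
def bit_serial_dot(weights, activations, p):
    # Functional model of a bit-serial unit: activations are consumed one
    # bit per cycle, so cycle count scales linearly with precision p.
    acc = 0
    for bit in range(p):  # one cycle per activation bit
        partial = sum(w * ((a >> bit) & 1) for w, a in zip(weights, activations))
        acc += partial << bit  # weight the partial sum by bit position
    return acc, p  # (result, cycles)

result, cycles = bit_serial_dot([3, 2], [5, 1], p=4)
```

Dropping to p=3 still yields the exact result here because both activations fit in 3 bits; the tradeoff appears once p is below the operands' true precision.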
 Cambricon-X: An Accelerator for Sparse Neural Networks. (Chinese Academy of Sciences)
 NEUTRAMS: Neural Network Transformation and Co-design under Neuromorphic Hardware Constraints. (Tsinghua University, UCSB)
 Fused-Layer CNN Accelerators. (Stony Brook University)
 Fuse multiple CNN layers (CONV+POOL) to reduce DRAM accesses for input/output data.
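A 1-D sketch of layer fusion, assuming a 2-tap filter and 2-wide max pooling (both made up): the CONV output feeds POOL directly on chip, so only the pooled result would need to go to DRAM:

```python
def conv1d(x, w):
    k = len(w)
    return [sum(w[i] * x[j + i] for i in range(k)) for j in range(len(x) - k + 1)]

def pool2(x):
    # 2-wide, stride-2 max pooling.
    return [max(x[i], x[i + 1]) for i in range(0, len(x) - 1, 2)]

def fused_conv_pool(x, w):
    # Fused evaluation: the CONV tile stays on chip and feeds POOL
    # directly; only the pooled output would be written to DRAM.
    return pool2(conv1d(x, w))
```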
 Bridging the I/O Performance Gap for Big Data Workloads: A New NVDIMM-based Approach. (The Hong Kong Polytechnic University, NSF/University of Florida)
 A Patch Memory System For Image Processing and Computer Vision. (NVIDIA)
 An Ultra Low-Power Hardware Accelerator for Automatic Speech Recognition. (Universitat Politecnica de Catalunya)
 Perceptron Learning for Reuse Prediction. (TAMU, Intel Labs)
 Train neural networks to predict reuse of cache blocks.
 A Cloud-Scale Acceleration Architecture. (Microsoft Research)
 Reducing Data Movement Energy via Online Data Clustering and Encoding. (University of Rochester)
 The Microarchitecture of a Real-time Robot Motion Planning Accelerator. (Duke University)
 Chameleon: Versatile and Practical Near-DRAM Acceleration Architecture for Large Memory Systems. (UIUC, Seoul National University)
2016 FPL
 A High Performance FPGA-based Accelerator for Large-Scale Convolutional Neural Network. (Fudan University)
 Overcoming Resource Underutilization in Spatial CNN Accelerators. (Stony Brook University)
 Build multiple accelerators, each specialized for specific CNN layers, instead of a single accelerator with uniform tiling parameters.
 Accelerating Recurrent Neural Networks in Analytics Servers: Comparison of FPGA, CPU, GPU, and ASIC. (Intel)
2016 HPCA
 A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs. (Nanyang Technological University, HKUST, Cornell University)
 TABLA: A Unified Template-based Architecture for Accelerating Statistical Machine Learning. (Georgia Institute of Technology)
 Memristive Boltzmann Machine: A Hardware Accelerator for Combinatorial Optimization and Deep Learning. (University of Rochester)
2017 FPGA
 An OpenCL Deep Learning Accelerator on Arria 10. (Intel)
 Minimum bandwidth requirement: All the intermediate data in AlexNet's CONV layers are cached in the on-chip buffer, so the architecture is compute-bound.
 Reduced operations: Winograd transformation.
 High usage of the available DSPs + reduced computation -> higher performance on FPGA -> competitive efficiency vs. Titan X.
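The Winograd transformation can be illustrated with the classic F(2,3) case, which produces two outputs of a 3-tap 1-D convolution with 4 multiplications instead of 6; this shows the arithmetic only, not the 2-D tiled FPGA implementation:

```python
def winograd_f23(d, g):
    # Winograd F(2,3): two outputs of a 3-tap 1-D convolution using
    # 4 multiplications (m1..m4) instead of 6.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

The filter-side transforms (g0+g1+g2)/2 and (g0-g1+g2)/2 can be precomputed once per filter, so at runtime only the four multiplies remain.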
 ESE: Efficient Speech Recognition Engine for Compressed LSTM on FPGA. (Stanford University, DeepPhi, Tsinghua University, NVIDIA)
 FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. (Xilinx, Norwegian University of Science and Technology, University of Sydney)
 Can FPGA Beat GPUs in Accelerating NextGeneration Deep Neural Networks? (Intel)
 Accelerating Binarized Convolutional Neural Networks with SoftwareProgrammable FPGAs. (Cornell University, UCLA, UCSD)
 Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network. (UW-Madison)
 Frequency Domain Acceleration of Convolutional Neural Networks on CPUFPGA Shared Memory System. (USC)
 Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks. (Arizona State University)
2017 ISSCC
 A 2.9TOPS/W Deep Convolutional Neural Network SoC in FDSOI 28nm for Intelligent Embedded Systems. (ST)
 DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks. (KAIST)
 ENVISION: A 0.26-to-10TOPS/W Subword-Parallel Computational Accuracy-Voltage-Frequency-Scalable Convolutional Neural Network Processor in 28nm FDSOI. (KU Leuven)
 A 288µW Programmable Deep-Learning Processor with 270KB On-Chip Weight Storage Using Non-Uniform Memory Hierarchy for Mobile Intelligence. (University of Michigan, CubeWorks)
 A 28nm SoC with a 1.2GHz 568nJ/Prediction Sparse Deep-Neural-Network Engine with >0.1 Timing Error Rate Tolerance for IoT Applications. (Harvard)
 A Scalable Speech Recognizer with Deep-Neural-Network Acoustic Models and Voice-Activated Power Gating. (MIT)
 A 0.62mW Ultra-Low-Power Convolutional-Neural-Network Face Recognition Processor and a CIS Integrated with Always-On Haar-Like Face Detector. (KAIST)
2017 HPCA
 FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks. (Chinese Academy of Sciences)
 PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning. (University of Pittsburgh, University of Southern California)
 Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures. (University of Florida)
 Satisfaction of CNN (SoC) is the combination of SoC-time, SoC-accuracy and energy consumption.
 The PCNN framework is composed of offline compilation and runtime management.
 Offline compilation: Generally optimizes the runtime, and generates scheduling configurations for the runtime stage.
 Runtime management: Generates tuning tables through accuracy tuning, and calibrates accuracy + runtime (selecting the best tuning table) during long-term execution.
 Supporting Address Translation for Accelerator-Centric Architectures. (UCLA)
2017 ASPLOS
 Tetris: Scalable and Efficient Neural Network Acceleration with 3D Memory. (Stanford University)
 Move accumulation operations close to the DRAM banks.
 Develop a hybrid partitioning scheme that parallelizes the NN computations over multiple accelerators.
 SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing. (Syracuse University, USC, The City College of New York)
2017 ISCA
 Maximizing CNN Accelerator Efficiency Through Resource Partitioning. (Stony Brook University)
 An extension of their FPL'16 paper.
 In-Datacenter Performance Analysis of a Tensor Processing Unit. (Google)
 SCALEDEEP: A Scalable Compute Architecture for Learning and Evaluating Deep Networks. (Purdue University, Intel)
 Propose a full-system (server node) architecture, focusing on the challenge of DNN training (intra- and inter-layer heterogeneity).
 SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks. (NVIDIA, MIT, UC Berkeley, Stanford University)
 Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism. (University of Michigan, ARM)
 Understanding and Optimizing Asynchronous Low-Precision Stochastic Gradient Descent. (Stanford)
 LogCA: A High-Level Performance Model for Hardware Accelerators. (AMD, University of Wisconsin-Madison)
 APPROX-NoC: A Data Approximation Framework for Network-On-Chip Architectures. (TAMU)
2017 FCCM
 Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer. (Stony Brook University)
 Customizing Neural Networks for Efficient FPGA Implementation.
 Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs.
 FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates. (Peking University, HKUST, MSRA, UCLA)
 Compute-intensive part: RTL-based generalized matrix multiplication kernel.
 Layer-specific part: HLS-based control logic.
 Memory-intensive part: Several techniques for lower DRAM bandwidth requirements.
 FPGA accelerated Dense Linear Machine Learning: A PrecisionConvergence Tradeoff.
 A Configurable FPGA Implementation of the Tanh Function using DCT Interpolation.
2017 DAC
 Deep^3: Leveraging Three Levels of Parallelism for Efficient Deep Learning. (UCSD, Rice)
 Real-Time meets Approximate Computing: An Elastic Deep Learning Accelerator Design with Adaptive Tradeoff between QoS and QoR. (CAS)
 I'm not sure whether the proposed tuning scenario and direction are reasonable enough to find feasible solutions.
 Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs. (PKU, CUHK, SenseTime)
 Hardware-Software Codesign of Highly Accurate, Multiplier-free Deep Neural Networks. (Brown University)
 A Kernel Decomposition Architecture for Binary-weight Convolutional Neural Networks. (KAIST)
 Design of An Energy-Efficient Accelerator for Training of Convolutional Neural Networks using Frequency-Domain Computation. (Georgia Tech)
 New Stochastic Computing Multiplier and Its Application to Deep Neural Networks. (UNIST)
 TIME: A Training-in-memory Architecture for Memristor-based Deep Neural Networks. (THU, UCSB)
 Fault-Tolerant Training with On-Line Fault Detection for RRAM-Based Neural Computing Systems. (THU, Duke)
 Automating the systolic array generation and optimizations for high throughput convolution neural network. (PKU, UCLA, Falcon)
 Towards Full-System Energy-Accuracy Tradeoffs: A Case Study of An Approximate Smart Camera System. (Purdue)
 Synergistically tunes component-level approximation knobs to achieve system-level energy-accuracy tradeoffs.
 Error Propagation Aware Timing Relaxation For Approximate Near Threshold Computing. (KIT)
 RESPARC: A Reconfigurable and Energy-Efficient Architecture with Memristive Crossbars for Deep Spiking Neural Networks. (Purdue)
 Rescuing Memristorbased Neuromorphic Design with High Defects. (University of Pittsburgh, HP Lab, Duke)
 Group Scissor: Scaling Neuromorphic Computing Design to Big Neural Networks. (University of Pittsburgh, Duke)
 Towards Aging-induced Approximations. (KIT, UT Austin)
 SABER: Selection of Approximate Bits for the Design of Error Tolerant Circuits. (University of Minnesota, TAMU)
 On Quality Tradeoff Control for Approximate Computing using Iterative Training. (SJTU, CUHK)
2017 DATE
 DVAFS: Trading Computational Accuracy for Energy Through Dynamic-Voltage-Accuracy-Frequency-Scaling. (KU Leuven)
 Accelerator-friendly Neural-network Training: Learning Variations and Defects in RRAM Crossbar. (Shanghai Jiao Tong University, University of Pittsburgh, Lynmax Research)
 A Novel Zero Weight/Activation-Aware Hardware Architecture of Convolutional Neural Network. (Seoul National University)
 Solve the zero-induced load imbalance problem.
 Understanding the Impact of Precision Quantization on the Accuracy and Energy of Neural Networks. (Brown University)
 Design Space Exploration of FPGA Accelerators for Convolutional Neural Networks. (Samsung, UNIST, Seoul National University)
 MoDNN: Local Distributed Mobile Computing System for Deep Neural Network. (University of Pittsburgh, George Mason University, University of Maryland)
 Chain-NN: An Energy-Efficient 1D Chain Architecture for Accelerating Deep Convolutional Neural Networks. (Waseda University)
 LookNN: Neural Network with No Multiplication. (UCSD)
 Cluster weights and use a LUT to avoid multiplications.
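The cluster-and-lookup idea can be sketched as follows, assuming three weight centroids and 2-bit quantized activations (both sizes are made up): products are precomputed once, and each runtime multiply becomes a table lookup:

```python
centroids = [-1.0, 0.5, 2.0]   # hypothetical cluster centers after weight clustering
act_levels = range(4)          # quantized activation codes 0..3

# Precompute every centroid x activation product once.
lut = {(ci, a): centroids[ci] * a
       for ci in range(len(centroids)) for a in act_levels}

def lut_dot(weight_ids, acts):
    # Each weight is stored only as its cluster id; no multiplier needed.
    return sum(lut[(wid, a)] for wid, a in zip(weight_ids, acts))
```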
 Energy-Efficient Approximate Multiplier Design using Bit Significance-Driven Logic Compression. (Newcastle University)
 Revamping Timing Error Resilience to Tackle Choke Points at NTC Systems. (Utah State University)
2017 VLSI
 A 3.43TOPS/W 48.9pJ/Pixel 50.1nJ/Classification 512 Analog Neuron Sparse Coding Neural Network with OnChip Learning and Classification in 40nm CMOS. (University of Michigan, Intel)
 BRein Memory: A 13Layer 4.2 K Neuron/0.8 M Synapse Binary/Ternary Reconfigurable InMemory Deep Neural Network Accelerator in 65 nm CMOS. (Hokkaido University, Tokyo Institute of Technology, Keio University)
 A 1.06-to-5.09 TOPS/W Reconfigurable Hybrid-Neural-Network Processor for Deep Learning Applications. (Tsinghua University)
 A 127mW 1.63TOPS sparse spatiotemporal cognitive SoC for action classification and motion tracking in videos. (University of Michigan)
2017 ICCAD
 AEP: An Error-bearing Neural Network Accelerator for Energy Efficiency and Model Protection. (University of Pittsburgh)
 VoCaM: Visualization-oriented convolutional neural network acceleration on mobile system. (George Mason University, Duke)
 AdaLearner: An Adaptive Distributed Mobile Learning System for Neural Networks. (Duke)
 MeDNN: A Distributed Mobile System with Enhanced Partition and Deployment for Large-Scale DNNs. (Duke)
 TraNNsformer: Neural Network Transformation for Memristive Crossbar based Neuromorphic System Design. (Purdue)
 A Closedloop Design to Enhance Weight Stability of Memristor Based Neural Network Chips. (Duke)
 Fault injection attack on deep neural network. (CUHK)
 ORCHARD: Visual Object Recognition Accelerator Based on Approximate In-Memory Processing. (UCSD)
2017 HotChips
 A Dataflow Processing Chip for Training Deep Neural Networks. (Wave Computing)
 Brainwave: Accelerating Persistent Neural Networks at Datacenter Scale. (Microsoft)
 DNN ENGINE: A 16nm Sub-uJ Deep Neural Network Inference Accelerator for the Embedded Masses. (Harvard, ARM)
 DNPU: An EnergyEfficient Deep Neural Network Processor with OnChip Stereo Matching. (KAIST)
 Evaluation of the Tensor Processing Unit (TPU): A Deep Neural Network Accelerator for the Datacenter. (Google)
 NVIDIA’s Volta GPU: Programmability and Performance for GPU Computing. (NVIDIA)
 Knights Mill: Intel Xeon Phi Processor for Machine Learning. (Intel)
 XPU: A programmable FPGA Accelerator for diverse workloads. (Baidu)
2017 MICRO
 Bit-Pragmatic Deep Neural Network Computing. (NVIDIA, University of Toronto)
 CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices. (Syracuse University, City University of New York, USC, California State University, Northeastern University)
 Addressing Compute and Memory Bottlenecks for DNN Execution on GPUs. (University of Michigan)
 DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. (UCSB, Samsung)
 Scale-Out Acceleration for Machine Learning. (Georgia Tech, UCSD)
 Propose CoSMIC, a full computing stack comprising a language, compiler, system software, template architecture, and circuit generators, that enables programmable acceleration of learning at scale.
 DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission. (Univ. of Michigan, Univ. of Nevada)
 Data Movement Aware Computation Partitioning. (PSU, TOBB University of Economics and Technology)
 Partition computation on a many-core system for near-data processing.
2018 ASPDAC
 ReGAN: A Pipelined ReRAM-Based Accelerator for Generative Adversarial Networks. (University of Pittsburgh, Duke)
 Accelerator-centric Deep Learning Systems for Enhanced Scalability, Energy-efficiency, and Programmability. (POSTECH)
 Architectures and Algorithms for User Customization of CNNs. (Seoul National University, Samsung)
 Optimizing FPGA-based Convolutional Neural Networks Accelerator for Image Super-Resolution. (Sogang University)
 Running sparse and low-precision neural network: when algorithm meets hardware. (Duke)
2018 ISSCC
 A 55nm Time-Domain Mixed-Signal Neuromorphic Accelerator with Stochastic Synapses and Embedded Reinforcement Learning for Autonomous Micro-Robots. (Georgia Tech)
 A Shift Towards Edge Machine-Learning Processing. (Google)
 QUEST: A 7.49TOPS Multi-Purpose Log-Quantized DNN Inference Engine Stacked on 96MB 3D SRAM Using Inductive-Coupling Technology in 40nm CMOS. (Hokkaido University, Ultra Memory, Keio University)
 UNPU: A 50.6TOPS/W Unified Deep Neural Network Accelerator with 1b-to-16b Fully-Variable Weight Bit-Precision. (KAIST)
 A 9.02mW CNN-Stereo-Based Real-Time 3D Hand-Gesture Recognition Processor for Smart Mobile Devices. (KAIST)
 An Always-On 3.8μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28nm CMOS. (Stanford, KU Leuven)
 Conv-RAM: An Energy-Efficient SRAM with Embedded Convolution Computation for Low-Power CNN-Based Machine Learning Applications. (MIT)
 A 42pJ/Decision 3.12TOPS/W Robust In-Memory Machine Learning Classifier with On-Chip Training. (UIUC)
 Brain-Inspired Computing Exploiting Carbon Nanotube FETs and Resistive RAM: Hyperdimensional Computing Case Study. (Stanford, UC Berkeley, MIT)
 A 65nm 1Mb Nonvolatile Computing-in-Memory ReRAM Macro with Sub-16ns Multiply-and-Accumulate for Binary DNN AI Edge Processors. (NTHU)
 A 65nm 4Kb Algorithm-Dependent Computing-in-Memory SRAM Unit Macro with 2.3ns and 55.8TOPS/W Fully Parallel Product-Sum Operation for Binary DNN Edge Processors. (NTHU, TSMC, UESTC, ASU)
 A 1μW Voice Activity Detector Using Analog Feature Extraction and Digital Deep Neural Network. (Columbia University)
2018 HPCA
 Making Memristive Neural Network Accelerators Reliable. (University of Rochester)
 Towards Efficient Microarchitectural Design for Accelerating Unsupervised GAN-based Deep Learning. (University of Florida)
 Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks. (POSTECH, NVIDIA, UT-Austin)
 In-situ AI: Towards Autonomous and Incremental Deep Learning for IoT Systems. (University of Florida, Chongqing University, Capital Normal University)
 RC-NVM: Enabling Symmetric Row and Column Memory Accesses for In-Memory Databases. (PKU, NUDT, Duke, UCLA, PSU)
 GraphR: Accelerating Graph Processing Using ReRAM. (Duke, USC, Binghamton University SUNY)
 GraphP: Reducing Communication of PIM-based Graph Processing with Efficient Data Partition. (THU, USC, Stanford)
 PM3: Power Modeling and Power Management for ProcessinginMemory. (PKU)
2018 ASPLOS
 Bridging the Gap Between Neural Networks and Neuromorphic Hardware with A Neural Network Compiler. (Tsinghua, UCSB)
 MAERI: Enabling Flexible Dataflow Mapping over DNN Accelerators via Reconfigurable Interconnects. (Georgia Tech)
 Higher PE utilization: Use an augmented reduction tree (reconfigurable interconnects) to construct arbitrary-sized virtual neurons.
 VIBNN: Hardware Acceleration of Bayesian Neural Networks. (Syracuse University, USC)
 Exploiting Dynamical Thermal Energy Harvesting for Reusing in Smartphone with Mobile Applications. (Guizhou University, University of Florida)
 Potluck: Cross-application Approximate Deduplication for Computation-Intensive Mobile Applications. (Yale)
2018 VLSI
 STICKER: A 0.41‐62.1 TOPS/W 8bit Neural Network Processor with Multi‐Sparsity Compatible Convolution Arrays and Online Tuning Acceleration for Fully Connected Layers. (THU)
 2.9TOPS/W Reconfigurable Dense/Sparse Matrix‐Multiply Accelerator with Unified INT8/INT16/FP16 Datapath in 14nm Tri‐gate CMOS. (Intel)
 A Scalable Multi‐TeraOPS Deep Learning Processor Core for AI Training and Inference. (IBM)
 An Ultra‐high Energy‐efficient reconfigurable Processor for Deep Neural Networks with Binary/Ternary Weights in 28nm CMOS. (THU)
 B‐Face: 0.2 mW CNN‐Based Face Recognition Processor with Face Alignment for Mobile User Identification. (KAIST)
 A 141 uW, 2.46 pJ/Neuron Binarized Convolutional Neural Network based Selflearning Speech Recognition Processor in 28nm CMOS. (THU)
 A Mixed-Signal Binarized Convolutional-Neural-Network Accelerator Integrating Dense Weight Storage and Multiplication for Reduced Data Movement. (Princeton)
 PhaseMAC: A 14 TOPS/W 8bit GRO based Phase Domain MAC Circuit for In‐Sensor‐Computed Deep Learning Accelerators. (Toshiba)
2018 FPGA
 C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs. (Peking Univ, Syracuse Univ, CUNY)
 DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. (ETHZ, BenevolentAI)
 Towards a Uniform Templatebased Architecture for Accelerating 2D and 3D CNNs on FPGA. (National Univ of Defense Tech)
 A Customizable Matrix Multiplication Framework for the Intel HARPv2 Xeon+FPGA Platform - A Deep Learning Case Study. (The Univ of Sydney, Intel)
 A Framework for Generating High Throughput CNN Implementations on FPGAs. (USC)
 Liquid Silicon: A DataCentric Reconfigurable Architecture enabled by RRAM Technology. (UW Madison)
2018 ISCA
 RANA: Towards Efficient Neural Acceleration with Refresh-Optimized Embedded DRAM. (THU)
 Brainwave: A Configurable Cloud-Scale DNN Processor for Real-Time AI. (Microsoft)
 PROMISE: An End-to-End Design of a Programmable Mixed-Signal Accelerator for Machine Learning Algorithms. (UIUC)
 Computation Reuse in DNNs by Exploiting Input Similarity. (UPC)
 GANAX: A Unified SIMDMIMD Acceleration for Generative Adversarial Network. (Georiga Tech, IPM, Qualcomm, UCSD, UIUC)
 SnaPEA: Predictive Early Activation for Reducing Computation in Deep Convolutional Neural Networks. (UCSD, Georgia Tech, Qualcomm)
 UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition. (UIUC, NVIDIA)
 An EnergyEfficient Neural Network Accelerator based on OutlierAware Low Precision Computation. (Seoul National)
 Prediction based Execution on Deep Neural Networks. (Florida)
 Bit Fusion: BitLevel Dynamically Composable Architecture for Accelerating Deep Neural Networks. (Georgia Tech, ARM, UCSD)
 Gist: Efficient Data Encoding for Deep Neural Network Training. (Michigan, Microsoft, Toronto)
 The Dark Side of DNN Pruning. (UPC)
 Neural Cache: BitSerial InCache Acceleration of Deep Neural Networks. (Michigan)
 EVA^2: Exploiting Temporal Redundancy in Live Computer Vision. (Cornell)
 Euphrates: AlgorithmSoC CoDesign for LowPower Mobile Continuous Vision. (Rochester, Georgia Tech, ARM)
 FeatureDriven and Spatially Folded Digital Neurons for Efficient Spiking Neural Network Simulations. (POSTECH/Berkeley, Seoul National)
 SpaceTime Algebra: A Model for Neocortical Computation. (Wisconsin)
 Scaling Datacenter Accelerators With ComputeReuse Architectures. (Princeton)
 Adds an NVM-based storage layer to the accelerator for computation reuse.
 Enabling Scientific Computing on Memristive Accelerators. (Rochester)
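Several of the ISCA'18 papers above (computation reuse by input similarity, compute-reuse architectures) share one core idea: memoize a layer's output and return it when a near-identical input recurs. A minimal sketch of that idea, with the class, quantization granularity, and stand-in "layer" all being our illustrative assumptions rather than any paper's design:

```python
import numpy as np

class ReuseCache:
    """Toy computation-reuse wrapper: quantize the input to form a key,
    and if that key was seen before, return the stored output instead
    of recomputing the wrapped function."""
    def __init__(self, fn, frac_bits=2):
        self.fn = fn
        self.scale = 2 ** frac_bits   # coarser scale -> more reuse, less fidelity
        self.table, self.hits = {}, 0

    def __call__(self, x):
        key = tuple(np.round(x * self.scale).astype(int))
        if key in self.table:
            self.hits += 1            # similar input seen before: reuse
        else:
            self.table[key] = self.fn(x)
        return self.table[key]

layer = ReuseCache(lambda x: np.maximum(x, 0.0))  # stand-in for a real layer
y1 = layer(np.array([0.50, -0.2]))
y2 = layer(np.array([0.51, -0.2]))  # similar input quantizes to the same key
assert layer.hits == 1 and np.array_equal(y1, y2)
```

The accelerator versions make the same trade in hardware: a lookup (here, the dict) is far cheaper than recomputing the layer, at the cost of a bounded approximation from quantizing the key.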
2018 DATE
 MATIC: Learning Around Errors for Efficient Low-Voltage Neural Network Accelerators. (University of Washington)
 Learns around errors resulting from SRAM voltage scaling, demonstrated on a fabricated 65nm test chip.
 Maximizing System Performance by Balancing Computation Loads in LSTM Accelerators. (POSTECH)
 A sparse matrix format that load-balances computation, demonstrated for LSTMs.
 CCR: A Concise Convolution Rule for Sparse Neural Network Accelerators. (CAS)
 Decomposes convolution into multiple dense and zero kernels for sparsity savings.
 Block Convolution: Towards Memory-Efficient Inference of Large-Scale CNNs on FPGA. (CAS)
 moDNN: Memory-Optimal DNN Training on GPUs. (University of Notre Dame, CAS)
 HyperPower: Power- and Memory-Constrained Hyper-Parameter Optimization for Neural Networks. (CMU, Google)
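The "learning around errors" idea behind MATIC (and the retention-aware training in RANA above) can be illustrated with a small fault-injection helper: flip weight bits at the rate expected under aggressive voltage scaling, then train with the faulty weights so the network becomes tolerant. A sketch of just the injection step; the function name and int8 weight format are our assumptions, not the paper's:

```python
import numpy as np

def inject_bit_errors(weights, bit_error_rate, rng):
    """Flip each bit of an int8 weight tensor independently with
    probability `bit_error_rate`, emulating SRAM cells failing at
    reduced supply voltage (illustrative helper, not MATIC's code)."""
    w = weights.view(np.uint8)                       # reinterpret bits
    flips = rng.random((w.size, 8)) < bit_error_rate # per-bit failures
    masks = (flips * (1 << np.arange(8))).sum(axis=1).astype(np.uint8)
    return (w ^ masks.reshape(w.shape)).view(np.int8)

rng = np.random.default_rng(0)
w = rng.integers(-128, 128, size=(4, 4), dtype=np.int8)
assert np.array_equal(inject_bit_errors(w, 0.0, rng), w)   # rate 0: unchanged
assert np.array_equal(inject_bit_errors(w, 1.0, rng), ~w)  # rate 1: all bits flip
```

In a training loop one would apply this corruption to the weights on every forward pass, so gradients steer the network toward solutions that survive the expected failure rate.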
2018 DAC
 Compensated-DNN: Energy Efficient Low-Precision Deep Neural Networks by Compensating Quantization Errors. (Best Paper, Purdue, IBM)
 Introduces a new fixed-point representation, Fixed Point with Error Compensation (FPEC): computation bits, plus compensation bits that represent the quantization error.
 Proposes a low-overhead sparse compensation scheme to estimate the error in the MAC design.
 Calibrating Process Variation at System Level with In-Situ Low-Precision Transfer Learning for Analog Neural Network Processors. (THU)
 DPS: Dynamic Precision Scaling for Stochastic Computing-Based Deep Neural Networks. (UNIST)
 DyHard-DNN: Even More DNN Acceleration with Dynamic Hardware Reconfiguration. (Univ. of Virginia)
 Exploring the Programmability for Deep Learning Processors: from Architecture to Tensorization. (Univ. of Washington)
 LCP: Layer Clusters Paralleling Mapping Mechanism for Accelerating Inception and Residual Networks on FPGA. (THU)
 A Kernel Decomposition Architecture for Binary-Weight Convolutional Neural Networks. (THU)
 Ares: A Framework for Quantifying the Resilience of Deep Neural Networks. (Harvard)
 ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators. (New York Univ., IIT Kanpur)
 Loom: Exploiting Weight and Activation Precisions to Accelerate Convolutional Neural Networks. (Univ. of Toronto)
 Parallelizing SRAM Arrays with Customized Bit-Cell for Binary Neural Networks. (Arizona)
 Thermal-Aware Optimizations of ReRAM-Based Neuromorphic Computing Systems. (Northwestern Univ.)
 SNrram: An Efficient Sparse Neural Network Computation Architecture Based on Resistive Random-Access Memory. (THU, UCSB)
 Long Live TIME: Improving Lifetime for Training-In-Memory Engines by Structured Gradient Sparsification. (THU, CAS, MIT)
 Bandwidth-Efficient Deep Learning. (MIT, Stanford)
 Co-Design of Deep Neural Nets and Neural Net Accelerators for Embedded Vision Applications. (Berkeley)
 Sign-Magnitude SC: Getting 10X Accuracy for Free in Stochastic Computing for Deep Neural Networks. (UNIST)
 DrAcc: A DRAM-Based Accelerator for Accurate CNN Inference. (National Univ. of Defense Technology, Indiana Univ., Univ. of Pittsburgh)
 On-Chip Deep Neural Network Storage with Multi-Level eNVM. (Harvard)
 VRL-DRAM: Improving DRAM Performance via Variable Refresh Latency. (Drexel Univ., ETHZ)
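A toy rendition of the FPEC idea from Compensated-DNN above: store a value as a coarse fixed-point number (the computation bits) plus a few compensation bits that approximate the quantization error, so the compensated sum has a smaller worst-case error. The bit widths and encoding below are our illustrative assumptions, not the paper's exact format:

```python
import numpy as np

def fpec_encode(x, frac_bits=4, err_bits=2):
    """Toy FPEC-style encoding: a fixed-point value q (computation
    bits) plus a coarser estimate e of the quantization error x - q
    (compensation bits). Bit widths here are illustrative only."""
    step = 2.0 ** -frac_bits
    q = np.round(x / step) * step                 # computation bits
    err_step = step / 2 ** err_bits
    e = np.round((x - q) / err_step) * err_step   # compensation bits
    return q, e

x = np.array([0.3141, -0.271, 0.577])
q, e = fpec_encode(x)
# Adding the compensation term shrinks the worst-case quantization error.
assert np.max(np.abs(x - (q + e))) < np.max(np.abs(x - q))
```

The hardware insight is that the compensation term is cheap: it needs only a couple of extra bits per value, yet it tightens the accuracy of accumulated MAC results enough to permit fewer computation bits overall.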
2018 HotChips
 ARM's First Generation ML Processor. (ARM)
 The NVIDIA Deep Learning Accelerator. (NVIDIA)
 Xilinx Tensor Processor: An Inference Engine, Network Compiler + Runtime for Xilinx FPGAs. (Xilinx)
 Tachyum Cloud Chip for Hyperscale workloads, deep ML, general, symbolic and bio AI. (Tachyum)
 SMIV: A 16nm SoC with Efficient and Flexible DNN Acceleration for Intelligent IoT Devices. (ARM)
 NVIDIA's Xavier System-on-Chip. (NVIDIA)
 Xilinx Project Everest: HW/SW Programmable Engine. (Xilinx)
2018 ICCAD
 Tetris: Re-architecting Convolutional Neural Network Computation for Machine Learning Accelerators. (CAS)
 3DICT: A Reliable and QoS Capable Mobile Process-In-Memory Architecture for Lookup-Based CNNs in 3D XPoint ReRAMs. (Indiana University Bloomington, Florida International Univ.)
 TGPA: Tile-Grained Pipeline Architecture for Low-Latency CNN Inference. (PKU, UCLA, Falcon)
 NID: Processing Binary Convolutional Neural Network in Commodity DRAM. (KAIST)
 Adaptive-Precision Framework for SGD using Deep Q-Learning. (PKU)
 Efficient Hardware Acceleration of CNNs using Logarithmic Data Representation with Arbitrary Log-Base. (Robert Bosch GmbH)
 C-GOOD: C-code Generation Framework for Optimized On-Device Deep Learning. (SNU)
 Mixed-Size Crossbar-Based RRAM CNN Accelerator with Overlapped Mapping Method. (THU)
 FCN-Engine: Accelerating Deconvolutional Layers in Classic CNN Processors. (HUT, CAS, NUS)
 DNNBuilder: An Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs. (UIUC)
 DIMA: A Depthwise CNN In-Memory Accelerator. (Univ. of Central Florida)
 EMAT: An Efficient Multi-Task Architecture for Transfer Learning using ReRAM. (Duke)
 FATE: Fast and Accurate Timing Error Prediction Framework for Low-Power DNN Accelerator Design. (NYU)
 Designing Adaptive Neural Networks for Energy-Constrained Image Classification. (CMU)
 Watermarking Deep Neural Networks for Embedded Systems. (UCLA)
 Defensive Dropout for Hardening Deep Neural Networks under Adversarial Attacks. (Northeastern Univ., Boston Univ., Florida International Univ.)
 A Cross-Layer Methodology for Design and Optimization of Networks in 2.5D Systems. (Boston Univ., UCSD)
2018 MICRO
 Addressing Irregularity in Sparse Neural Networks: A Cooperative Software/Hardware Approach. (USTC, CAS)
 Diffy: a Déjà vu-Free Differential Deep Neural Network Accelerator. (University of Toronto)
 Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning. (KAIST)
 Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs. (University of Houston, Capital Normal University)
 A Network-Centric Hardware/Algorithm Co-Design to Accelerate Distributed Training of Deep Neural Networks. (UIUC, THU, SJTU, Intel, UCSD)
 PermDNN: Efficient Compressed Deep Neural Network Architecture with Permuted Diagonal Matrices. (City University of New York, University of Minnesota, USC)
 GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware. (Georgia Tech)
 Processing-in-Memory for Energy-Efficient Neural Network Training: A Heterogeneous Approach. (UCM, UCSD, UCSC)
 Schedules computing resources provided by the CPU and heterogeneous PIMs (fixed-function logic + programmable ARM cores) to optimize energy efficiency and hardware utilization.
 LerGAN: A Zero-Free, Low Data Movement and PIM-Based GAN Architecture. (THU, University of Florida)
 Multi-dimensional Parallel Training of Winograd Layer through Distributed Near-Data Processing. (KAIST)
 Applies Winograd to training and extends traditional data parallelism with a new dimension named intra-tile parallelism. With intra-tile parallelism, nodes are divided into several groups, and weight-update communication occurs only within each group. The method shows better scalability for training clusters, as the total communication does not grow with the node count.
 SCOPE: A Stochastic Computing Engine for DRAM-Based In-Situ Accelerator. (UCSB, Samsung)
 Morph: Flexible Acceleration for 3D CNN-Based Video Understanding. (UIUC)
 Inter-thread Communication in Multi-threaded, Reconfigurable Coarse-Grain Arrays. (Technion)
 An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware. (Cornell)
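For reference, the Winograd transform that the KAIST near-data training paper distributes can be shown in its smallest 1-D form, F(2,3): two outputs of a 3-tap convolution computed with four multiplications instead of six. This is the standard minimal-filtering identity, written out directly:

```python
import numpy as np

def winograd_f23(d, g):
    """Winograd minimal filtering F(2,3): two outputs of a 1-D
    3-tap convolution using 4 multiplies (m1..m4) instead of 6."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])   # input tile (4 samples)
g = np.array([0.5, -1.0, 0.25])      # 3-tap filter
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The 2-D variants used in CNN layers nest this transform in both dimensions; the filter-side terms involving only g can be precomputed once per filter, which is why the multiply savings dominate in practice.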
2019 ASPDAC
 An N-Way Group Association Architecture and Sparse Data Group Association Load Balancing Algorithm for Sparse CNN Accelerators. (THU)
 TNPU: An Efficient Accelerator Architecture for Training Convolutional Neural Networks. (ICT)
 NeuralHMC: An Efficient HMC-Based Accelerator for Deep Neural Networks. (University of Pittsburgh, Duke)
 P3M: A PIM-Based Neural Network Model Protection Scheme for Deep Learning Accelerator. (ICT)
 GraphSAR: A Sparsity-Aware Processing-in-Memory Architecture for Large-Scale Graph Processing on ReRAMs. (Tsinghua, MIT, Berkeley)
2019 ISSCC
 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC. (Samsung)
 A 20.5TOPS and 217.3GOPS/mm2 Multicore SoC with DNN Accelerator and Image Signal Processor Complying with ISO26262 for Automotive Applications. (Toshiba)
 An 879GOPS 243mW 80fps VGA Fully Visual CNN-SLAM Processor for Wide-Range Autonomous Exploration. (Michigan)
 A 2.1TFLOPS/W Mobile Deep RL Accelerator with Transposable PE Array and Experience Compression. (KAIST)
 A 65nm 0.39-to-140.3TOPS/W 1-to-12b Unified Neural-Network Processor Using Block-Circulant-Enabled Transpose-Domain Acceleration with 8.1× Higher TOPS/mm2 and 6T HBST-TRAM-Based 2D Data-Reuse Architecture. (THU, National Tsing Hua University, Northeastern University)
 A 65nm 236.5nJ/Classification Neuromorphic Processor with 7.5% Energy Overhead On-Chip Learning Using Direct Spike-Only Feedback. (SNU)
 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16. (KAIST)
 A 1Mb Multibit ReRAM Computing-In-Memory Macro with 14.6ns Parallel MAC Computing Time for CNN-Based AI Edge Processors. (National Tsing Hua University)
 Sandwich-RAM: An Energy-Efficient In-Memory BWN Architecture with Pulse-Width Modulation. (Southeast University, Boxing Electronics, THU)
 A Twin-8T SRAM Computation-In-Memory Macro for Multiple-Bit CNN-Based Machine Learning. (National Tsing Hua University, University of Electronic Science and Technology of China, ASU, Georgia Tech)
 A Reconfigurable RRAM Physically Unclonable Function Utilizing Post-Process Randomness Source with <6×10^-6 Native Bit Error Rate. (THU, National Tsing Hua University, Georgia Tech)
 A 65nm 1.1-to-9.1TOPS/W Hybrid-Digital-Mixed-Signal Computing Platform for Accelerating Model-Based and Model-Free Swarm Robotics. (Georgia Tech)
 A Compute SRAM with Bit-Serial Integer/Floating-Point Operations for Programmable In-Memory Vector Acceleration. (Michigan)
 All-Digital Time-Domain CNN Engine Using Bi-directional Memory Delay Lines for Energy-Efficient Edge Computing. (UT Austin)
Important Topics
This is a collection of papers on other important topics related to neural networks. Papers of significance are marked in bold. My comments are marked in italic.
Tutorial and Survey
 Tutorial on Hardware Architectures for Deep Neural Networks. (MIT)
 A Survey of Neuromorphic Computing and Neural Networks in Hardware. (Oak Ridge National Lab)
 A Survey of FPGA Based Neural Network Accelerator. (Tsinghua)
 Toolflows for Mapping Convolutional Neural Networks on FPGAs: A Survey and Future Directions. (Imperial College London)
Benchmarks
 DAWNBench: An End-to-End Deep Learning Benchmark and Competition. (Stanford)
 MLPerf: A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms.
 Fathom: Reference Workloads for Modern Deep Learning Methods. (Harvard University)
 AlexNet: ImageNet Classification with Deep Convolutional Neural Networks. (University of Toronto, 2012 NIPS)
 Network in Network. (National University of Singapore, 2014 ICLR)
 ZFNet: Visualizing and Understanding Convolutional Networks. (New York University, 2014 ECCV)
 OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. (New York University, 2014 CVPR)
 VGG: Very Deep Convolutional Networks for Large-Scale Image Recognition. (University of Oxford, 2015 ICLR)
 GoogLeNet: Going Deeper with Convolutions. (Google, University of North Carolina, University of Michigan, 2015 CVPR)
 PReLU-Net: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. (MSRA, 2015 ICCV)
 MobileNetV1: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. [code] (Google, 2017 CVPR)
 MobileNetV2: MobileNetV2: Inverted Residuals and Linear Bottlenecks. [code] (Google, 2018 CVPR)
Network Compression
 Neural Network Distiller: Intel's open-source Python package for neural network compression research. [Chinese]
Conference Papers
 Learning both Weights and Connections for Efficient Neural Network. (Stanford University, NVIDIA, 2015 NIPS)
 Prune connections by thresholding weight values.
 Retain accuracy with iterative retraining.
 Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. (Stanford University, Tsinghua University, 2016 ICLR)
 SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. (DeepScale & UC Berkeley, Stanford University)
 8-Bit Approximations for Parallelism in Deep Learning. (Università della Svizzera italiana, 2016 ICLR)
 Neural Networks with Few Multiplications. (Université de Montréal, 2016 ICLR)
 Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. (Samsung, Seoul National University, 2016 ICLR)
 Hardware-oriented Approximation of Convolutional Neural Networks. (UC Davis, 2016 ICLR Workshop)
 Soft Weight-Sharing for Neural Network Compression. (University of Amsterdam, CIFAR, 2017 ICLR)
 Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning. (MIT, 2017 CVPR)
 Estimates the energy consumption of a CNN based on their Eyeriss (ISCA'16) paper.
 Proposes an energy-aware pruning method.
 Scalable and Sustainable Deep Learning via Randomized Hashing. (Rice University, 2017 KDD)
 TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. [code] (Duke University, Hewlett Packard Labs, University of Nevada – Reno, University of Pittsburgh, 2017 NIPS)
 Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. (Intel, 2017 NIPS)
 Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. (Google, 2018 CVPR)
 A quantization scheme that improves the trade-off between accuracy and on-device latency, especially for MobileNet.
 Channel Pruning for Accelerating Very Deep Neural Networks. [code] (2017 ICCV)
 To prune, or not to prune: Exploring the Efficacy of Pruning for Model Compression. [code] (Google, 2018 ICLR Workshop)
 Compares the accuracy of "large, but pruned" models (large-sparse) and their "smaller, but dense" (small-dense) counterparts with identical memory footprint.
 For a given number of non-zero parameters, sparse MobileNets are able to outperform dense MobileNets.
 Training and Inference with Integers in Deep Neural Networks. [code] (Tsinghua, 2018 ICLR)
 A new method termed "WAGE" discretizes both training and inference: weights (W), activations (A), gradients (G) and errors (E) among layers are shifted and linearly constrained to low-bitwidth integers.
 Enables training on integer-based hardware such as deep learning accelerators and neuromorphic chips with comparable accuracy and higher energy efficiency, which is crucial to future AI applications in varied scenarios with transfer and continual learning demands.
 AMC: AutoML for Model Compression and Acceleration on Mobile Devices. (CMU, Google, MIT, 2018 ECCV)
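The prune-retrain recipe of Han et al. (NIPS'15) listed above starts from a simple thresholding step: zero the smallest-magnitude weights, then retrain the survivors, and iterate. A minimal sketch of the thresholding step (the helper name and tensor sizes are our own):

```python
import numpy as np

def magnitude_prune(w, sparsity):
    """Zero the smallest-magnitude `sparsity` fraction of weights; the
    prune-retrain loop would then update only the surviving weights."""
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    thresh = np.sort(np.abs(w), axis=None)[k - 1]   # k-th smallest magnitude
    mask = np.abs(w) > thresh                       # True where weight survives
    return w * mask, mask

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))
pruned, mask = magnitude_prune(w, 0.75)
assert mask.sum() == 16 and np.all(pruned[~mask] == 0)
```

Iterative retraining matters because a single aggressive prune loses accuracy that gradient updates on the surviving weights can largely recover; the papers above repeat the prune/retrain cycle at gradually increasing sparsity.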
arXiv Papers
 Reduced-Precision Strategies for Bounded Memory in Deep Neural Nets. (University of Toronto, University of British Columbia)
 Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1.
 Constrains both the weights and the activations to either +1 or -1.
 Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. (Université de Montréal)
 XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. [code] (Allen Institute for AI, University of Washington)
 DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. (Megvii)
 Deep Learning with Limited Numerical Precision. (IBM)
 Dynamic Network Surgery for Efficient DNNs. (Intel Labs China)
 Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights. [code] (Intel Labs China)
 Exploring the Regularity of Sparse Structure in Convolutional Neural Networks. (Stanford, NVIDIA, Tsinghua)
 Coarser-grained pruning can save memory storage and access while maintaining accuracy.
 A Quantization-Friendly Separable Convolution for MobileNets. (Qualcomm)
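The binarized networks above constrain weights and activations to +1/-1, which lets hardware replace each multiply-accumulate with an XNOR and a popcount. A small sketch of that identity, encoding +1 as bit 1 and -1 as bit 0 (function names are ours):

```python
import numpy as np

def binarize(x):
    """Deterministic binarization to +1/-1 (taking sign(0) = +1)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_popcount_dot(a_bits, b_bits):
    """Dot product of two +-1 vectors via the identity binary
    accelerators exploit: dot = n - 2 * popcount(a XOR b)."""
    n = a_bits.size
    diff = np.count_nonzero(a_bits != b_bits)  # popcount of the XOR
    return n - 2 * diff

a = binarize(np.array([0.3, -1.2, 0.0, 2.5]))
b = binarize(np.array([-0.7, -0.1, 0.9, 1.1]))
a_bits, b_bits = (a > 0), (b > 0)
assert xnor_popcount_dot(a_bits, b_bits) == int(a @ b)
```

Because 64 such bits pack into one machine word, a single XOR plus popcount instruction stands in for 64 multiply-accumulates, which is the source of the large speedups these papers report.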
Other Topics
GAN
 Generative Adversarial Nets. (Universite de Montreal, 2014 NIPS)
 Two "adversarial" MLP models, G and D: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G.
 D is trained to learn the above probability.
 G is trained to maximize the probability of D making a mistake.
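The setup above corresponds to the minimax value V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))] from the paper: D is trained to increase it, G to decrease it. A tiny numeric sketch on made-up discriminator scores (the sample values are illustrative only):

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Sample estimate of the GAN minimax value V(D, G):
    mean(log D(x)) + mean(log(1 - D(G(z))))."""
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

d_real = np.array([0.9, 0.8])           # D's scores on real training data
d_fake = np.array([0.1, 0.2])           # D's scores on generated samples
confident = gan_value(d_real, d_fake)   # D distinguishes real from fake
fooled = gan_value(d_real, np.array([0.9, 0.8]))  # G fools D: D(G(z)) near 1
assert fooled < confident               # G lowering V means D is mistaken
```

At the theoretical equilibrium, G matches the data distribution and D outputs 1/2 everywhere, so neither player can improve its side of V.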
NAS
Others
 You Only Look Once: Unified, RealTime Object Detection. [code] (University of Washington, Allen Institute for AI, Facebook AI Research, 2016 CVPR)
 A-Fast-RCNN: Hard Positive Generation via Adversary for Object Detection. (CMU, 2017 CVPR)
 Dilated Residual Networks. [code] (Princeton, Intel, 2017 CVPR)
 Deformable Convolutional Networks. [code] (MSRA, 2017 ICCV)
 ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. (Megvii, 2017 CVPR)
 Federated Optimization: Distributed Machine Learning for On-Device Intelligence. (University of Edinburgh, Google)
 Deep Complex Networks. (Université de Montréal, Element AI)
 One Model To Learn Them All. (Google, University of Toronto)
 Densely Connected Convolutional Networks. (Cornell, Tsinghua, FAIR, 2017 CVPR)
 YOLO9000: Better, Faster, Stronger. [code] (University of Washington, 2017 CVPR)
Industry Contributions
 Movidius
 Myriad 2: Hardware-accelerated visual intelligence at ultra-low power.
 Fathom Neural Compute Stick: The world's first discrete deep learning accelerator (Myriad 2 VPU inside).
 Myriad X: On-device AI and computer vision.
 NVIDIA
 Jetson TX1: Embedded visual computing developing platform.
 DGX-1: Deep learning supercomputer.
 Tesla V100: A data center GPU with Tensor Cores inside.
 NVDLA: The NVIDIA Deep Learning Accelerator (NVDLA) is a free and open architecture that promotes a standard way to design deep learning inference accelerators.
 Nervana
 Nervana Engine: Hardware optimized for deep learning.
 Wave Computing
 Clockless CGRA architecture.