A collection of research papers on efficient training of DNNs. If you find any papers we have missed, please open an issue or a pull request.
- [2021 | AAAI] Distribution Adaptive INT8 Quantization for Training CNNs [paper]
- [2021 | ICLR] CPT: Efficient Deep Neural Network Training via Cyclic Precision [paper] [code]
- [2021 | tinyML] TENT: Efficient Quantization of Neural Networks on the tiny Edge with Tapered FixEd PoiNT [paper]
- [2021 | arXiv] RCT: Resource Constrained Training for Edge AI [paper]
- [2021 | arXiv] A Simple and Efficient Stochastic Rounding Method for Training Neural Networks in Low Precision [paper]
- [2021 | arXiv] Enabling Binary Neural Network Training on the Edge [paper]
- [2021 | arXiv] In-Hindsight Quantization Range Estimation for Quantized Training [paper]
- [2021 | arXiv] Towards Efficient Full 8-bit Integer DNN Online Training on Resource-limited Devices without Batch Normalization [paper]
- [2021 | arXiv] Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update [paper]
- [2020 | Neural Networks] Training High-Performance and Large-Scale Deep Neural Networks with Full 8-bit Integers [paper] [code]
- [2020 | TC] Evaluations on Deep Neural Networks Training Using Posit Number System [paper]
- [2020 | CVPR] Towards Unified INT8 Training for Convolutional Neural Network [paper]
- [2020 | CVPR] Fixed-Point Back-Propagation Training [paper]
- [2020 | ICLR] Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks [paper]
- [2020 | ICML] Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs [paper]
- [2020 | IJCAI] Reducing Underflow in Mixed Precision Training by Gradient Scaling [paper]
- [2020 | NIPS] FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [paper] [code]
- [2020 | NIPS] Ultra-Low Precision 4-bit Training of Deep Neural Networks [paper]
- [2020 | NIPS] A Statistical Framework for Low-bitwidth Training of Deep Neural Networks [paper] [code]
- [2020 | arXiv] Adaptive Precision Training for Resource Constrained Devices [paper]
- [2020 | arXiv] Training and Inference for Integer-Based Semantic Segmentation Network [paper] [code]
- [2020 | arXiv] NITI: Training Integer Neural Networks Using Integer-only Arithmetic [paper] [code]
- [2020 | arXiv] Neural gradients are lognormally distributed: understanding sparse and quantized training [paper] [code]
- [2020 | arXiv] Exploring the Potential of Low-bit Training of Convolutional Neural Networks [paper]
- [2019 | JETCAS] FloatSD: A New Weight Representation and Associated Update Method for Efficient Convolutional Neural Network Training [paper]
- [2019 | ICLR] Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm [paper]
- [2019 | ICLR] Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks [paper]
- [2019 | ICML] SWALP: Stochastic Weight Averaging in Low-Precision Training [paper] [code]
- [2019 | NIPS] Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks [paper]
- [2019 | NIPS] Backprop with Approximate Activations for Memory-efficient Network Training [paper] [code]
- [2019 | NIPS] Dimension-Free Bounds for Low-Precision Training [paper]
- [2019 | arXiv] Cheetah: Mixed Low-Precision Hardware & Software Co-Design Framework for DNNs on the Edge [paper]
- [2019 | arXiv] Distributed Low Precision Training Without Mixed Precision [paper]
- [2019 | arXiv] Mixed Precision Training With 8-bit Floating Point [paper]
- [2019 | arXiv] A Study of BFLOAT16 for Deep Learning Training [paper]
- [2018 | ACL] Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq [paper]
- [2018 | ECCV] Value-aware Quantization for Training and Inference of Neural Networks [paper]
- [2018 | ICCD] Training Neural Networks with Low Precision Dynamic Fixed-Point [paper]
- [2018 | ICLR] Mixed Precision Training [paper]
- [2018 | ICLR] Training and Inference with Integers in Deep Neural Networks [paper] [code]
- [2018 | ICLR] Mixed Precision Training of Convolutional Neural Networks using Integer Operations [paper]
- [2018 | NIPS] Scalable Methods for 8-bit Training of Neural Networks [paper] [code]
- [2018 | NIPS] Training Deep Neural Networks with 8-bit Floating Point Numbers [paper]
- [2018 | NIPS] Training DNNs with Hybrid Block Floating Point [paper]
- [2018 | arXiv] High-Accuracy Low-Precision Training [paper]
- [2018 | arXiv] Low-Precision Floating-Point Schemes for Neural Network Training [paper]
- [2018 | arXiv] Training Deep Neural Network in Limited Precision [paper]
- [2017 | ICML] The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning [paper] [code]
- [2017 | IJCNN] FxpNet: Training a deep convolutional neural network in fixed-point representation [paper]
- [2017 | NIPS] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks [paper]
- [2016 | arXiv] DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [paper] [code]
- [2016 | arXiv] Convolutional Neural Networks using Logarithmic Data Representation [paper]
- [2015 | ICLR] Training deep neural networks with low precision multiplications [paper]
- [2015 | ICML] Deep Learning with Limited Numerical Precision [paper]
- [2015 | arXiv] 8-Bit Approximations for Parallelism in Deep Learning [paper]
- [2014 | INTERSPEECH] 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs [paper]
- [2021 | IEEE Access] Roulette: A Pruning Framework to Train a Sparse Neural Network From Scratch [paper]
- [2021 | CVPR] The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models [paper] [code]
- [2021 | ICLR] Progressive Skeletonization: Trimming more fat from a network at initialization
- [2021 | ICLR] Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
- [2021 | ICLR] Pruning Neural Networks at Initialization: Why Are We Missing the Mark?
- [2021 | ICS] ClickTrain: Efficient and Accurate End-to-End Deep Learning Training via Fine-Grained Architecture-Preserving Pruning [paper] [code]
- [2021 | arXiv] Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks [paper] [code]
- [2021 | arXiv] Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [paper]
- [2021 | arXiv] FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training with Dynamic Sparsity [paper]
- [2020 | TCAD] Enabling On-Device CNN Training by Self-Supervised Instance Filtering and Error Map Pruning [paper] [code]
- [2020 | ECCV] Accelerating CNN Training by Pruning Activation Gradients [paper]
- [2020 | ICLR] Picking Winning Tickets Before Training by Preserving Gradient Flow [paper] [code]
- [2020 | ICLR] Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers [paper] [code]
- [2020 | ICLR] Drawing early-bird tickets: Towards more efficient training of deep networks [paper] [code]
- [2020 | MICRO] Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training [paper]
- [2020 | NIPS] Sparse Weight Activation Training [paper]
- [2020 | arXiv] Progressive Gradient Pruning for Classification, Detection and Domain Adaptation [paper] [code]
- [2020 | arXiv] Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks [paper] [code]
- [2020 | arXiv] Campfire: Compressible, Regularization-Free, Structured Sparse Training for Hardware Accelerators [paper] [code]
- [2019 | SysML] Full deep neural network training on a pruned weight budget [paper]
- [2019 | SC] PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration [paper]
- [2018 | ICLR] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training [paper]
- [2017 | ICML] meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting [paper]
- [2021 | ICLR] Revisiting Locally Supervised Learning: an Alternative to End-to-end Training [paper] [code]
- [2021 | ICLR] Optimizer Fusion: Efficient Training with Better Locality and Parallelism [paper] [code]
- [2021 | MLSys] Wavelet: Efficient DNN Training with Tick-Tock Scheduling [paper]
- [2021 | arXiv] AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning [paper]
- [2020 | NIPS] Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures [paper] [code]
- [2020 | NIPS] TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning [paper]
- [2019 | ICML] Training Neural Networks with Local Error Signals [paper] [code]
- [2019 | ICML] Error Feedback Fixes SignSGD and other Gradient Compression Schemes [paper] [code]
- [2019 | NIPS] E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings [paper]
- [2019 | NIPS] AutoAssist: A Framework to Accelerate Training of Deep Neural Networks [paper] [code]
- [2018 | ICML] signSGD: Compressed Optimisation for Non-Convex Problems [paper] [code]
- [2017 | ICML] Understanding Synthetic Gradients and Decoupled Neural Interfaces [paper] [code]
- [2017 | NIPS] The Reversible Residual Network: Backpropagation Without Storing Activations [paper] [code]
- [2016 | ICML] Decoupled Neural Interfaces using Synthetic Gradients [paper] [code]
- [2016 | arXiv] Training Deep Nets with Sublinear Memory Cost [paper] [code]
- [2021 | OJSSC] An Overview of Energy-Efficient Hardware Accelerators for On-Device Deep-Neural-Network Training
- [2022 | ISCA] Anticipating and Eliminating Redundant Computations in Accelerated Sparse Training
- [2022 | TCAS-I] SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor With Dynamic Sub-Structured Weight Pruning
- [2022 | TCAS-I] TSUNAMI: Triple Sparsity-Aware Ultra Energy-Efficient Neural Network Training Accelerator With Multi-Modal Iterative Pruning
- [2022 | HPCA] FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding
- [2022 | JSSC] A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling
- [2022 | arXiv] EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
- [2021 | JSSC] HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching [paper]
- [2021 | JSSC] GANPU: An Energy-Efficient Multi-DNN Training Processor for GANs With Speculative Dual-Sparsity Exploitation [paper]
- [2021 | JSSC] A Neural Network Training Processor With 8-Bit Shared Exponent Bias Floating Point and Multiple-Way Fused Multiply-Add Trees
- [2021 | ISSCC] A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling [paper]
- [2021 | ISSCC] A 40nm 4.81TFLOPS/W 8b Floating-Point Training Processor for Non-Sparse Neural Networks Using Shared Exponent Bias and 24-Way Fused Multiply-Add Tree [paper]
- [2021 | ISCA] RaPiD: AI Accelerator for Ultra-low Precision Training and Inference [paper]
- [2021 | ISCA] Cambricon-Q: A Hybrid Architecture for Efficient Training
- [2021 | ISCA] NASA: Accelerating Neural Network Design with a NAS Processor
- [2021 | ISCA] Ten Lessons From Three Generations Shaped Google’s TPUv4i: Industrial Product
- [2021 | ISCAS] A 3.6 TOPS/W Hybrid FP-FXP Deep Learning Processor with Outlier Compensation for Image-to-image Application
- [2021 | VLSI] A 28nm 276.55TFLOPS/W Sparse Deep-Neural-Network Training Processor with Implicit Redundancy Speculation and Batch Normalization Reformulation
- [2021 | COOL] An Energy-Efficient Deep Neural Network Training Processor with Bit-Slice-Level Reconfigurability and Sparsity Exploitation
- [2021 | MICRO] FPRaker: A Processing Element For Accelerating Neural Network Training
- [2021 | MICRO] Equinox: Training (for Free) on a Custom Inference Accelerator
- [2021 | TC] A Deep Neural Network Training Architecture with Inference-aware Heterogeneous Data-type
- [2021 | TCAS-I] Memory Access Optimization for On-Chip Transfer Learning
- [2021 | TCAS-II] A 64.1mW Accurate Real-time Visual Object Tracking Processor with Spatial Early Stopping on Siamese Network
- [2020 | IEEE Access] Training Hardware for Binarized Convolutional Neural Network Based on CMOS Invertible Logic [paper]
- [2020 | JSSC] Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning [paper]
- [2020 | JSSC] DF-LNPU: A Pipelined Direct Feedback Alignment-Based Deep Neural Network Learning Processor for Fast Online Learning [paper]
- [2020 | JSSC] An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices [paper]
- [2020 | LSSC] PNPU: An Energy-Efficient Deep-Neural-Network Learning Processor With Stochastic Coarse–Fine Level Weight Pruning and Adaptive Input/Output/Weight Zero Skipping [paper]
- [2020 | TETC] SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference [paper]
- [2020 | DAC] SCA: A Secure CNN Accelerator for Both Training and Inference [paper]
- [2020 | DAC] Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training [paper]
- [2020 | DAC] SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training [paper]
- [2020 | DAC] A Pragmatic Approach to On-device Incremental Learning System with Selective Weight Updates [paper]
- [2020 | ISLPED] SparTANN: sparse training accelerator for neural networks with threshold-based sparsification [paper]
- [2020 | ISSCC] GANPU: A 135TFLOPS/W Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation [paper]
- [2020 | MICRO] Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training [paper]
- [2020 | MICRO] TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training [paper]
- [2020 | HPCA] SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training [paper]
- [2020 | VLSI] A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference [paper]
- [2020 | VLSI] A 146.52 TOPS/W Deep-Neural-Network Learning Processor with Stochastic Coarse-Fine Pruning and Adaptive Input/Output/Weight Skipping [paper]
- [2020 | arXiv] FPRaker: A Processing Element For Accelerating Neural Network Training [paper]
- [2020 | ISCAS] TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training [paper]
- [2019 | LSSC] A 2.6 TOPS/W 16-bit Fixed-Point Convolutional Neural Network Learning Processor in 65nm CMOS [paper]
- [2019 | LSSC] An Energy-Efficient Deep Reinforcement Learning Accelerator With Transposable PE Array and Experience Compression [paper]
- [2019 | LSSC] An Energy-Efficient Sparse Deep-Neural-Network Learning Accelerator with Fine-grained Mixed Precision of FP8-FP16 [paper]
- [2019 | TCAS-I] A Low-Power Deep Neural Network Online Learning Processor for Real-Time Object Tracking Application [paper]
- [2019 | ASPDAC] TNPU: an efficient accelerator architecture for training convolutional neural networks [paper]
- [2019 | ASSCC] A 2.25 TOPS/W Fully-Integrated Deep CNN Learning Processor with On-Chip Training [paper]
- [2019 | DAC] Acceleration of DNN Backward Propagation by Selective Computation of Gradients [paper]
- [2019 | DAC] An Optimized Design Technique of Low-bit Neural Network Training for Personalization on IoT Devices [paper]
- [2019 | ISSCC] LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16 [paper]
- [2019 | SysML] Mini-batch Serialization: CNN Training with Inter-layer Data Reuse [paper]
- [2019 | VLSI] A 1.32 TOPS/W Energy Efficient Deep Neural Network Learning Processor with Direct Feedback Alignment based Heterogeneous Core Architecture [paper]
- [2018 | LSSC] A Scalable Multi-TeraOPS Core for AI Training and Inference [paper]
- [2018 | VLSI] A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference [paper]
- [2017 | DAC] Design of an Energy-Efficient Accelerator for Training of Convolutional Neural Networks using Frequency-Domain Computation [paper]
- [2017 | ISCA] SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks [paper]
- [2014 | MICRO] DaDianNao: A Machine-Learning Supercomputer [paper]
- [2022 | TNNLS] ETA: An Efficient Training Accelerator for DNNs Based on Hardware-Algorithm Co-Optimization
- [2021 | ICS] Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators [paper]
- [2020 | TC] A Neural Network-Based On-Device Learning Anomaly Detector for Edge Devices [paper]
- [2020 | ICCAD] FPGA-based low-batch training accelerator for modern CNNs featuring high bandwidth memory [paper]
- [2020 | IJCAI] Efficient and Modularized Training on FPGA for Real-time Applications [paper]
- [2020 | ISCAS] Training Progressively Binarizing Deep Networks Using FPGAs [paper]
- [2020 | FPL] Dynamically Growing Neural Network Architecture for Lifelong Deep Learning on the Edge [paper]
- [2019 | FPT] Training Deep Neural Networks in Low-Precision with High Accuracy Using FPGAs [paper]
- [2019 | NEWCAS] Efficient Hardware Implementation of Incremental Learning and Inference on Chip [paper]
- [2019 | FPL] FPGA-Based Training Accelerator Utilizing Sparseness of Convolutional Neural Network [paper]
- [2019 | FPL] Automatic Compiler Based FPGA Accelerator for CNN Training [paper]
- [2019 | FCCM] Towards Efficient Deep Neural Network Training by FPGA-Based Batch-Level Parallelism [paper]
- [2019 | FPGA] Compressed CNN Training with FPGA-based Accelerator [paper]
- [2018 | FCCM] FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters [paper]
- [2018 | FPL] A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing [paper]
- [2018 | FPL] ClosNets: Batchless DNN Training with On-Chip a Priori Sparse Neural Topologies [paper]
- [2018 | ReConFig] A Highly Parallel FPGA Implementation of Sparse Neural Network Training [paper]
- [2018 | ISLPED] TrainWare: A Memory Optimized Weight Update Architecture for On-Device Convolutional Neural Network Training [paper]
- [2017 | FPT] An FPGA-based processor for training convolutional neural networks [paper]
- [2017 | FPT] FPGA-based training of convolutional neural networks with a reduced precision floating-point library [paper]
- [2016 | ASAP] F-CNN: An FPGA-based framework for training Convolutional Neural Networks [paper]
- [2021 | VLSI] CHIMERA: A 0.92 TOPS, 2.2 TOPS/W Edge AI Accelerator with 2 MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference
- [2021 | TC] AILC: Accelerate On-chip Incremental Learning with Compute-in-Memory Technology [paper]
- [2021 | TC] PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-efficient ReRAM [paper]
- [2019 | TC] A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets [paper]
- [2018 | TCAD] DeepTrain: A Programmable Embedded Platform for Training Deep Neural Networks [paper]
- [2017 | HPCA] PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning [paper]