Awesome Efficient Training

A collection of research papers on efficient training of DNNs. If you notice a missing paper, please open an issue or a pull request.

Contents

Algorithm

Quantization

  • [2021 | AAAI] Distribution Adaptive INT8 Quantization for Training CNNs [paper]
  • [2021 | ICLR] CPT: Efficient Deep Neural Network Training via Cyclic Precision [paper] [code]
  • [2021 | tinyML] TENT: Efficient Quantization of Neural Networks on the tiny Edge with Tapered FixEd PoiNT [paper]
  • [2021 | arXiv] RCT: Resource Constrained Training for Edge AI [paper]
  • [2021 | arXiv] A Simple and Efficient Stochastic Rounding Method for Training Neural Networks in Low Precision [paper]
  • [2021 | arXiv] Enabling Binary Neural Network Training on the Edge [paper]
  • [2021 | arXiv] In-Hindsight Quantization Range Estimation for Quantized Training [paper]
  • [2021 | arXiv] Towards Efficient Full 8-bit Integer DNN Online Training on Resource-limited Devices without Batch Normalization [paper]
  • [2021 | arXiv] Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update [paper]
  • [2020 | Neural Networks] Training High-Performance and Large-Scale Deep Neural Networks with Full 8-bit Integers [paper] [code]
  • [2020 | TC] Evaluations on Deep Neural Networks Training Using Posit Number System [paper]
  • [2020 | CVPR] Towards Unified INT8 Training for Convolutional Neural Network [paper]
  • [2020 | CVPR] Fixed-Point Back-Propagation Training [paper]
  • [2020 | ICLR] Shifted and Squeezed 8-bit Floating Point format for Low-Precision Training of Deep Neural Networks [paper]
  • [2020 | ICML] Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs [paper]
  • [2020 | IJCAI] Reducing Underflow in Mixed Precision Training by Gradient Scaling [paper]
  • [2020 | NIPS] FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [paper] [code]
  • [2020 | NIPS] Ultra-Low Precision 4-bit Training of Deep Neural Networks [paper]
  • [2020 | NIPS] A Statistical Framework for Low-bitwidth Training of Deep Neural Networks [paper] [code]
  • [2020 | arXiv] Adaptive Precision Training for Resource Constrained Devices [paper]
  • [2020 | arXiv] Training and Inference for Integer-Based Semantic Segmentation Network [paper] [code]
  • [2020 | arXiv] NITI: Training Integer Neural Networks Using Integer-only Arithmetic [paper] [code]
  • [2020 | arXiv] Neural gradients are lognormally distributed: understanding sparse and quantized training [paper] [code]
  • [2020 | arXiv] Exploring the Potential of Low-bit Training of Convolutional Neural Networks [paper]
  • [2019 | JETCAS] FloatSD: A New Weight Representation and Associated Update Method for Efficient Convolutional Neural Network Training [paper]
  • [2019 | ICLR] Per-Tensor Fixed-Point Quantization of the Back-Propagation Algorithm [paper]
  • [2019 | ICLR] Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks [paper]
  • [2019 | ICML] SWALP: Stochastic Weight Averaging in Low-Precision Training [paper] [code]
  • [2019 | NIPS] Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks [paper]
  • [2019 | NIPS] Backprop with Approximate Activations for Memory-efficient Network Training [paper] [code]
  • [2019 | NIPS] Dimension-Free Bounds for Low-Precision Training [paper]
  • [2019 | arXiv] Cheetah: Mixed Low-Precision Hardware & Software Co-Design Framework for DNNs on the Edge [paper]
  • [2019 | arXiv] Distributed Low Precision Training Without Mixed Precision [paper]
  • [2019 | arXiv] Mixed Precision Training With 8-bit Floating Point [paper]
  • [2019 | arXiv] A Study of BFLOAT16 for Deep Learning Training [paper]
  • [2018 | ACL] Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq [paper]
  • [2018 | ECCV] Value-aware Quantization for Training and Inference of Neural Networks [paper]
  • [2018 | ICCD] Training Neural Networks with Low Precision Dynamic Fixed-Point [paper]
  • [2018 | ICLR] Mixed Precision Training [paper]
  • [2018 | ICLR] Training and Inference with Integers in Deep Neural Networks [paper] [code]
  • [2018 | ICLR] Mixed Precision Training of Convolutional Neural Networks using Integer Operations [paper]
  • [2018 | NIPS] Scalable Methods for 8-bit Training of Neural Networks [paper] [code]
  • [2018 | NIPS] Training Deep Neural Networks with 8-bit Floating Point Numbers [paper]
  • [2018 | NIPS] Training DNNs with Hybrid Block Floating Point [paper]
  • [2018 | arXiv] High-Accuracy Low-Precision Training [paper]
  • [2018 | arXiv] Low-Precision Floating-Point Schemes for Neural Network Training [paper]
  • [2018 | arXiv] Training Deep Neural Network in Limited Precision [paper]
  • [2017 | ICML] The ZipML Framework for Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning [paper] [code]
  • [2017 | IJCNN] FxpNet: Training a deep convolutional neural network in fixed-point representation [paper]
  • [2017 | NIPS] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks [paper]
  • [2016 | arXiv] DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients [paper] [code]
  • [2016 | arXiv] Convolutional Neural Networks using Logarithmic Data Representation [paper]
  • [2015 | ICLR] Training deep neural networks with low precision multiplications [paper]
  • [2015 | ICML] Deep Learning with Limited Numerical Precision [paper]
  • [2015 | arXiv] 8-Bit Approximations for Parallelism in Deep Learning [paper]
  • [2014 | INTERSPEECH] 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs [paper]
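
Most of the low-precision training work above combines a reduced-precision number format with some form of unbiased rounding of weights, activations, or gradients. The snippet below is a minimal NumPy sketch of stochastic rounding onto a fixed-point grid, meant only to illustrate that common mechanism; `stochastic_round` and the toy SGD step are illustrative assumptions, not the implementation of any paper listed here.

```python
import numpy as np

def stochastic_round(x, num_bits=8, frac_bits=4, rng=None):
    """Round x onto a signed fixed-point grid (num_bits total, frac_bits fractional),
    rounding up with probability equal to the distance to the lower grid point."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 2.0 ** frac_bits
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scaled = np.clip(np.asarray(x, dtype=np.float64) * scale, qmin, qmax)
    lower = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - lower)  # stochastic rounding
    return (lower + round_up) / scale

# Toy usage: take an SGD step in full precision, then round the result back onto
# the low-precision grid. The rounding is unbiased in expectation, which is the
# usual argument for stochastic rounding over round-to-nearest during training.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.5, size=(4, 4))
g = rng.normal(scale=1e-2, size=(4, 4))
w_q = stochastic_round(w - 0.1 * g, num_bits=8, frac_bits=4, rng=rng)
print(np.abs(w_q - (w - 0.1 * g)).max())  # bounded by one grid step, 2**-frac_bits
```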

Pruning

  • [2021 | IEEE Access] Roulette: A Pruning Framework to Train a Sparse Neural Network From Scratch [paper]
  • [2021 | CVPR] The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models [paper] [code]
  • [2021 | ICLR] Progressive Skeletonization: Trimming more fat from a network at initialization
  • [2021 | ICLR] Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch
  • [2021 | ICLR] Pruning Neural Networks at Initialization: Why Are We Missing the Mark?
  • [2021 | ICS] ClickTrain: Efficient and Accurate End-to-End Deep Learning Training via Fine-Grained Architecture-Preserving Pruning [paper] [code]
  • [2021 | arXiv] Accelerated Sparse Neural Training: A Provable and Efficient Method to Find N:M Transposable Masks [paper] [code]
  • [2021 | arXiv] Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [paper]
  • [2021 | arXiv] FreeTickets: Accurate, Robust and Efficient Deep Ensemble by Training with Dynamic Sparsity [paper]
  • [2020 | TCAD] Enabling On-Device CNN Training by Self-Supervised Instance Filtering and Error Map Pruning [paper] [code]
  • [2020 | ECCV] Accelerating CNN Training by Pruning Activation Gradients [paper]
  • [2020 | ICLR] Picking Winning Tickets Before Training by Preserving Gradient Flow [paper] [code]
  • [2020 | ICLR] Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers [paper] [code]
  • [2020 | ICLR] Drawing early-bird tickets: Towards more efficient training of deep networks [paper] [code]
  • [2020 | MICRO] Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training [paper]
  • [2020 | NIPS] Sparse Weight Activation Training [paper]
  • [2020 | arXiv] Progressive Gradient Pruning for Classification, Detection and Domain Adaptation [paper] [code]
  • [2020 | arXiv] Gradual Channel Pruning while Training using Feature Relevance Scores for Convolutional Neural Networks [paper] [code]
  • [2020 | arXiv] Campfire: Compressible, Regularization-Free, Structured Sparse Training for Hardware Accelerators [paper] [code]
  • [2019 | SysML] Full deep neural network training on a pruned weight budget [paper]
  • [2019 | SC] PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration [paper]
  • [2018 | ICLR] Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training [paper]
  • [2017 | ICML] meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting [paper]
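
A common thread in the sparse-training papers above is a binary magnitude mask, chosen at (or near) initialization and applied to both the weights and their gradients, so the network stays sparse for the whole run. The sketch below shows that basic mechanism in NumPy; `topk_mask` and the toy update loop are illustrative assumptions rather than any single paper's algorithm.

```python
import numpy as np

def topk_mask(w, density=0.25):
    """Binary mask keeping the `density` fraction of largest-magnitude entries of w."""
    k = max(1, int(round(density * w.size)))
    threshold = np.partition(np.abs(w).ravel(), -k)[-k]
    return (np.abs(w) >= threshold).astype(w.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
mask = topk_mask(w, density=0.25)  # mask chosen once, before training
w *= mask                          # start from a sparse network

for step in range(100):
    g = rng.normal(size=w.shape)   # stand-in for a real back-propagated gradient
    w -= 0.1 * (g * mask)          # masked update: pruned weights never re-activate
print(f"non-zero fraction: {np.count_nonzero(w) / w.size:.2f}")  # stays ~= density
```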

Others

  • [2021 | ICLR] Revisiting Locally Supervised Learning: an Alternative to End-to-end Training [paper] [code]
  • [2021 | ICLR] Optimizer Fusion: Efficient Training with Better Locality and Parallelism [paper] [code]
  • [2021 | MLSys] Wavelet: Efficient DNN Training with Tick-Tock Scheduling [paper]
  • [2021 | arXiv] AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning [paper]
  • [2020 | NIPS] Direct Feedback Alignment Scales to Modern Deep Learning Tasks and Architectures [paper] [code]
  • [2020 | NIPS] TinyTL: Reduce Memory, Not Parameters for Efficient On-Device Learning [paper]
  • [2019 | ICML] Training Neural Networks with Local Error Signals [paper] [code]
  • [2019 | ICML] Error Feedback Fixes SignSGD and other Gradient Compression Schemes [paper] [code]
  • [2019 | NIPS] E2-Train: Training State-of-the-art CNNs with Over 80% Energy Savings [paper]
  • [2019 | NIPS] AutoAssist: A Framework to Accelerate Training of Deep Neural Networks [paper] [code]
  • [2018 | ICML] signSGD: Compressed Optimisation for Non-Convex Problems [paper] [code]
  • [2017 | ICML] Understanding Synthetic Gradients and Decoupled Neural Interfaces [paper] [code]
  • [2017 | NIPS] The Reversible Residual Network: Backpropagation Without Storing Activations [paper] [code]
  • [2016 | ICML] Decoupled Neural Interfaces using Synthetic Gradients [paper] [code]
  • [2016 | arXiv] Training Deep Nets with Sublinear Memory Cost [paper] [code]
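
Several entries above cut training memory rather than arithmetic, most directly by recomputing activations during the backward pass instead of storing them all (activation checkpointing, the idea behind "Training Deep Nets with Sublinear Memory Cost"). Below is a minimal NumPy sketch of that idea for a toy chain of ReLU layers; `layer_fwd`, `layer_bwd`, and the segment length are illustrative assumptions, not a faithful reimplementation.

```python
import numpy as np

def layer_fwd(W, x):
    return np.maximum(W @ x, 0.0)                 # toy layer: ReLU(W x)

def layer_bwd(W, x, grad_out):
    pre = W @ x
    grad_pre = grad_out * (pre > 0)               # ReLU gradient
    return W.T @ grad_pre, np.outer(grad_pre, x)  # (grad wrt input, grad wrt W)

def checkpointed_backward(Ws, x0, grad_y, seg=2):
    """Backward pass that keeps only every `seg`-th activation from the forward
    pass; the rest are recomputed per segment, trading compute for memory."""
    n = len(Ws)
    ckpts, x = {}, x0
    for i, W in enumerate(Ws):                    # forward: store sparse checkpoints
        if i % seg == 0:
            ckpts[i] = x                          # input to layer i
        x = layer_fwd(W, x)
    grads_W, grad = [None] * n, grad_y
    for start in range(((n - 1) // seg) * seg, -1, -seg):
        end = min(start + seg, n)
        acts = [ckpts[start]]                     # recompute this segment's activations
        for i in range(start, end):
            acts.append(layer_fwd(Ws[i], acts[-1]))
        for i in range(end - 1, start - 1, -1):   # backprop through the segment
            grad, grads_W[i] = layer_bwd(Ws[i], acts[i - start], grad)
    return grads_W

# Toy usage on a 6-layer chain: only the segment-boundary activations are kept
# from the forward pass; everything else is recomputed on demand.
rng = np.random.default_rng(0)
Ws = [0.1 * rng.normal(size=(16, 16)) for _ in range(6)]
x0 = rng.normal(size=16)
grads = checkpointed_backward(Ws, x0, grad_y=np.ones(16), seg=3)
print([g.shape for g in grads])                   # one weight gradient per layer
```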

Hardware

Survey

  • [2021 | OJSSC] An Overview of Energy-Efficient Hardware Accelerators for On-Device Deep-Neural-Network Training

ASIC

  • [2022 | ISCA] Anticipating and Eliminating Redundant Computations in Accelerated Sparse Training
  • [2022 | TCAS-I] SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor With Dynamic Sub-Structured Weight Pruning
  • [2022 | TCAS-I] TSUNAMI: Triple Sparsity-Aware Ultra Energy-Efficient Neural Network Training Accelerator With Multi-Modal Iterative Pruning
  • [2022 | HPCA] FAST: DNN Training Under Variable Precision Block Floating Point with Stochastic Rounding
  • [2022 | JSSC] A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling
  • [2022 | ArXiv] EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators
  • [2021 | JSSC] HNPU: An Adaptive DNN Training Processor Utilizing Stochastic Dynamic Fixed-Point and Active Bit-Precision Searching [paper]
  • [2021 | JSSC] GANPU: An Energy-Efficient Multi-DNN Training Processor for GANs With Speculative Dual-Sparsity Exploitation [paper]
  • [2021 | JSSC] A Neural Network Training Processor With 8-Bit Shared Exponent Bias Floating Point and Multiple-Way Fused Multiply-Add Trees
  • [2021 | ISSCC] A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling [paper]
  • [2021 | ISSCC] A 40nm 4.81TFLOPS/W 8b Floating-Point Training Processor for Non-Sparse Neural Networks Using Shared Exponent Bias and 24-Way Fused Multiply-Add Tree [paper]
  • [2021 | ISCA] RaPiD: AI Accelerator for Ultra-low Precision Training and Inference [paper]
  • [2021 | ISCA] Cambricon-Q: A Hybrid Architecture for Efficient Training
  • [2021 | ISCA] NASA: Accelerating Neural Network Design with a NAS Processor
  • [2021 | ISCA] Ten Lessons From Three Generations Shaped Google’s TPUv4i : Industrial Product
  • [2021 | ISCAS] A 3.6 TOPS/W Hybrid FP-FXP Deep Learning Processor with Outlier Compensation for Image-to-image Application
  • [2021 | VLSI] A 28nm 276.55TFLOPS/W Sparse Deep-Neural-Network Training Processor with Implicit Redundancy Speculation and Batch Normalization Reformulation
  • [2021 | COOL] An Energy-Efficient Deep Neural Network Training Processor with Bit-Slice-Level Reconfigurability and Sparsity Exploitation
  • [2021 | MICRO] FPRaker: A Processing Element For Accelerating Neural Network Training
  • [2021 | MICRO] Equinox: Training (for Free) on a Custom Inference Accelerator
  • [2021 | TC] A Deep Neural Network Training Architecture with Inference-aware Heterogeneous Data-type
  • [2021 | TCAS-I] Memory Access Optimization for On-Chip Transfer Learning
  • [2021 | TCAS-II] A 64.1mW Accurate Real-time Visual Object Tracking Processor with Spatial Early Stopping on Siamese Network
  • [2020 | IEEE Access] Training Hardware for Binarized Convolutional Neural Network Based on CMOS Invertible Logic [paper]
  • [2020 | JSSC] Evolver: A Deep Learning Processor With On-Device Quantization–Voltage–Frequency Tuning [paper]
  • [2020 | JSSC] DF-LNPU: A Pipelined Direct Feedback Alignment-Based Deep Neural Network Learning Processor for Fast Online Learning [paper]
  • [2020 | JSSC] An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices [paper]
  • [2020 | LSSC] PNPU: An Energy-Efficient Deep-Neural-Network Learning Processor With Stochastic Coarse–Fine Level Weight Pruning and Adaptive Input/Output/Weight Zero Skipping [paper]
  • [2020 | TETC] SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference [paper]
  • [2020 | DAC] SCA: A Secure CNN Accelerator for Both Training and Inference [paper]
  • [2020 | DAC] Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training [paper]
  • [2020 | DAC] SparseTrain: Exploiting Dataflow Sparsity for Efficient Convolutional Neural Networks Training [paper]
  • [2020 | DAC] A Pragmatic Approach to On-device Incremental Learning System with Selective Weight Updates [paper]
  • [2020 | ISLPED] SparTANN: sparse training accelerator for neural networks with threshold-based sparsification [paper]
  • [2020 | ISSCC] GANPU: A 135TFLOPS/W Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation [paper]
  • [2020 | MICRO] Procrustes: a Dataflow and Accelerator for Sparse Deep Neural Network Training [paper]
  • [2020 | MICRO] TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training [paper]
  • [2020 | HPCA] SIGMA: A Sparse and Irregular GEMM Accelerator with Flexible Interconnects for DNN Training [paper]
  • [2020 | VLSI] A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference [paper]
  • [2020 | VLSI] A 146.52 TOPS/W Deep-Neural-Network Learning Processor with Stochastic Coarse-Fine Pruning and Adaptive Input/Output/Weight Skipping [paper]
  • [2020 | arXiv] FPRaker: A Processing Element For Accelerating Neural Network Training [paper]
  • [2020 | ISCAS] TaxoNN: A Light-Weight Accelerator for Deep Neural Network Training [paper]
  • [2019 | LSSC] A 2.6 TOPS/W 16-bit Fixed-Point Convolutional Neural Network Learning Processor in 65nm CMOS [paper]
  • [2019 | LSSC] An Energy-Efficient Deep Reinforcement Learning Accelerator With Transposable PE Array and Experience Compression [paper]
  • [2019 | LSSC] An Energy-Efficient Sparse Deep-Neural-Network Learning Accelerator with Fine-grained Mixed Precision of FP8-FP16 [paper]
  • [2019 | TCAS-I] A Low-Power Deep Neural Network Online Learning Processor for Real-Time Object Tracking Application [paper]
  • [2019 | ASPDAC] TNPU: an efficient accelerator architecture for training convolutional neural networks [paper]
  • [2019 | ASSCC] A 2.25 TOPS/W Fully-Integrated Deep CNN Learning Processor with On-Chip Training [paper]
  • [2019 | DAC] Acceleration of DNN Backward Propagation by Selective Computation of Gradients [paper]
  • [2019 | DAC] An Optimized Design Technique of Low-bit Neural Network Training for Personalization on IoT Devices [paper]
  • [2019 | ISSCC] LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16 [paper]
  • [2019 | SysML] Mini-batch Serialization: CNN Training with Inter-layer Data Reuse [paper]
  • [2019 | VLSI] A 1.32 TOPS/W Energy Efficient Deep Neural Network Learning Processor with Direct Feedback Alignment based Heterogeneous Core Architecture [paper]
  • [2018 | LSSC] A Scalable Multi-TeraOPS Core for AI Training and Inference [paper]
  • [2018 | VLSI] A Scalable Multi-TeraOPS Deep Learning Processor Core for AI Training and Inference [paper]
  • [2017 | DAC] Design of an Energy-Efficient Accelerator for Training of Convolutional Neural Networks using Frequency-Domain Computation [paper]
  • [2017 | ISCA] SCALEDEEP: A scalable compute architecture for learning and evaluating deep networks [paper]
  • [2014 | MICRO] DaDianNao: A Machine-Learning Supercomputer [paper]

FPGA

  • [2022 | TNNLS] ETA: An Efficient Training Accelerator for DNNs Based on Hardware-Algorithm Co-Optimization
  • [2021 | ICS] Enabling Energy-Efficient DNN Training on Hybrid GPU-FPGA Accelerators [paper]
  • [2020 | TC] A Neural Network-Based On-Device Learning Anomaly Detector for Edge Devices [paper]
  • [2020 | ICCAD] FPGA-based low-batch training accelerator for modern CNNs featuring high bandwidth memory [paper]
  • [2020 | IJCAI] Efficient and Modularized Training on FPGA for Real-time Applications [paper]
  • [2020 | ISCAS] Training Progressively Binarizing Deep Networks Using FPGAs [paper]
  • [2020 | FPL] Dynamically Growing Neural Network Architecture for Lifelong Deep Learning on the Edge [paper]
  • [2019 | FPT] Training Deep Neural Networks in Low-Precision with High Accuracy Using FPGAs [paper]
  • [2019 | NEWCAS] Efficient Hardware Implementation of Incremental Learning and Inference on Chip [paper]
  • [2019 | FPL] FPGA-Based Training Accelerator Utilizing Sparseness of Convolutional Neural Network [paper]
  • [2019 | FPL] Automatic Compiler Based FPGA Accelerator for CNN Training [paper]
  • [2019 | FCCM] Towards Efficient Deep Neural Network Training by FPGA-Based Batch-Level Parallelism [paper]
  • [2019 | FPGA] Compressed CNN Training with FPGA-based Accelerator [paper]
  • [2018 | FCCM] FPDeep: Acceleration and Load Balancing of CNN Training on FPGA Clusters [paper]
  • [2018 | FPL] A Framework for Acceleration of CNN Training on Deeply-Pipelined FPGA Clusters with Work and Weight Load Balancing [paper]
  • [2018 | FPL] ClosNets: Batchless DNN Training with On-Chip a Priori Sparse Neural Topologies [paper]
  • [2018 | ReConFig] A Highly Parallel FPGA Implementation of Sparse Neural Network Training [paper]
  • [2018 | ISLPED] TrainWare: A Memory Optimized Weight Update Architecture for On-Device Convolutional Neural Network Training [paper]
  • [2017 | FPT] An FPGA-based processor for training convolutional neural networks [paper]
  • [2017 | FPT] FPGA-based training of convolutional neural networks with a reduced precision floating-point library [paper]
  • [2016 | ASAP] F-CNN: An FPGA-based framework for training Convolutional Neural Networks [paper]

PIM

  • [2021 | VLSI] CHIMERA: A 0.92 TOPS, 2.2 TOPS/W Edge AI Accelerator with 2 MByte On-Chip Foundry Resistive RAM for Efficient Training and Inference
  • [2021 | TC] AILC: Accelerate On-chip Incremental Learning with Compute-in-Memory Technology [paper]
  • [2021 | TC] PANTHER: A Programmable Architecture for Neural Network Training Harnessing Energy-efficient ReRAM [paper]
  • [2019 | TC] A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets [paper]
  • [2018 | TCAD] DeepTrain: A Programmable Embedded Platform for Training Deep Neural Networks [paper]
  • [2017 | HPCA] PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning [paper]
