Ultra-Low-Precision Training of Deep Neural Networks



Over the past few years, reduced precision techniques have proven exceptionally effective in accelerating deep learning training and inference applications. IBM Research has played a leadership role in developing reduced precision technologies and pioneered a number of key breakthroughs, including the first 16-bit reduced-precision systems for deep learning training (presented at ICML 2015), the first 8-bit training techniques (presented recently at NeurIPS 2018), and state-of-the-art 2-bit inference results (published at SysML 2019). Following this line of work, we now introduce a new breakthrough which solves a long-ignored, yet important problem in reduced-precision deep learning: accumulation bit-width scaling for ultra-low-precision training of deep neural networks (DNNs).

The most commonly used arithmetic function in deep learning is the dot product, which is the building block of generalized matrix multiplication (GEMM) and convolution computations. A dot product is computed using multiply-accumulate (MAC) operations and thus requires two floating-point computations: multiplication of two numbers and accumulation of the product into partial sums. Today, much of the effort on reduced-precision deep learning focuses solely on quantizing representations, i.e. the input operands to the multiplication operation. The other major portion of the dot product computation, i.e. the partial sum accumulation, has always been kept in full (32-bit) precision. The reason is that reduced-precision accumulation can result in severe training instability and degradation in model accuracy, as shown in Fig. 1a. This is especially unfortunate, since the area and power of the hardware are dominated by the accumulator bit-width as precisions are aggressively reduced. As shown in Fig. 1b, accumulating in high precision severely limits the hardware benefits of reduced-precision data representations and computation. The absence of any framework for analyzing the precision requirements of partial sum accumulations inevitably results in very conservative design choices.
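To make the distinction concrete, the following minimal sketch (illustrative only, not IBM's implementation) computes a dot product as a chain of MAC operations, once with the conventional float32 partial sum and once with a reduced-precision float16 partial sum. The vector length, random data, and NumPy types are assumptions for the example.

```python
# Illustrative sketch: a dot product as a chain of multiply-accumulate (MAC)
# operations, with the partial sum held either in float32 (current practice)
# or in float16 (reduced-precision accumulation). Data and sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
n = 4096
a = rng.standard_normal(n).astype(np.float16)  # quantized multiplication operands
b = rng.standard_normal(n).astype(np.float16)

def dot_mac(x, y, acc_dtype):
    """Dot product computed one MAC at a time, accumulating in `acc_dtype`."""
    acc = acc_dtype(0)
    for xi, yi in zip(x, y):
        acc = acc_dtype(acc + acc_dtype(xi) * acc_dtype(yi))
    return acc

ref = dot_mac(a, b, np.float32)  # full-precision partial sums
low = dot_mac(a, b, np.float16)  # reduced-precision partial sums
print(ref, low, abs(np.float32(low) - ref))  # the gap grows with n
```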

[Figure 1: network1.jpg]

Figure 1. The importance of accumulation precision. (a) Convergence curves of an ImageNet ResNet18 experiment using reduced-precision accumulation. The current practice is to keep the accumulation in full precision to avoid such divergence. (b) Estimated area benefits when reducing the precision of a floating-point unit (FPU). The notation FPa/b denotes an FPU whose multiplier and adder use a and b bits, respectively. Our work enables convergence with reduced-precision accumulation and gains an extra 1.5–2.2× area reduction.

In our paper published at ICLR 2019, we take a step forward and present a comprehensive statistical model to analyze the impact of reduced-precision accumulation in deep learning training. We observed and learned two critical insights. First, when we accumulate a dot product in reduced precision, the loss of information is primarily due to so-called "swamping" error. In floating-point arithmetic, swamping occurs when a large number is added to a small number: the small number is completely or partially truncated out of the addition. Second, swamping harms the statistics (i.e. the variance) of a dot product. To ensure stable convergence of DNNs, it is necessary to preserve the variance of dot products under reduced precision.
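A tiny float16 example (with assumed values, not taken from the paper) makes swamping visible: once the running sum is large, small addends fall below its unit in the last place and are rounded away.

```python
# Swamping in float16: adding a small number to a large one loses the small
# number entirely, so a long accumulation of small terms eventually stalls.
import numpy as np

big, small = np.float16(2048.0), np.float16(1.0)
print(big + small)              # 2048.0 -- the 1.0 is swamped completely

acc = np.float16(0.0)
for _ in range(4096):           # try to accumulate 4096 ones in float16
    acc = np.float16(acc + np.float16(1.0))
print(acc)                      # 2048.0, not 4096.0: the sum stops growing
```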

Using these insights, we derived a set of equations and introduced a new metric, the variance retention ratio (VRR) of a reduced-precision accumulation, in the context of the three deep learning GEMM functions. The VRR is a function only of the accumulation length and the number of accumulation bits, and it can be computed without any simulation (Fig. 2). The VRR can be used to assess the suitability, or lack thereof, of a precision configuration and allows us to determine accumulation bit-widths for precise tailoring of deep learning computation hardware. From the VRR calculations in Fig. 2, it can easily be seen that chunk-based accumulation for typical deep learning computations preserves accuracy down to 9 bits of accumulation precision, macc (which corresponds to fp16 accumulation), while traditional non-chunked addition needs a much higher macc (of up to 15 bits, corresponding to fp32 accumulation). This reduced accumulation bit-width requirement translates directly to a 1.5–2.2× improvement in hardware energy efficiency (as indicated in Fig. 1).
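The VRR equations themselves are given in the paper; the sketch below only illustrates the chunk-based (two-level) accumulation referenced in Fig. 2b, assuming float16 partial sums and a chunk size of 64. It is an illustration of the idea, not the paper's hardware design: because each level of the two-level sum adds operands of comparable magnitude, swamping, and hence the loss of variance, is reduced.

```python
# Illustrative chunk-based (two-level) accumulation in float16 with an
# assumed chunk size of 64. Intra-chunk partial sums stay small, and the
# inter-chunk sum adds terms of similar magnitude, mitigating swamping.
import numpy as np

def chunked_sum_fp16(x, chunk_size=64):
    x = np.asarray(x, dtype=np.float16)
    chunk_sums = []
    for start in range(0, len(x), chunk_size):
        acc = np.float16(0.0)
        for v in x[start:start + chunk_size]:   # intra-chunk accumulation
            acc = np.float16(acc + v)
        chunk_sums.append(acc)
    total = np.float16(0.0)
    for s in chunk_sums:                        # inter-chunk accumulation
        total = np.float16(total + s)
    return total

x = np.full(4096, 1.0)
naive = np.float16(0.0)
for v in x.astype(np.float16):                  # non-chunked float16 sum
    naive = np.float16(naive + v)
print(naive)                   # 2048.0: swamping stalls the naive sum
print(chunked_sum_fp16(x))     # 4096.0: chunking recovers the correct sum
```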

[Figure 2: network2.jpg]

Figure 2. Normalized variance lost as a function of accumulation length for different values of the accumulation bit-width, macc, for (a) a normal accumulation (no chunking) and (b) a chunk-based accumulation (chunk size of 64). The "knees" in each plot correspond to the maximum accumulation length for a given precision, illustrating how the VRR is used to select a suitable precision.

Using this analysis, we successfully predicted and experimentally verified the minimum accumulation precisions required by the three GEMM functions across three popular benchmark networks (CIFAR-10 ResNet34, ImageNet ResNet18, and ImageNet AlexNet). Our results show that our method accurately pinpoints the minimum precision needed for these networks to converge to the full-precision baseline. On the practical side, this analysis is a useful tool for hardware designers implementing reduced-precision processing hardware. We believe this work addresses a critical missing link on the path to ultra-low-precision hardware for DNN training.



