Energy-Efficient Hardware Acceleration

Abstract

  • Machine Learning (ML) tasks are becoming pervasive in a broad range of applications, and in a broad range of systems
    (from embedded systems to data centers).
  • As computer architectures evolve toward heterogeneous multi-cores composed of a mix of cores and hardware accelerators, designing hardware accelerators for ML techniques can simultaneously achieve high efficiency and broad application scope.

Paragraph 2

  • While efficient computational primitives are important for a hardware accelerator, inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators (an Amdahl's law effect); thus, memory transfers should become a first-order concern, just as in processors, rather than an element factored into accelerator design as a second step.

  • In this article, we introduce a series of hardware accelerators (i.e., the DianNao family) designed for ML (especially neural networks), with a special emphasis on the impact of memory on accelerator design, performance, and energy.

  • We show that, on a number of representative neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip DaDianNao system (a member of the DianNao family).

1 INTRODUCTION

  • As architectures evolve towards heterogeneous multi-cores composed of a mix of cores and accelerators, designing hardware accelerators which realize the best possible tradeoff between flexibility and efficiency is becoming a prominent
    issue.
  • The first question is: for which category of applications should one primarily design accelerators?
  • Together with the architecture trend towards accelerators, a second simultaneous and significant trend is developing in high-performance and embedded applications: many of the emerging high-performance and embedded applications, from image/video/audio recognition to automatic translation, business analytics, and robotics, rely on machine learning techniques.
  • This trend in applications comes together with a third trend in machine learning (ML), where a small number of techniques based on neural networks (especially deep learning techniques 16, 26) have proved in the past few years to be state-of-the-art across a broad range of applications.
  • As a result, there is a unique opportunity to design accelerators having significant application scope as well as
    high performance and efficiency. 4

Paragraph 2

  • Currently, ML workloads are mostly executed on multicores using SIMD, 44 on GPUs, 7 or on FPGAs. 2
  • However, the aforementioned trends have already been identified by researchers who have proposed accelerators implementing, for example, Convolutional Neural Networks (CNNs) 2 or Multi-Layer Perceptrons. 43
  • Accelerators focusing on other domains, such as image processing, also propose efficient implementations of some of the computational primitives used by machine-learning techniques, such as convolutions. 37
  • There are also ASIC implementations of ML techniques, such as Support Vector Machines and CNNs.
  • However, all these works have first, and successfully, focused on efficiently implementing the computational primitives, but they either voluntarily ignore memory transfers for the sake of simplicity, 37, 43 or they directly plug their computational accelerator into memory via a more or less sophisticated DMA. 2, 12, 19

Paragraph 3

  • While efficient implementation of computational primitives is a first and important step with promising results, inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators (an Amdahl's law effect); thus, memory transfers should become a first-order concern, just as in processors, rather than an element factored into accelerator design as a second step.

  • Unlike in processors, though, one can factor in the specific nature of memory transfers in target algorithms, just as is done for accelerating computations.

  • This is especially important in the domain of ML where there is a clear trend towards scaling up the size of learning models in order to achieve better accuracy and more functionality. 16, 24

Paragraph 4

  • In this article, we introduce a series of hardware accelerators designed for ML (especially neural networks), including
    DianNao, DaDianNao, ShiDianNao, and PuDianNao as listed in Table 1.
  • We focus our study on memory usage, and we investigate the accelerator architecture to minimize memory
    transfers and to perform them as efficiently as possible.

2 DIANNAO: A NEURAL NETWORK ACCELERATOR

  • Neural network techniques have proved in the past few years to be state-of-the-art across a broad range of applications.
  • DianNao is the first member of the DianNao accelerator family; it accommodates state-of-the-art neural network techniques (e.g., deep learning) and inherits the broad application scope of neural networks.

2.1 Architecture

  • DianNao has the following components:
    • an input buffer for input neurons (NBin),
    • an output buffer for output neurons (NBout),
    • and a third buffer for synaptic weights (SB),
    • connected to a computational block (performing both synapse and neuron computations),
    • which we call the Neural Functional Unit (NFU), and the control logic (CP); see Figure 1 and the structural sketch after this list.
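As an aid to reading, the following minimal C++ sketch mirrors the components listed above. The buffer depths, the use of float (the fabricated design uses 16-bit fixed-point arithmetic), and the tile sizes Tn/Ti are illustrative assumptions rather than DianNao's exact parameters.

```cpp
// Minimal structural sketch of the DianNao components listed above.
// Buffer depths, the float data type, and Tn/Ti are illustrative assumptions.
#include <vector>
#include <cstddef>

constexpr std::size_t Tn = 16;  // output neurons processed per NFU step (assumed)
constexpr std::size_t Ti = 16;  // inputs/synapses per output neuron per step (assumed)

struct DianNao {
    std::vector<float> NBin;   // input-neuron buffer
    std::vector<float> NBout;  // output-neuron buffer (also holds partial sums)
    std::vector<float> SB;     // synaptic-weight buffer

    explicit DianNao(std::size_t depth = 64)
        : NBin(depth * Ti), NBout(depth * Tn), SB(depth * Tn * Ti) {}

    // One NFU step for a classifier/convolutional layer:
    // stage 1 (Tn*Ti multipliers) and stage 2 (Tn adder trees); stage 3,
    // the nonlinearity, is applied once an output tile is fully accumulated.
    void nfu_step(const float* in, const float* w, float* partial) const {
        for (std::size_t n = 0; n < Tn; ++n) {
            float acc = 0.0f;
            for (std::size_t i = 0; i < Ti; ++i)
                acc += in[i] * w[n * Ti + i];
            partial[n] += acc;
        }
    }
};
```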


Neural Functional Unit (NFU).

  • The NFU implements a functional block of $T_i$ inputs/synapses and $T_n$ output neurons,
    • which can be time-shared by different algorithmic blocks of neurons.
  • Depending on the layer type, computations at the NFU can be decomposed in either two or three stages.
  • For classifier and convolutional layers: multiplication of synapses $\times$ inputs, additions of all multiplications, sigmoid.
  • The nature of the last stage (sigmoid or another nonlinear function) can vary.
  • For pooling layers, there is no multiplication (no synapses), and the pooling operation can be average or max.
  • Note that the adders have multiple inputs,
    • they are in fact adder trees,
    • see Figure 1;
    • the second stage also contains shifters and max operators for pooling layers.
  • In the NFU, the sigmoid function (for classifier and convolutional layers) can be efficiently implemented using piecewise linear interpolation ($f(x) = a_i \times x + b_i$ for $x \in [x_i, x_{i+1}]$) with negligible loss of accuracy (16 segments are sufficient); 22 a code sketch of this interpolation follows the list.
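To make the interpolation concrete, here is a small C++ sketch. The 16 segments come from the text above; the clamped input range [-8, 8] and the way the (a_i, b_i) coefficients are generated are assumptions, and in hardware the coefficients would simply sit in small on-chip tables.

```cpp
// Piecewise linear approximation of the sigmoid used in the NFU's last stage:
// f(x) ~= a_i * x + b_i for x in [x_i, x_{i+1}], with 16 segments.
// The range [-8, 8] and coefficient generation are illustrative assumptions.
#include <array>
#include <cmath>

constexpr int   kSegments = 16;
constexpr float kLo = -8.0f, kHi = 8.0f;

struct Segment { float a, b; };  // f(x) ~= a*x + b on this interval

// Precompute one (a_i, b_i) pair per segment from the exact sigmoid endpoints.
std::array<Segment, kSegments> build_table() {
    std::array<Segment, kSegments> t{};
    const float step = (kHi - kLo) / kSegments;
    for (int i = 0; i < kSegments; ++i) {
        float x0 = kLo + i * step, x1 = x0 + step;
        float y0 = 1.0f / (1.0f + std::exp(-x0));
        float y1 = 1.0f / (1.0f + std::exp(-x1));
        t[i].a = (y1 - y0) / (x1 - x0);  // slope a_i
        t[i].b = y0 - t[i].a * x0;       // intercept b_i
    }
    return t;
}

float sigmoid_pwl(float x, const std::array<Segment, kSegments>& t) {
    if (x <= kLo) return 0.0f;  // saturate below the table range
    if (x >= kHi) return 1.0f;  // saturate above the table range
    int i = static_cast<int>((x - kLo) / ((kHi - kLo) / kSegments));
    return t[i].a * x + t[i].b;
}
```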

On-chip Storage

  • The on-chip storage structures of DianNao can be construed as modified buffers or scratchpads.
  • While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative read, etc.) and cache conflicts.
  • The efficient alternative, scratchpad, is used in VLIW processors but it is known to be very difficult to compile for.
  • However, a scratchpad in a dedicated accelerator realizes the best of both worlds: efficient storage, and both efficient and easy exploitation of locality, because only a few algorithms have to be manually adapted.

Paragraph 2

  • We split on-chip storage into three structures (NBin, NBout, and SB) because there are three types of data (input neurons, output neurons, and synapses) with different characteristics (e.g., read width and reuse distance).
  • The first benefit of splitting storage structures is to tailor the SRAMs to the appropriate read/write width,
  • and the second benefit is to avoid conflicts, as would occur in a cache.
  • Moreover, we implement three DMAs to exploit spatial locality of data, one for each buffer (two load DMAs for inputs, one store DMA for outputs); see the tiled-loop sketch below.
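Below is a hypothetical, software-only rendering of how a fully connected (classifier) layer streams through the three buffers. The loop nest is an assumption meant to illustrate the differing access patterns (a $T_i$-wide input tile in NBin is reused against $T_n$ rows of weights from SB, while $T_n$ partial sums stay in NBout), anticipating the loop tiling of Section 2.2 rather than reproducing the accelerator's actual control sequence.

```cpp
// Illustrative (assumed) tiled loop for a fully connected ("classifier") layer,
// showing why the three buffers see different traffic.
#include <vector>
#include <cstddef>

void classifier_layer(const std::vector<float>& input,    // Ni input neurons
                      const std::vector<float>& weights,  // No x Ni synapses
                      std::vector<float>& output,         // No output neurons
                      std::size_t Ni, std::size_t No,
                      std::size_t Ti, std::size_t Tn) {
    for (std::size_t n0 = 0; n0 < No; n0 += Tn) {              // one NBout tile
        for (std::size_t n = n0; n < n0 + Tn && n < No; ++n)
            output[n] = 0.0f;                                   // partial sums live in NBout
        for (std::size_t i0 = 0; i0 < Ni; i0 += Ti) {           // one NBin tile, loaded once
            for (std::size_t n = n0; n < n0 + Tn && n < No; ++n)        // ...and reused Tn times
                for (std::size_t i = i0; i < i0 + Ti && i < Ni; ++i)
                    output[n] += weights[n * Ni + i] * input[i];        // SB weights streamed once
        }
        // Here the nonlinearity (e.g., the piecewise linear sigmoid sketched
        // earlier) would be applied before writing the NBout tile back to memory.
    }
}
```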

2.2 Loop tiling