Author: Dongho Ha

Machine Learning Algorithms

[>] In general, modern ML algorithms can be divided into two categories: artificial neural networks (ANNs), where data is represented as numerical values [9], and spiking neural networks (SNNs), where data is represented by spikes [10].

A DNN is a parameterized function that takes a high-dimensional input and produces a useful prediction, such as a classification label. This prediction process is called inference. To obtain a meaningful set of parameters, the DNN is trained on a training dataset, and the parameters are optimized via approaches such as stochastic gradient descent (SGD) to minimize a certain loss function.
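For reference, a single SGD step updates the parameters $\theta$ against the loss $L$ with a learning rate $\eta$ (a standard formulation; these symbols are not defined elsewhere in this section):

$$ \theta \leftarrow \theta - \eta \nabla_{\theta} L(\theta) $$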

A DNN is typically a stack of $l$ NN layers composed as follows:

$$ f(x)=f_{l}\circ f_{l-1}\circ \cdots \circ f_{2}\circ f_{1} (x) $$
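As a minimal sketch of this composition (assuming fully connected layers with a ReLU nonlinearity; the sizes and names are illustrative, not taken from the text above):

```python
import numpy as np

def fc_layer(x, weight, bias):
    # One NN layer f_i: affine transform followed by a ReLU nonlinearity.
    return np.maximum(weight @ x + bias, 0.0)

def dnn_forward(x, params):
    # f(x) = f_l(f_{l-1}(... f_1(x))): apply the layers in order.
    for weight, bias in params:
        x = fc_layer(x, weight, bias)
    return x

# Illustrative 3-layer network with random parameters (sizes are arbitrary).
rng = np.random.default_rng(0)
params = [(rng.standard_normal((16, 8)), np.zeros(16)),
          (rng.standard_normal((16, 16)), np.zeros(16)),
          (rng.standard_normal((4, 16)), np.zeros(4))]
output = dnn_forward(rng.standard_normal(8), params)
print(output.shape)  # (4,)
```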

Computation Patterns

[>] If we use $I_c$, $O_c$, and $B$ to denote the number of input channels, the number of output channels, and the batch size, respectively, the matrix multiplication in a fully connected layer can be written as follows:

$$ output_{b,o_c} = \displaystyle\sum_{i_c = 0}^{I_c - 1}{input_{b,i_c}} \times weight_{i_c ,o_c} $$
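A direct loop-nest rendering of this equation, assuming NumPy arrays with the input shaped $[B, I_c]$ and the weight shaped $[I_c, O_c]$ (the function and variable names below are illustrative):

```python
import numpy as np

def fc_forward(inp, weight):
    # output[b, oc] = sum over ic of input[b, ic] * weight[ic, oc]
    B, I_c = inp.shape
    _, O_c = weight.shape
    out = np.zeros((B, O_c))
    for b in range(B):
        for oc in range(O_c):
            for ic in range(I_c):
                out[b, oc] += inp[b, ic] * weight[ic, oc]
    return out

x = np.random.rand(2, 3)      # B = 2, I_c = 3
w = np.random.rand(3, 4)      # I_c = 3, O_c = 4
assert np.allclose(fc_forward(x, w), x @ w)  # matches a plain matrix multiplication
```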


Convolutions in DNNs can be viewed as an extended version of matrix multiplications that adds the properties of local connectivity and translation invariance. The formal description is shown below:

$$ output_{b,o_c ,x,y} = \displaystyle\sum_{i_c = 0}^{I_c - 1}\sum_{i = 0}^{F_h - 1}\sum_{j = 0}^{F_w - 1}{input_{b,i_c ,x+i,y+j}} \times filter_{o_c ,i_c ,i,j} $$

where $F_h$ is the height of the filter, $F_w$ is the width of the filter, $i$ is the row index in a 2D filter, $j$ is the column index in a 2D filter, $x$ is the row index in a 2D feature map, and $y$ is the column index in a 2D feature map.
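The same summation as a naive loop nest, assuming an input laid out as $[B, I_c, H, W]$ and filters as $[O_c, I_c, F_h, F_w]$, with stride 1 and no padding (these layout choices are assumptions made here for illustration):

```python
import numpy as np

def conv2d(inp, flt):
    # output[b, oc, x, y] = sum over ic, i, j of input[b, ic, x+i, y+j] * filter[oc, ic, i, j]
    B, I_c, H, W = inp.shape
    O_c, _, F_h, F_w = flt.shape
    out = np.zeros((B, O_c, H - F_h + 1, W - F_w + 1))
    for b in range(B):
        for oc in range(O_c):
            for x in range(H - F_h + 1):
                for y in range(W - F_w + 1):
                    # One filter applied to one input window across all input channels.
                    out[b, oc, x, y] = np.sum(inp[b, :, x:x + F_h, y:y + F_w] * flt[oc])
    return out

out = conv2d(np.random.rand(1, 3, 8, 8), np.random.rand(4, 3, 3, 3))
print(out.shape)  # (1, 4, 6, 6)
```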


To provide translation invariance, the same convolutional filter is applied repeatedly to all parts of the input feature map, making the data reuse pattern in convolutions much more complex than in matrix multiplications. It is worth noting that, although the computation patterns of matrix multiplications and convolutions are very different, they can be converted to each other.
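One common direction of this conversion is im2col lowering, which unrolls the convolution's input windows into the rows of a matrix so that the convolution itself becomes a single matrix multiplication. The sketch below is a minimal, unoptimized illustration and reuses the layout assumptions of the loop-nest example above:

```python
import numpy as np

def conv2d_im2col(inp, flt):
    # Lower a stride-1, unpadded convolution to a single matrix multiplication.
    B, I_c, H, W = inp.shape
    O_c, _, F_h, F_w = flt.shape
    out_h, out_w = H - F_h + 1, W - F_w + 1
    # im2col: each row is one flattened input window of size I_c * F_h * F_w.
    cols = np.stack([inp[b, :, x:x + F_h, y:y + F_w].ravel()
                     for b in range(B) for x in range(out_h) for y in range(out_w)])
    # Each column of the weight matrix is one flattened filter.
    w_mat = flt.reshape(O_c, -1).T
    # (B*out_h*out_w, I_c*F_h*F_w) @ (I_c*F_h*F_w, O_c) computes all output pixels at once.
    out = cols @ w_mat
    return out.reshape(B, out_h, out_w, O_c).transpose(0, 3, 1, 2)

out = conv2d_im2col(np.random.rand(1, 3, 8, 8), np.random.rand(4, 3, 3, 3))
print(out.shape)  # (1, 4, 6, 6)
```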

Problems of ML Inference and Training

[>] With the end of Moore’s law and Dennard scaling [58], computing platforms face various challenges, such as the “power wall” [140] and the “memory wall” [127]. Unfortunately, traditional serial execution platforms cannot overcome these problems or meet the computing capability needs of deep learning [35, 58]. As a result, many studies run DNN applications on high-performance computing platforms rather than serial execution platforms, such as GPU, FPGA, and ASIC platforms [26, 27, 59, 143]. However, the quality of the hardware design directly determines the performance and efficiency of an FPGA or ASIC platform for DNN applications [35].

[>] Recent AI-specific computing systems—that is, AI accelerators [20, 21, 22, 23, 55, 56, 57]—are often constructed with a large number of highly parallel computing and storage units. These units are organized in a two-dimensional (2D) way to support common matrix–vector multiplications in NNs. Network-on-chip (NoC) [13], high bandwidth memory (HBM) [14], data reuse [15], and so forth are applied to further optimize the data traffic in these accelerators.
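As a rough, purely illustrative sketch (not the design of any cited accelerator), the snippet below models an output-stationary 2D grid of processing elements in which each PE accumulates one output element of a matrix multiplication over a sequence of time steps:

```python
import numpy as np

def pe_array_matmul(a, b):
    # Toy output-stationary PE grid: PE (r, c) holds the partial sum for out[r, c]
    # and, at each of K time steps, consumes one a[r, k] and one b[k, c] operand.
    M, K = a.shape
    _, N = b.shape
    acc = np.zeros((M, N))          # one accumulator register per PE
    for k in range(K):              # time steps: operands streamed into the grid
        for r in range(M):
            for c in range(N):
                acc[r, c] += a[r, k] * b[k, c]   # multiply-accumulate in PE (r, c)
    return acc

a = np.random.rand(4, 3)
b = np.random.rand(3, 5)
assert np.allclose(pe_array_matmul(a, b), a @ b)
```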

In DNN training, the data dependency chain is twice as deep as in inference. Although the dataflow of the forward pass is the same as in inference, the backward pass then executes the layers in reverse order. Moreover, the outputs of each layer in the forward pass are reused in the backward pass to calculate the errors (because of the chain rule of back-propagation), resulting in many long data dependencies.
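A minimal sketch of this dependency, assuming a small fully connected network with ReLU layers and a squared-error loss (all names and sizes are illustrative): the forward pass stores every activation, and the backward pass walks the layers in reverse while reusing those stored activations.

```python
import numpy as np

def forward(x, weights):
    # Forward pass: keep every layer input/output, because the backward pass will need them.
    activations = [x]
    for w in weights:
        x = np.maximum(w @ x, 0.0)   # ReLU layer
        activations.append(x)
    return activations

def backward(activations, weights, grad_out):
    # Backward pass: visit layers in reverse order, reusing the stored forward activations.
    grads = []
    for w, a_in, a_out in zip(reversed(weights), reversed(activations[:-1]),
                              reversed(activations[1:])):
        grad_pre = grad_out * (a_out > 0)        # ReLU gradient needs the forward output
        grads.append(np.outer(grad_pre, a_in))   # dL/dW needs the forward input
        grad_out = w.T @ grad_pre                # error propagated to the previous layer
    return list(reversed(grads))

rng = np.random.default_rng(0)
weights = [rng.standard_normal((8, 4)), rng.standard_normal((2, 8))]
acts = forward(rng.standard_normal(4), weights)
w_grads = backward(acts, weights, acts[-1] - np.ones(2))  # squared-error gradient at the output
print([g.shape for g in w_grads])  # [(8, 4), (2, 8)]
```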

Basic NPU Architecture