
Computer vision algorithms and hardware implementations: A survey


Xin Feng (a,b), Youni Jiang (a), Xuejiao Yang (a), Ming Du (c), Xin Li (b)

(a) Computer Science and Engineering Department, Chongqing University of Technology, Chongqing, 400054, China
(b) Data Science Research Center, Duke Kunshan University, Kunshan, Jiangsu, 215316, China
(c) Donghua University, Shanghai, 200051, China

Keywords: Computer vision; Hardware accelerator; Deep convolutional neural network; Artificial intelligence

Abstract
The field of computer vision is experiencing a great-leap-forward development today. This paper aims at providing a comprehensive survey of the recent progress on computer vision algorithms and their corresponding hardware implementations. In particular, the prominent achievements in computer vision tasks such as image classification, object detection and image segmentation brought by deep learning techniques are highlighted. On the other hand, a review of techniques for implementing and optimizing deep-learning-based computer vision algorithms on GPUs, FPGAs and other new generations of hardware accelerators is presented, to facilitate real-time and/or energy-efficient operations. Finally, several promising directions for future research are presented to motivate further development in the field.

1. Introduction

The recent progress of scientific technologies is producing a "Cambrian explosion" [1] of new techniques that is leading the world promptly into the new artificial intelligence (AI) era. Computer systems empowered by the new AI techniques can perceive and understand the visual world, and are even smarter than humans in a number of specific tasks. This ability is primarily provided by a computer vision system, including both the algorithms and their hardware implementations, which allows us to teach a computer to understand the physical world from vision.

Computer vision tasks seek to enable computer systems to automatically see, identify and understand the visual world, simulating the same way that human vision does [2]. Researchers in computer vision have aspired to develop algorithms for such visual perception tasks, including (i) object recognition, to determine whether image data contains a specific object; (ii) object detection, to localize instances of semantic objects of a given class; and (iii) scene understanding, to parse an image into meaningful segments for analysis. Given the broad range of mathematics being covered and the intrinsically difficult nature of recovering unknowns from information insufficient to fully specify the solution, the aforementioned tasks in the computer vision field are extremely challenging. Studying these problems is both theoretically and practically important.

Early efforts made a great contribution to the philosophy of human vision and the basic computational theory of computer vision by exploiting well-designed features and feature descriptors combined with classical machine learning methods [3,4]. Although researchers have spent several decades teaching machines how to see, the most advanced machine at that time could only perceive common objects and struggled at recognizing the many natural objects with infinite shape variations, much like a toddler [5]. Fortunately, researchers have believed that computer systems can go beyond regular object recognition and learn to reveal details and insights of the visual world by training them to see trillions of images and videos generated from the Internet. To nourish the computer brain, the largest image classification dataset, "ImageNet" [6], containing 15 million images across 22,000 classes of objects, was created, upon which the well-known "deep learning" technology has demonstrated its overwhelming superiority over traditional computer vision algorithms that treat objects as a collection of shape and color features.

Deep learning is a particular class of machine learning algorithms, which typically simplifies the process of feature extraction and description through a multi-layer convolutional neural network (CNN). A CNN aims to transform the high-dimension input image into a low-dimension yet highly-abstracted semantic output. Powered by the massive data


from ImageNet and the modern central processing units (CPUs) and graphics processing units (GPUs), methods based on deep neural networks (DNNs) achieve the state-of-the-art performance and bring an unprecedented development of computer vision in both algorithms and hardware implementations. In recent years, the CNN has become the de-facto standard computation framework in computer vision. A number of deeper and more complicated networks have been developed to make CNNs deliver near-human accuracy in many computer vision applications, such as classification, detection and segmentation. The high accuracy, however, comes at the price of large computational cost. As a result, dedicated hardware platforms, from general-purpose GPUs to application-specific processors, are investigated to optimize for DNN-based workloads.

In this paper, we look into this rapid evolution of the computer vision field by presenting a brief survey of the key algorithms that make computer systems perceive and the underlying hardware platforms that make these algorithms applicable. In particular, we discuss how the recent DNN algorithms accomplish the computer vision tasks (i.e. image classification, object detection and image segmentation) with high perception accuracy, and summarize the notable hardware units, including GPUs, field-programmable gate arrays (FPGAs) and other advanced mobile hardware platforms, that are adapted or designed to accelerate DNN-based computer vision algorithms. To our knowledge, there are recent summaries in the literature that discuss DNN-based algorithms for particular tasks, including image classification [7], object detection [8] and image segmentation [9], as well as the corresponding hardware accelerators such as FPGAs [10], but there is no comprehensive survey that covers both algorithms and hardware simultaneously. A thorough review of existing works on both topics is essential for researchers to understand the entire picture and motivate further progress in the computer vision field.

The remainder of this paper is organized as follows. In Section 2, we overview the computer vision algorithms for three visual perception tasks: image classification, object detection and image segmentation. Important hardware platforms, including GPUs, FPGAs and other hardware accelerators for implementing the DNN-based algorithms, are discussed in Section 3. Finally, we conclude in Section 4.

2. Computer vision algorithms

2.1. Image classification

Image classification is a biologically primary ability of the human visual perception system. It has been an active task and plays a crucial role in the field of computer vision; it aims to automatically classify images into pre-defined classes. For decades, researchers have laid the path in developing advanced techniques to improve classification accuracy. Traditionally, classification models could perform well only on small datasets such as CIFAR-10 [11] and MNIST [12]. The great-leap-forward development of image classification occurred when the large-scale image dataset "ImageNet" was created by Fei-Fei Li in 2009 [6]. It was almost the same time when the well-known deep learning technologies started to show great performance in classification and stepped onto the stage of computer vision.

Before the explosion of deep learning methods, research works put lots of effort into designing scale-invariant features (e.g. SIFT [13], HOG [14], GIST [15]), feature representations (e.g. Bag-of-Features [16], Fisher Kernel [17]) and classifiers (e.g. SVM [18]) for image classification [19,20]. However, these manually crafted features struggle with objects in natural images with complicated backgrounds, variant color, texture, illumination and ever-changing poses and view factors. At the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, AlexNet [21] won the first prize by a significant margin over the second place, which was based on SIFT and Fisher Vectors (FVs) [20]. It demonstrates that a classification model based on a deep CNN performs much more robustly than other conventional methods in the presence of large-scale variations. It also represents a remarkable milestone in the modern history of neural networks after a long trough period.

A typical deep CNN model consists of several convolution layers, each followed by an activation function and a pooling layer, and several fully connected layers before the final prediction. Its deep structure facilitates filtering mechanisms by performing convolutions on multi-scale feature maps, leading to highly abstract and discriminative features.

AlexNet has 5 convolution layers, 3 pooling layers and 3 fully connected layers, with a total of 60 million parameters. It successfully uses ReLU as the activation function instead of sigmoid. Furthermore, the data augmentation and dropout strategies it adopted are widely used today as efficient learning techniques. AlexNet is hence known as the foundation work of modern deep CNNs.
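To make this structure concrete, the following minimal PyTorch sketch (illustrative layer sizes only, far smaller than AlexNet) wires together the convolution, activation, pooling and fully-connected stages just described:

```python
import torch
import torch.nn as nn

# A minimal conv -> ReLU -> pool -> fully-connected pipeline, for
# illustration only; layer sizes are arbitrary, not AlexNet's.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution
            nn.ReLU(inplace=True),                       # activation
            nn.MaxPool2d(2),                             # pooling halves H and W
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # fully connected

    def forward(self, x):                # x: (N, 3, 32, 32)
        x = self.features(x)             # -> (N, 32, 8, 8)
        x = torch.flatten(x, 1)          # -> (N, 32*8*8)
        return self.classifier(x)        # -> (N, num_classes) class scores

logits = TinyCNN()(torch.randn(1, 3, 32, 32))
print(logits.shape)  # torch.Size([1, 10])
```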
Inspired by AlexNet, VGGNet [22] and GoogleNet [23] focus on designing deeper networks to further improve accuracy; they were the runner-up and winner of ILSVRC 2014, respectively. By repeatedly stacking 3×3 convolutional kernels and 2×2 maximum pooling layers, VGGNet successfully constructs a convolutional neural network of 16–19 layers. GoogleNet has 22 layers, but its floating-point operations and number of parameters are much less than those of AlexNet and VGGNet, thanks to removing the fully-connected layers and optimizing the operations on sparse matrices.

Although deeper networks offer better accuracy, simply increasing the number of layers cannot continuously improve accuracy, because gradient information vanishes or explodes during network training. ResNet [24], which makes another great advance in deep network structure, proposes a shortcut connection between residual blocks to make full use of information from previous layers and preserve the gradients during backward propagation. By using this residual block, ResNet successfully trains very deep networks with up to 152 layers and was the winner of ILSVRC 2015. Following the idea of ResNet, DenseNet [25] establishes connections between all previous layers and the current layer. It concatenates and, therefore, reuses the features from all previous layers. DenseNet presents a great advantage in classification accuracy on ImageNet, as we can see in Table 1. Based on these works in the literature, connecting different network layers has shown promising improvement in learning representations of deeper networks.
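The residual block can be summarized in a few lines. The sketch below follows the identity-shortcut idea described above (a simplified variant that assumes matching input/output shapes; the published ResNet also uses strided projection shortcuts):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x).

    The shortcut lets gradients flow directly to earlier layers,
    which is what allows very deep networks to be trained.
    """
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # identity shortcut: add the block's input back

y = ResidualBlock(64)(torch.randn(2, 64, 56, 56))
print(y.shape)  # torch.Size([2, 64, 56, 56])
```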
By using ResNet or DenseNet as the major backbone structure, researchers focus on improving the functionality of neural network blocks. SENet [26], which was the winner of ILSVRC 2017, proposes a "squeeze-and-excitation" (SE) unit that takes channel relationships into account. It learns to recalibrate channel-wise feature maps by explicitly modeling the interdependencies among channels, which is consequently exploited to enhance informative channels and suppress useless ones.
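A compact sketch of the SE unit as described above (the reduction ratio of 16 is a typical choice, not a requirement):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight channels by learned importance.

    Global average pooling 'squeezes' each channel to one number, a
    small two-layer bottleneck 'excites' per-channel scales in (0, 1),
    and the input feature map is rescaled channel-wise.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                  # x: (N, C, H, W)
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))              # squeeze: (N, C)
        w = self.fc(s).view(n, c, 1, 1)     # excitation: per-channel weights
        return x * w                        # recalibrated feature map

out = SEBlock(64)(torch.randn(2, 64, 28, 28))
print(out.shape)  # torch.Size([2, 64, 28, 28])
```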
Table 1
Summary of different CNN models on the ImageNet classification task.

Model            Year  Accuracy  Parameters  FLOPs     Layers
AlexNet [21]     2012  57.2%     60 M        720 M     8
VGGNet [22]      2014  71.5%     138 M       15,300 M  16
GoogleNet [23]   2014  69.8%     6.8 M       1,500 M   22
ResNet [24]      2015  78.6%     55 M        2,300 M   152
DenseNet [25]    2017  79.2%     25.6 M      1,150 M   190
SENet [26]       2017  82.7%     145.8 M     42,300 M  –
NASNet [27]      2018  82.7%     88.9 M      23,800 M  –
SqueezeNet [29]  2016  57.5%     1.2 M       833 M     –
MobileNet [30]   2017  70.6%     4.2 M       569 M     28
ShuffleNet [31]  2018  73.7%     4.7 M       524 M     –
ShiftNet-A [32]  2018  70.1%     4.1 M       1,400 M   –
FE-Net [33]      2019  75.0%     5.9 M       563 M     –

Despite the high classification performance of the aforementioned CNN models, appropriately designing the optimal network structure often requires significant engineering work. NASNet [27] studies a paradigm to learn the optimal convolutional architecture from training data. It adopts a neural architecture search (NAS) framework derived from reinforcement learning [28]. In addition, it designs a new search space to enable network mapping from a proxy dataset (e.g. CIFAR-10) to ImageNet, and a regularization technique for generalization purposes. Offering less computational complexity than SENet, NASNet achieves the state-of-the-art accuracy on ImageNet, as shown in Table 1.

The aforementioned deep networks make classification more accurate. However, in many real-life classification applications, such as robotics, autonomous driving and smartphones, the classification task is highly constrained by the available computational resources. The problem thus becomes pursuing the optimal accuracy subject to a limited computational budget (i.e. memory and/or MFLOPs). Therefore, a set of lightweight networks, such as SqueezeNet [29], MobileNet [30], ShuffleNet [31], ShiftNet [32] and FE-Net [33], started a wave.

SqueezeNet substitutes most 3×3 filters with 1×1 filters and cuts down the number of input channels for the 3×3 filters to reduce network complexity. To maximize accuracy with a limited number of network parameters, it delays the down-sampling operation to avoid information loss in early layers. SqueezeNet is 50× smaller than AlexNet; combined with deep compression [34], it can even be reduced to 510× smaller than AlexNet. Depth-wise separable convolution is employed in MobileNet [30] to decompose the standard convolution into a depth-wise convolution and a point-wise convolution. Depth-wise convolution performs convolution on each input channel with one filter, while point-wise convolution combines those separate channels using 1×1 convolution. This design reduces both the computational complexity and the number of parameters.
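The savings are easy to verify by counting parameters. The sketch below (hypothetical channel sizes) contrasts a standard convolution with its depth-wise separable counterpart:

```python
import torch
import torch.nn as nn

c_in, c_out, k = 64, 128, 3

# Standard convolution: every output channel looks at every input channel.
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depth-wise separable convolution (MobileNet-style): one spatial filter
# per input channel, then a 1x1 point-wise convolution to mix channels.
depthwise = nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False)
pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)

def n_params(*mods):
    return sum(p.numel() for m in mods for p in m.parameters())

print(n_params(standard))              # 64*128*3*3 = 73,728
print(n_params(depthwise, pointwise))  # 64*3*3 + 64*128 = 8,768 (~8.4x fewer)

x = torch.randn(1, c_in, 56, 56)
assert standard(x).shape == pointwise(depthwise(x)).shape
```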
ShuffleNet [31] uses point-wise group convolution, which divides the input feature maps into groups and performs convolutions separately on each group to reduce computational cost. However, because the grouping operations limit the communication between different channels, ShuffleNet further shuffles the channels and feeds each group in the following layer with channels from multiple different groups, in order to distribute information across channels.
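The channel shuffle itself is a pure reshape-and-transpose, as the following sketch shows:

```python
import torch

def channel_shuffle(x, groups):
    """Interleave channels across groups (the ShuffleNet [31] operation).

    After grouped convolution, channels within a group never see the
    other groups; reshaping to (N, g, C/g, H, W) and swapping the two
    group axes redistributes channels so the next grouped convolution
    mixes information from every group.
    """
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten().tolist())
# [0.0, 4.0, 1.0, 5.0, 2.0, 6.0, 3.0, 7.0] -- channels interleaved
```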
In addition to the strategies adopted to reduce the computational cost of spatial convolution (e.g., depth-wise convolution), ShiftNet [32] presents a parameter-free, FLOP-free shift operation to replace expensive spatial convolutions. The proposed shift operation provides spatial information communication by shifting feature maps, making it possible to aggregate spatial information in the following point-wise convolutional layer. More recently, FE-Net [33] further finds that only a few shift operations are sufficient to provide spatial information communication. A sparse shift layer (SSL) is proposed to perform shift operations on a small portion of the feature maps only. With only 563 M FLOPs, FE-Net achieves the state-of-the-art performance among all major lightweight classification models on ImageNet, as shown in Table 1.

The aforementioned network models are briefly summarized in Table 1. In addition to the conventional image classification problem with thousands of classes and complex scenes, multi-label classification (e.g. face attributes [35,36]) and fine-grained classification (e.g. Stanford Dogs classification [37]) are also of great interest in the computer vision area.

Furthermore, the great success of deep learning in the image domain has stimulated a variety of techniques to learn robust feature representations for video classification, where semantic contents such as human actions [38] or complex events [39] are automatically categorized. Early works often treat a video clip as a collection of frames. Video classification is implemented by aggregating frame-level CNN features through averaging or encoding [40]. Standard classifiers, such as SVM, are finally used for recognition [41,42].

In contrast to the frame-level classification methods, there are a number of other approaches applying end-to-end CNN models to learn the hidden spatio-temporal patterns in video. For example, the typical C3D features [43] are derived from a deep 3-D convolutional network trained on the large-scale UCF101 dataset. Moreover, a two-stream approach [44] is proposed to factorize the learning of video representations into spatial and temporal cues separately. Specifically, a spatial CNN is adopted to model the appearance information from RGB frames, while a temporal CNN is used to learn the motion information from the dense optical flow among adjacent frames.

Since the two-stream approach only depicts movements within a short time window and fails to consider the temporal order of different frames, several recurrent connection models for sequential data, including recurrent neural networks (RNNs) and long short-term memory (LSTM) models, are leveraged to model the temporal dynamics of videos. In Ref. [45], two two-layer LSTM networks are trained with features from the two-stream approach for action recognition. In Ref. [46], the LSTM model and CNN model are combined to jointly learn spatial-temporal cues for video classification. In Refs. [47,48], an attention mechanism is introduced into convolutional LSTM models to discover relevant spatio-temporal volumes for video classification.

2.2. Object detection

Object detection, which is to determine and locate the object instances either from a large number of predefined categories in natural images or for a given particular object (e.g., Donald Trump's face, the distorted area in an image, etc.), is another important and challenging task in computer vision. Object detection and image classification share a similar technical challenge: both must handle a large number of highly variable objects. However, object detection is more difficult than image classification, as it must also identify the accurate localization of the object of interest.

Historically, most research efforts focused on detecting a single category of given objects, such as pedestrians [14,49] and faces [50], by designing a set of appropriate features (e.g. HOG [14,49], Haar-like [50], LBP [51], etc.). In these works, objects are detected by matching a set of predefined feature templates against each location in the image or feature pyramids. Standard classifiers such as SVM [14,49] and Adaboost [50] are often used for this purpose.

In order to build a general-purpose, robust object detection system, the research community has started to develop large-scale, multi-class datasets in recent years. Pascal-VOC 2007 [52], with 20 classes, and MS-COCO [53], with 80 object categories, are two iconic object detection datasets. In these two datasets, detection results are evaluated by two possible metrics: (i) Average Precision (AP), obtained by counting the correctly detected bounding boxes, i.e., those whose overlap ratio with a ground-truth box exceeds 0.5, and (ii) mean Average Precision (mAP), obtained by averaging the AP values associated with different thresholds of the overlap ratio.
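The overlap ratio above is the intersection over union (IoU) of a predicted box and a ground-truth box; a minimal reference implementation:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2).

    A detection is counted as correct when its IoU with a ground-truth
    box exceeds the threshold (0.5 for the classic Pascal-VOC metric).
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.142... -> miss at 0.5
print(iou((0, 0, 10, 10), (1, 1, 10, 10)))  # 81 / 100 = 0.81     -> hit at 0.5
```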

Recently, deep learning has substantially advanced the object detection field. As shown in Fig. 1, striking improvements in object detection accuracy have been demonstrated on both Pascal-VOC 2007 and MS-COCO by taking advantage of deep learning techniques.

Fig. 1. Object detection accuracy on (a) Pascal-VOC 2007 and (b) MS-COCO for different object detection techniques based on deep learning.

R-CNN [54] was the first two-stage method among the earliest CNN-based generic object detection techniques. It adopts AlexNet to extract a fixed-length feature vector from each resized region proposal, which is an object candidate generated by the selective search algorithm [55]. Each region is then classified by a set of category-specific linear SVMs. The method shows significant improvement in mAP over the traditional state-of-the-art DPM detector [49]. It is, however, inelegant and inefficient, due to its complex multistage pipeline and the redundant CNN feature extraction from numerous region proposals.

Inspired by the spatial pyramid pooling in SPPnet [56], which produces fixed-length feature outputs for arbitrary input image sizes, Fast R-CNN [57] incorporates a ROI pooling layer before the fully-connected layer to obtain a fixed-length feature vector for each proposed region, so that only a single convolution pass is required for the input image. Fast R-CNN substantially improves the detection efficiency over R-CNN and SPPnet. However, it still requires expensive computation for external region proposal, which then becomes the major bottleneck in overall efficiency.

To address this challenge, Faster R-CNN [58] further proposes a Region Proposal Network (RPN). It then integrates both RPN (for proposal generation) and Fast R-CNN (for region classification) into a unified, end-to-end network structure. RPN and Fast R-CNN share most of the convolution layers, and the features from the last shared layer are used for the two separate tasks (i.e. proposal generation and region classification). With this highly efficient architecture, Faster R-CNN achieves 6 FPS inference speed on a GPU and the state-of-the-art detection accuracy on Pascal-VOC 2007.

As a follow-up work, Mask R-CNN [59] was later proposed to combine object detection and pixel-level instance segmentation based on Faster R-CNN. By using ResNet101-FPN as the backbone network, Mask R-CNN demonstrated the best detection accuracy on MS-COCO in 2017.

The two-stage R-CNN architectures are able to offer superior accuracy. However, real-time efficiency is required for object detection in many real-world applications, and in this regard the simpler one-stage architectures are often preferred. YOLO [60] is the first one-stage method; it casts the detection task as a regression problem. It divides an image into S×S grid cells and proposes B bounding boxes for each cell. Then, by using the CNN features of the input image globally, it directly predicts the coordinates, confidence scores and C-class probabilities for these bounding boxes. Without the proposal generation step, YOLO achieves a real-time speed of 45 FPS. On the other hand, since YOLO investigates few prediction candidates, it is less accurate than the two-stage models, especially for small object detection.
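The following shape-level sketch illustrates the grid encoding described above for the original S = 7, B = 2, C = 20 configuration; the exact coordinate parameterization differs across YOLO versions and is omitted:

```python
import torch

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
pred = torch.randn(S, S, B * 5 + C)     # raw network output for one image

for row in range(S):
    for col in range(S):
        cell = pred[row, col]
        boxes = cell[:B * 5].view(B, 5)  # each box: x, y, w, h, confidence
        class_probs = cell[B * 5:]       # shared C-class scores per cell
        best = boxes[boxes[:, 4].argmax()]
        # a real decoder would convert (x, y, w, h) from cell-relative
        # offsets to image coordinates and apply non-maximum suppression

print(pred.shape)  # torch.Size([7, 7, 30]) for S=7, B=2, C=20
```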
The Single Shot MultiBox Detector (SSD) [61] follows a similar one-stage strategy. It outperforms YOLO in accuracy due to two major improvements. First, SSD extracts important features from multi-scale CNN feature maps. Second, it adopts a number of default bounding boxes, following the concept of anchors proposed by Faster R-CNN.

YOLOv2 absorbs the merits of SSD and Faster R-CNN by introducing the anchor mechanism [62]. The new YOLOv2 model both improves detection accuracy and reduces inference time by a large margin over YOLO on Pascal-VOC 2007. However, its accuracy is still worse than the two-stage methods on generic detection tasks (e.g. MS-COCO).

In the latest implementation, YOLOv3 [63], several anchor boxes are assigned on three differently scaled feature maps, thereby producing many more proposals than YOLO and YOLOv2. Small objects can thus be accurately detected from the anchors in low-level feature maps with small receptive fields. In addition, it uses a powerful backbone network, darknet-53, with several sets of residual blocks. YOLOv3 achieves similar accuracy as Faster R-CNN while maintaining real-time efficiency. It is the current state-of-the-art object detection framework for real-time applications.

Table 2
Summary of different object detection architectures (FPS based on a Pascal Titan X GPU).

Architecture       mAP (Pascal-VOC 2007)  mAP (MS-COCO)  Num. of FLOPs  FPS
R-CNN [54]         66.0%                  –              –              0.1
SPPnet [56]        63.1%                  –              –              1
Fast R-CNN [57]    70.0%                  35.9%          –              0.5
Faster R-CNN [58]  73.2%                  36.2%          –              6
Mask R-CNN [59]    –                      39.8%          –              3.3
YOLO [60]          63.4%                  –              –              45
SSD513 [61]        76.8%                  31.2%          –              8
YOLOv2 (608) [62]  78.6%                  21.6%          62.94 G        67
YOLOv3 (416) [63]  –                      33.0%          65.86 G        35

We summarize the performance metrics of the aforementioned detection models in Table 2. In addition to the one-stage and two-stage architectures, there are several other spotlights in object detection. For example, the relationships among different objects are considered by designing an object relation module in Ref. [64]. A Generative Adversarial Network (GAN) is used to generate super-resolution versions of small object patches [65] or features [66] to help with the detection of small objects.

In contrast to the significant progress in object detection on still images, video object detection has received less attention. Generally, object detection for videos is realized by fusing the results of object detection on the current frame and object tracking from the previous frames. Deep SORT [67] is a typical tracking-by-detection algorithm for multi-object tracking in videos. By integrating appearance information extracted from a CNN-based object detector with the original SORT [68] algorithm, it is able to achieve real-time tracking. Recently, several approaches have been proposed for end-to-end video object detection. In Ref. [69], temporal feature aggregation is performed to improve feature quality and recognition accuracy. In Ref. [70], data redundancy between consecutive frames is exploited. These seminal works have been adopted by the winner of the ImageNet Video Object Detection Challenge 2017 [71].

2.3. Image segmentation

Image classification recognizes what objects are in the visual scene (as shown by the example in Fig. 2(a)), while object detection reveals where the objects are (as shown by the example in Fig. 2(b)). In this sub-section, we further focus on the problem of how the objects are exactly presented in the visual scene, by using image segmentation.

Image segmentation is regarded as pixel-level classification, which aims at dividing an image into meaningful regions by classifying each pixel into a specific entity. In traditional image segmentation, the idea of unsupervised local region merging and splitting was extensively explored based on clustering [72], optimizing global criteria [73], or user interaction [74]. The blooming deep learning technologies have promoted large-scale supervised classification to move from image-level object classification to box-level object localization, and further to pixel-wise object segmentation. Therefore, today's image segmentation is object oriented and can be divided into two branches: (i) semantic segmentation, which assigns each pixel in an image to a semantic object class, as shown in Fig. 2(c), and (ii) instance segmentation, which predicts different labels for different object instances as a further refinement of semantic segmentation, as shown in Fig. 2(d).

In addition to classification and detection, the challenges of Pascal-VOC 2012 [76] and MS-COCO provide segmentation competitions as well.

Fig. 2. An example of different visual perception problems [75]: (a) image classification, (b) object detection, (c) semantic segmentation, and (d) instance segmentation.

They are two important datasets for image segmentation, on which most research works are evaluated. In addition to them, Cityscapes [77] and MVD [78] provide street scenes with a large number of traffic objects in each image. The popular metrics to evaluate pixel-level segmentation accuracy are pixel accuracy (PA, i.e., the proportion of pixels predicted correctly), mean pixel accuracy (mPA) over all classes, mean intersection over union (mIOU), and frequency weighted intersection over union (FWIOU). Among them, mIOU is often preferred for its simplicity and representativeness.
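For reference, mIOU can be computed in a few lines from the predicted and ground-truth class maps (a minimal sketch; public benchmarks typically also handle an 'ignore' label, omitted here):

```python
import numpy as np

def mean_iou(pred, label, num_classes):
    """mIOU: per-class intersection over union, averaged over classes.

    pred and label are integer class maps of the same shape; classes
    absent from both are skipped so they do not distort the mean.
    """
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, label == c).sum()
        union = np.logical_or(pred == c, label == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

label = np.array([[0, 0, 1, 1]])
pred  = np.array([[0, 1, 1, 1]])
print(mean_iou(pred, label, num_classes=2))  # (1/2 + 2/3) / 2 = 0.583...
```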
Early segmentation techniques based on deep learning usually apply a CNN as the feature descriptor for each pixel, where a pixel is described by its surrounding patch [79,80]. This CNN-based framework is problematic in efficiency and not sufficiently accurate, due to the redundant feature extraction. The fully convolutional network (FCN) [81] is the forerunner that successfully implements pixel-wise dense predictions for semantic segmentation in an end-to-end CNN structure. It replaces the fully-connected layers of the well-known classification architectures (e.g. VGG [22], GoogleNet [23], etc.) with convolution layers to facilitate inputs of arbitrary sizes, and outputs a heatmap, rather than a vector, to indicate classification scores. The prediction loss is then measured by the pixel-wise loss between the heatmap, upsampled using deconvolution, and the labeled image of the original size. FCN shows great improvement in pixel accuracy over traditional segmentation methods on Pascal-VOC 2012. However, the basic FCN structure fails to capture a large number of features, and it does not consider spatial consistency between pixels, which hinders its application to certain problems and scenarios.

In any case, the success of the FCN architecture made it popular, and it has been actively followed by many subsequent segmentation works [82–84]. Generally, a classification model without the fully-connected layers, used as the backbone network to produce low-resolution feature maps, is referred to as the encoder, while the symmetric mapping from the low-resolution maps to the pixel-wise classification outcome is termed the decoder. With a well-known backbone network as the encoder, alternative CNN-based segmentation works usually differ in the decoder implementation. For example, the decoding stage of SegNet [82] uses the max-pooling indices from the corresponding feature maps in its encoder for upsampling. The resultant maps are then convolved with a set of filters to generate the restored feature maps for dense prediction. In another typical encoder-decoder framework, U-Net [83], which is designed for biomedical image segmentation, the upsampling layers directly concatenate with cropped duplicates of the corresponding feature maps in the encoder to enhance resolution during the decoding process.

Image segmentation is a difficult problem that requires both good pixel-level accuracy, which relies on fine-grained local features, and good classification accuracy, for which the global context of the image is crucial to resolve local ambiguities. However, the pooling strategy in classic CNN architectures is a defect here, losing detailed information as multiple pooling steps are performed.

One possible and common solution to integrate context knowledge is to refine the output with fine-grained details for accurate segmentation by a Conditional Random Field (CRF) [85–87]. Alternatively, instead of using a pooling strategy, the problem can be solved by expanding the receptive fields, in which each neuron is connected to a subset of neurons in the previous layer. Dilated convolution [88], a regular convolution with upsampled (dilated) filters, is proposed to exponentially expand receptive fields without sacrificing resolution, as shown in Fig. 3. The DeepLab [85] model takes advantage of both dilated convolution and CRF refinement as post-processing to integrate context knowledge. As can be seen in Fig. 4, it achieves much higher prediction accuracy than SegNet and FCN, and it is thus considered a milestone work for semantic segmentation.

Fig. 3. The receptive field of (a) 1-dilated convolution, (b) 2-dilated convolution, and (c) 4-dilated convolution [88].
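The sketch below illustrates both properties in PyTorch: with padding matched to the dilation rate, resolution is preserved while the receptive field grows geometrically:

```python
import torch
import torch.nn as nn

# Three stacked 3x3 convolutions with dilations 1, 2, 4: resolution is
# preserved (with matching padding), while the receptive field grows to
# 3 -> 7 -> 15 pixels instead of 3 -> 5 -> 7 for ordinary convolutions.
layers = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1, dilation=1),
    nn.Conv2d(8, 8, 3, padding=2, dilation=2),
    nn.Conv2d(8, 8, 3, padding=4, dilation=4),
)
x = torch.randn(1, 1, 64, 64)
print(layers(x).shape)  # torch.Size([1, 8, 64, 64]) -- no downsampling

# Receptive field after each layer: rf += (kernel - 1) * dilation
rf = 1
for d in (1, 2, 4):
    rf += (3 - 1) * d
    print(rf)  # 3, 7, 15
```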

Several recently-proposed methods, such as RefineNet [84] and PSPNet [89], try to avoid or restore the loss caused by down-sampling in the encoder by fusing low-level and high-level features. RefineNet designs a decoder module that uses both short-range and long-range residual connections to capture rich contextual information. It has achieved the state-of-the-art performance on 7 public datasets. In PSPNet, a pyramid pooling module is proposed to aggregate region-based context information at different scales, exploiting the capability of global context information.

Inspired by spatial pyramid pooling, DeepLab V2 [87] investigates an atrous spatial pyramid pooling (ASPP) module, which incorporates dilated convolutions with different sampling rates and spatial pyramid pooling to capture multi-level context information. In the latest DeepLab V3+ [90], an Xception module [91] is introduced to the encoder, together with the improved ASPP module. It obtains the state-of-the-art performance on Pascal-VOC 2012, as shown in Fig. 4.

Fig. 4. mIOU for different semantic segmentation methods on Pascal-VOC 2012.

The aforementioned end-to-end architectures mainly focus on semantic segmentation. Comparatively, most works on instance segmentation follow a pipeline in which segmentation precedes recognition. For example, DeepMask [92] and the instance-sensitive fully convolutional network [93] use Fast R-CNN to classify the learned segment proposals. The fully convolutional instance segmentation (FCIS) of Ref. [94] combines the segment proposal system [93] and object detection to predict object classes, boxes and masks simultaneously. However, this method is not as accurate as Mask R-CNN [59] on the MS-COCO instance segmentation challenge. Mask R-CNN extends Faster R-CNN by adding a mask prediction branch in addition to bounding box regression and class recognition. The very recent path aggregation network (PANet) [95] enhances the feature hierarchy by bottom-up path augmentation. With subtle computation overhead, it reached first place in the MS-COCO instance segmentation challenge and also represents the state-of-the-art on MVD and Cityscapes.

Although pixel-wise image segmentation is progressing rapidly with superior accuracy, it is still far from practical usage in tasks such as video semantic segmentation, due to the high complexity of dense prediction. Therefore, developing highly efficient image segmentation frameworks is one of the grand challenges in the computer vision community.

3. Hardware implementation

The recent breakthroughs in computer vision algorithms are not only driven by deep learning technologies and large-scale datasets, but also rely on the major leaps of hardware acceleration, which provides powerful parallel computing architectures to enable the efficient training and inference of large-scale, complex, multi-layered neural networks.

Hardware acceleration takes advantage of computer hardware to perform computing tasks with lower latency and higher throughput than conventional software implementations running on general-purpose CPUs [98]. Historically, the von-Neumann-style compute-centric architectures (e.g. CPUs) are primarily designed for effective serial computations with complex task scheduling. They suffer from high energy consumption and low memory bandwidth for data movement when evaluating deep CNNs, which require parallel dense computation, high data reusability and large memory bandwidth [99].

Fig. 5(a) compares neural networks with other approaches in terms of accuracy and scale (i.e. data/model size). The traditional machine learning methods, such as decision trees, SVMs, etc., are referred to as "Other Approaches" in Fig. 5(a). They are generally based on manually designed features. Due to the limited learning capability of these traditional methods, their accuracy cannot continuously increase with data/model scale. On the other hand, deep neural networks are highly scalable in their learning capability when deeper network structures and larger datasets are adopted.

Fig. 5. (a) Comparison between neural networks and other approaches in terms of accuracy and scale (i.e. data/model size) [96], and (b) trade-off between flexibility and efficiency for different hardware implementations [97].

Practically, an operation can be computed faster on a hardware platform that is application-specifically designed and/or programmed. Hardware acceleration thus steps toward heavy customization of processing capability by allowing great parallelism, having specific data-paths for temporal variants, and reducing the overhead of instruction control [98]. For decades, hardware customization in the form of GPUs, FPGAs and application-specific integrated circuits (ASICs) has offered a promising path that trades flexibility for computation efficiency, as seen in Fig. 5(b). In this section, we review a number of popular hardware implementations, including GPUs, FPGAs and other application-specific accelerators.

3.1. Graphics processing units (GPUs)

GPUs were initially developed to accelerate graphics processing. A GPU is particularly designed for integrated transform, lighting, triangle setup/clipping, and rendering [100]. A modern GPU is not only a powerful graphics engine but also a highly parallelized computing processor featuring high throughput and high memory bandwidth for massively parallel algorithms, an approach dubbed GPU computing or general-purpose computing on GPU (GPGPU).

In contrast to multicore CPUs, which are typically out-of-order, multi-instructional, run at high frequencies and use large caches to minimize the latency of a single thread, GPGPUs consist of thousands of cores that are in-order, operate at lower frequencies and rely on smaller caches. To create high-performance GPU-accelerated applications with parallel programming, a variety of development platforms, such as the compute unified device architecture (CUDA) [101] and the open computing language (OpenCL) [102], have been studied and utilized for GPU-accelerated embedded systems, desktop workstations, enterprise data centers, cloud-based platforms and high-performance computing (HPC) servers.

A number of hardware vendors have produced GPUs. Among them, Intel, Nvidia and AMD/ATI have been the market share leaders [100]. As shown in Fig. 6, the evolution of GPGPUs began in 2007, when Nvidia released its CUDA development environment. A great variety of GPUs have been designed for specific usages, such as Nvidia GeForce GTX and AMD Radeon HD GPUs for powerful gaming, and the Nvidia Quadro and Titan X series for professional workstations [100]. More recently, the emergence of deep learning technology has ushered in significant advances in GPU computing.

Taking the CNN as an example, it can take advantage of its natural algorithmic parallelism in the following aspects [103]: (i) the convolution of an n×n matrix with a k×k kernel can be performed in parallel; (ii) the subsampling/pooling operation can be parallelized by executing different pooling operations separately; (iii) the activation of each neuron in a fully connected layer can be parallelized by creating a binary-tree multiplier. With great parallel processing structures and strong floating-point capabilities, GPGPUs have been recognized as a good fit to accelerate deep learning. A number of GPU-based CNN libraries have been developed to facilitate highly optimized CNN implementations on GPUs, including cuDNN [104], Cuda-convnet [105] and several other libraries built upon the popular deep learning frameworks, such as Caffe [106], Torch [107], Tensorflow [108], etc.

Computational throughput, power consumption and memory efficiency are three important metrics when implementing deep learning on GPUs. Fig. 6 summarizes the peak performance of recent Nvidia GPUs for single-precision floating-point (FP32) arithmetic, measured in GFLOPs, and the power consumption, gauged by Thermal Design Power (TDP). The GeForce 10 series, based on the powerful "Nvidia Pascal" GPU architecture, is a set of consumer graphics cards released by Nvidia in 2016 [110]. With an inexpensive GeForce GTX 1060, composed of 1280 CUDA cores delivering 3855 GFLOPs of computational throughput, one can get into deep learning at affordable cost.

For professional usage, the Titan V and Tesla V100 are much more powerful and scalable than the GeForce 10 series; they are based on the Pascal and the new Volta architectures, the latter of which integrates CUDA cores with the new Tensor core technology. Tensor cores are especially designed for deep learning and offer an extremely wide memory bus. Compared to CUDA cores, they improve the peak performance by up to 12× for training and up to 6× for inference. In addition to their high throughput, Tensor cores allow efficient computation with 16-bit word-length, implying that the amount of transferred data can be doubled over 32-bit arithmetic with the same memory bandwidth.

Nvidia Jetson is a leading low-power embedded platform that enables server-grade computing performance on edge devices. Jetson TX2 is based on the 16 nm NVIDIA Tegra "Parker" system-on-a-chip (SoC), which delivers 1 TFLOPs of throughput in a credit-card-sized module. A new series of RTX gaming cards (i.e., RTX 2070/2080/2080Ti) with the Turing architecture was unveiled in August 2018. They have Tensor cores on board and support unrestricted 16-bit floating-point (FP16), 8-bit integer (INT8) and 4-bit integer (INT4) arithmetic. Among them, the RTX 2080Ti offers promising performance with more than 100 TFLOPs in FP16.

Fig. 6. Computational throughput in terms of GFLOPS and power consumption in terms of TDP (watts) for single-precision floating-point arithmetic [109].

When evaluating memory efficiency for GPUs, memory size and memory bandwidth are two important metrics. Today, it is common to have 11–12 GB of memory on state-of-the-art gaming cards. Many Tesla GPUs have 16 GB of memory, and several Turing-architecture Quadro models have 24 GB (e.g. RTX 6000) or 64 GB (e.g. RTX 8000). For a given deep learning task, the actual performance of a GPU is often far from its peak performance. In most practical applications, the throughput of a GPU is about 15–20% of its peak performance [111]. This, in turn, implies that evaluating DNNs is actually limited by memory bandwidth, instead of computing power. Adopting GPUs with high memory bandwidth is a valid strategy to speed up both training and inference. For example, the GTX Titan XP (10,709 FP32 GFLOPS and 548 GB/s) can be faster than the GTX 1080Ti (10,609 FP32 GFLOPS and 484 GB/s) by up to 13% on bandwidth-limited tasks. The GTX Titan V (652.8 GB/s), Tesla V100 (900 GB/s) and P100 (720 GB/s) are even faster than the GTX Titan XP [109]. In addition to Nvidia, AMD has also released its high-end Vega GPUs, with memory bandwidth similar to the Titan V [100].
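A back-of-the-envelope, roofline-style estimate shows why bandwidth dominates. Using the GTX 1080Ti figures quoted above (the arithmetic intensities below are hypothetical illustration values):

```python
# Roofline-style estimate: a kernel with arithmetic intensity I (FLOPs per
# byte moved) cannot exceed min(peak_flops, I * bandwidth).
peak_gflops = 10_609     # GTX 1080Ti FP32 peak, GFLOPs
bandwidth_gbs = 484      # GTX 1080Ti memory bandwidth, GB/s

def attainable_gflops(intensity_flops_per_byte):
    return min(peak_gflops, intensity_flops_per_byte * bandwidth_gbs)

for intensity in (2, 5, 10, 40):
    g = attainable_gflops(intensity)
    print(intensity, g, f"{100 * g / peak_gflops:.0f}% of peak")
# At intensities of roughly 2-5 FLOPs/byte the GPU is memory-bound and
# reaches only ~9-23% of peak, consistent with the 15-20% utilization
# figure cited above.
```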
To address the fundamental issue of limited computing throughput and memory bandwidth, multi-GPU systems allow single-machine multi-GPU or even distributed multi-system, multi-GPU configurations. For a system with a single machine and multiple GPUs working on separate tasks, one can directly access any available GPU without coding in CUDA. On the other hand, for multiple GPUs working on a shared task, such as training several models with different hyper-parameters, distributed training is needed. Nvidia provides a collective communications library (NCCL) [112] that implements multi-GPU and multi-node collective communication primitives to make full use of all GPUs within and across nodes with maximum bandwidth. Distributed training is now supported by many popular deep learning frameworks such as Tensorflow, Caffe, etc. These techniques reduce the computational time linearly with the number of GPUs [112].

3.2. Field-programmable gate arrays (FPGAs)

While GPUs have been demonstrated to offer extremely high throughput and are broadly used for hardware acceleration of DNNs, they are often not preferred for energy/power-constrained applications, such as IoT devices, due to their high power consumption. DNN acceleration is thus moving toward an alternative solution based on energy-efficient FPGAs. An FPGA allows us to implement irregular parallelism, customized data types and application-specific hardware architectures, offering great flexibility to accommodate the recent DNN models that feature increased sparsity and compact network structures. Further, an FPGA can be reprogrammed after manufacturing for the desired functions and applications. Due to these attractive features, a large number of FPGA-based accelerators have been proposed for both HPC data centers [113] and embedded applications [114].
An FPGA is a component that contains an array of programmable logic blocks connected via a hierarchy of reconfigurable interconnects. Today's FPGAs usually contain (i) digital signal processing units (DSPs) for multiply-add-accumulation (MAC) operations, (ii) lookup tables (LUTs) for combinatorial logic operations, and (iii) block RAMs for on-chip data storage [10]. Fig. 7 shows a typical FPGA architecture to implement DNNs [115]. It consists of a memory data management (MDM) unit, an on-chip data management (ODM) unit, a general matrix multiply (GEMM) unit implemented by a set of processing elements (PEs) that each perform one or more MAC operations, and a misc-layers unit (MLU) where ReLU, pooling and batch normalization are computed.

Fig. 7. A typical FPGA architecture to implement DNNs [115].

The FPGA architecture in Fig. 7 executes a DNN model in several major steps. It first fetches the DNN weights and input feature maps into an on-chip buffer (i.e. ODM) from the MDM. Next, the GEMM unit performs matrix operations and transfers the outcomes to the MLU for ReLU/batch normalization/pooling. The output of the MLU goes to another ODM unit and will be accessed by the subsequent convolutional and/or fully connected layers. If the on-chip buffer does not have sufficiently large capacity, the intermediate results must be temporarily stored in on-chip or off-chip memory.

Historically, an FPGA system is often specified at the register-transfer level (RTL) using a hardware description language (HDL) such as Verilog or VHDL. This low-level design methodology requires substantial effort and hardware expertise to carefully describe the detailed hardware architecture, including the massive concurrency between different hardware modules. Recently, high-level synthesis (HLS) tools have been successfully developed to facilitate efficient FPGA design using high-level programming languages such as C and C++, automatically compiling the high-level description to generate a low-level specification (i.e. HDL) [116]. With this synthesis flow, the design cost of FPGA accelerators can be significantly reduced. However, there is an important tradeoff between the RTL and HLS approaches in terms of design cost and system performance.

In practice, a DNN model is often trained or fine-tuned on a high-performance computing platform such as a GPU, while FPGA acceleration is implemented for DNN inference, processing given input data based on the pre-trained DNN model. As illustrated in Tables 1 and 2, computer vision algorithms based on DNNs are often associated with high computational workloads (i.e., a large number of FLOPs) and large memory storage (i.e., a large number of network parameters), while the memory bandwidth of FPGAs is often less than 10% of that of GPUs. Hence, the grand challenge is to find an efficient mapping from the pre-trained, complex DNN model to the limited hardware resources (i.e., high-density logic and memory blocks) offered by FPGAs. This technical challenge has been tackled via hardware-friendly algorithmic optimizations, including: (i) algorithmic operation, (ii) data-path optimization and (iii) model compression.

Algorithmic operation: Computational transforms, such as GEMM, the fast Fourier transform (FFT) and the Winograd transform, may be applied to feature maps and/or convolutional kernels to reduce the number of arithmetic operations during inference. GEMM is a popular way of processing DNNs in CPUs and GPUs; it vectorizes the computation of convolutional and fully connected layers [117]. FFT casts 2-D convolution to element-wise matrix multiplication, thereby reducing the arithmetic complexity. It is highly efficient for large kernel sizes (>5) because of the large number of convolutional operations between the feature maps and kernels [117,118]. For small kernels where FFT is not preferred, the Winograd transform [119] provides an alternative way to reduce the number of multiplications by reusing intermediate results. It can offer a 7.28× runtime speed-up compared to GEMM when running VGGNet on a Titan X GPU [118], and deliver a throughput of 46 GOPs when running AlexNet on an FPGA [120].
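The GEMM lowering (often called im2col) can be sketched in a few lines of NumPy: every receptive field is unrolled into a column, so the whole layer becomes one matrix multiply (a didactic sketch; production libraries avoid materializing the column matrix):

```python
import numpy as np

def conv2d_as_gemm(x, w):
    """Lower a convolution to one matrix multiply (im2col + GEMM).

    x: input feature map (C, H, W); w: kernels (K, C, R, S); stride 1,
    no padding. Each output pixel's receptive field is unrolled into a
    column, turning the layer into a (K, C*R*S) x (C*R*S, P) GEMM.
    """
    c, h, wd = x.shape
    k, _, r, s = w.shape
    oh, ow = h - r + 1, wd - s + 1
    cols = np.empty((c * r * s, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[:, i:i + r, j:j + s].ravel()
    return (w.reshape(k, -1) @ cols).reshape(k, oh, ow)

x = np.random.rand(3, 8, 8)
w = np.random.rand(16, 3, 3, 3)
print(conv2d_as_gemm(x, w).shape)  # (16, 6, 6)
```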

Data-path optimization: To fully exploit the parallelism, the data path is optimized by unrolling the convolutional layers in CNNs and mapping them onto a limited number of PEs. In early FPGA implementations, PEs are arranged in a 2D grid as a systolic array [121–123]. Because such a simple architecture limits the CNN kernel size and does not offer data caching, it cannot achieve extremely high performance. Recently, loop optimization techniques, including loop reordering, unrolling, pipelining and tiling, have been proposed to address this issue. Loop reordering tries to prevent redundant memory accesses between loops to increase cache usage efficiency [124]. Loop unrolling and pipelining maximize the utilization of FPGA resources by exploiting the parallelism of loop iterations [115,125,126]. Loop tiling deals with the issue posed by the insufficient on-chip memory of FPGAs: it partitions the feature maps and weights of each layer fetched from memory into chunks, also referred to as tiles, to fit them into on-chip buffers [124,125,127].
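The idea of loop tiling carries over directly to a blocked matrix multiply, sketched below in NumPy for illustration (on an FPGA, the tile size would be chosen to match the on-chip buffer capacity):

```python
import numpy as np

def tiled_matmul(a, b, tile=32):
    """Blocked GEMM illustrating the loop tiling idea.

    Each (tile x tile) block of the operands is small enough to stay in
    a fast on-chip buffer while it is reused, which is how accelerators
    partition feature maps and weights that do not fit in block RAM.
    """
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n))
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, k, tile):  # accumulate over the K dimension
                c[i0:i0 + tile, j0:j0 + tile] += (
                    a[i0:i0 + tile, k0:k0 + tile] @ b[k0:k0 + tile, j0:j0 + tile]
                )
    return c

a, b = np.random.rand(96, 80), np.random.rand(80, 64)
print(np.allclose(tiled_matmul(a, b), a @ b))  # True
```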
Model compression: DNNs often carry a significant number of redundant parameters and are mainly used for error-tolerant applications. Hence, a lot of effort has been made to simplify DNN models and, consequently, reduce the complexity of their hardware implementations. These model compression methods can be broadly classified into three categories: (i) pruning [115,128–131], (ii) low-rank approximation [114,132], and (iii) quantization [114,121,133–137].

First, pruning is usually the first step to reduce model redundancy, by removing the least-important connections and/or parameters. Taking a CNN as an example, we can remove the weights that are extremely small [34] and/or cause high energy consumption [130]. After pruning, the CNN model is highly sparse and can be efficiently implemented on an FPGA by masking out the zero weights in multiplications. Second, low-rank approximation decomposes the weight matrix of a convolutional or fully connected layer into a set of low-rank filters that can be evaluated with low computational cost [114]. Finally, because fixed-point arithmetic requires fewer computational resources than floating-point arithmetic, feature maps, weight matrices and/or convolutional kernels can be quantized using a fixed-point representation to further reduce the computational cost.

A straightforward approach is to encode each numerical value with the desired word-length according to its range [114,121]. Alternatively, a dynamic scheme may be adopted to assign different scaling factors to different numerical parameters within the same network [133,134]. When the dynamic quantization method is applied to both the convolutional and fully-connected layers of AlexNet without fine-tuning, the classification accuracy is almost unchanged (<1%) [135].
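A minimal sketch of the basic fixed-point scheme: one scaling factor maps a tensor onto 8-bit integers, and the dynamic schemes above simply choose such scales per layer or per tensor:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric linear quantization of a tensor to 8-bit integers.

    A single scaling factor maps the tensor's range onto [-127, 127];
    quantization error is bounded by half a quantization step.
    """
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(1000).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale          # dequantize
print(np.abs(w - w_hat).max() <= scale / 2)   # True
print(q.nbytes, w.nbytes)                     # 1000 vs 4000 bytes (4x smaller)
```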
In the extreme case, a DNN may use binary weights and activations, resulting in an extremely compact representation that is referred to as a binary neural network (BNN) [115,133,136,137]. A BNN can be evaluated with extremely low computational cost, as binary addition and multiplication can both be implemented with simple logic gates. Beyond BNNs, a ternary DNN [138] sets its weights to +1, 0 or −1, allowing each weight to be represented by 2 bits, while the numerical operations in all neurons are still implemented with floating-point arithmetic (FP32).
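The logic-gate evaluation works because, with ±1 values packed as bits, a dot product reduces to XOR plus a popcount. A small sketch:

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two {-1, +1} vectors packed as n-bit integers.

    With +1 encoded as bit 1 and -1 as bit 0, matching positions
    contribute +1 and differing positions -1, so
    dot = n - 2 * popcount(a XOR b): only logic operations, no
    multipliers, which is what makes BNN inference cheap in FPGA logic.
    """
    return n - 2 * bin(a_bits ^ b_bits).count("1")

a = [+1, -1, +1, +1, -1, -1, +1, -1]
b = [+1, +1, -1, +1, -1, +1, +1, +1]
pack = lambda v: sum(1 << i for i, s in enumerate(v) if s > 0)
print(binary_dot(pack(a), pack(b), len(a)))  # 0
print(sum(x * y for x, y in zip(a, b)))      # 0 (floating-point reference)
```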

Table 3 summarizes the performance of several typical CNN models deployed on FPGAs using different model optimization methods. CNN models based on quantized arithmetic are highly efficient in terms of hardware utilization and power consumption; however, their accuracy is often compromised. On the other hand, CNN models based on low-rank approximation (e.g. SVD) and pruning carry a smaller number of weights while simultaneously achieving high classification accuracy. The ternary ResNet [115], implemented with an Intel Stratix 10 FPGA [140], achieves a throughput of 12 TGOP/s, outperforming the throughput of a Titan X Pascal GPU by 10%.

Table 3
Performance comparison of FPGA-based CNN accelerators.

CNN Model             FPGA Device     Optimization      Accuracy  # of Param  Computation  Precision  Frequency  Throughput  Power
                                      Method            (Top-5)   (M)         (GOP)                   (MHz)      (GOP/s)     (W)
VGG-19 [139]          Arria10 GX1150  –                 90.1%     138         30.8         float32    370        866         41.7
VGG-16 [114]          Zynq 7Z045      SVD               87.96%    50.2        30.5         fixed16    150        137         9.6
VGG-16 [134]          Arria10 GX1150  Dynamic           88.1%     138         30.8         fixed8     150        645         –
BNN: XNOR-Net [137]   Stratix5 GSD8   Binary            66.8%     87.1        2.3          fixed1     150        1,964       26.2
Ternary ResNet [115]  Stratix10       Ternary, pruning  79.7%     61          1.4          float32    500        12,000      141.2
To make a comprehensive comparison of the state-of-the-art hardware accelerators, we present the key performance metrics of several mainstream accelerators (i.e. CPU, GPU and FPGA) with different network models (i.e. VGG and ResNet) in Table 4 [141]. The FPGA cluster in Table 4 is composed of 15 FPGA chips, as described in Ref. [142]. Two important observations can be made from the data in Table 4. First, the throughput of FPGAs is substantially higher than that of CPUs, but it is often lower than the throughput of GPUs. Second, among FPGA, CPU and GPU, the FPGA offers the highest energy efficiency.

Table 4
Performance comparison of neural networks on different hardware platforms [141].

Model       Platform            Device              Precision  Frequency (MHz)  Throughput (GOP/s)  Energy Efficiency (GOP/J)  Power (W)
VGG-19      CPU                 Xeon E5-2650v2      float32    2,600            119                 0.63                       95
            GPU                 GTX TITAN X         float32    1,002            1,704               6.82                       250
            Cluster w/15 FPGAs  XC7 VX690T          fixed16    –                1,220 (a)           38.13                      –
VGG-16      Cluster w/15 FPGAs  XC7 VX690T          fixed16    –                1,197 (a)           37.88                      –
            FPGA                Stratix-V GSD8      fixed8     120              117.8               19.1                       –
ResNet-152  CPU                 Xeon E5-2650v2      float32    2,600            119                 0.63                       95
            GPU                 GTX TITAN           float32    1,002            1,661               0.60                       250
            FPGA                Stratix 10 GX 2800  fixed8/16  300              789.44              –                          –
            FPGA                Arria10 GX1150      fixed8/16  240              697.09              –                          –
            FPGA                Arria10 GX1150      float16    150              315.5               –                          –
ResNet-50   FPGA                Arria10 GX1150      float16    150              285.07              –                          –
            FPGA                Arria10 GX1150      fixed8/16  240              599.61              –                          –

(a) Measured throughput value of a single FPGA.

3.3. Application-specific hardware accelerators

A typical computer system is often heterogeneous, combining CPUs with other, dissimilar processors to meet specific computing requirements. Processors that complement CPUs in this way are known as application-specific coprocessors. In addition to notable coprocessors such as GPUs and FPGAs, several specialized hardware units, offered either as stand-alone devices or as coprocessors, have been developed specifically for deep learning and/or other AI applications.
Tensor processing unit (TPU) [143,144], a customized ASIC developed by Google, is a stand-alone device designed specifically for neural networks and tailored to the Google TensorFlow framework [108]. TPU targets a high volume of low-precision (e.g., 8-bit) arithmetic, and it already powers many applications at Google, such as the search engine and AlphaGo [144]. The Intel Nervana neural network processor (NNP) [145] is designed to provide the flexibility required by deep learning primitives while keeping its core hardware components as efficient as possible. Mobileye EyeQ [146] is a family of SoC devices specialized for vision processing in autonomous driving; it handles complex and computationally intensive vision tasks while maintaining low power consumption.
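As a toy illustration of the low-precision arithmetic such accelerators target (a NumPy sketch under simplifying assumptions, such as per-tensor symmetric scaling, and not Google's TPU implementation), the following performs a matrix multiply with 8-bit operands and 32-bit accumulation:

```python
import numpy as np

def int8_matmul(a_fp, b_fp):
    """Emulate an 8-bit integer matrix multiply with 32-bit accumulation."""
    sa = np.max(np.abs(a_fp)) / 127.0 + 1e-12  # per-tensor scale (assumption)
    sb = np.max(np.abs(b_fp)) / 127.0 + 1e-12
    a_q = np.clip(np.round(a_fp / sa), -127, 127).astype(np.int8)
    b_q = np.clip(np.round(b_fp / sb), -127, 127).astype(np.int8)
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)  # wide accumulator
    return acc.astype(np.float32) * (sa * sb)  # rescale back to float

a = np.random.randn(4, 8).astype(np.float32)
b = np.random.randn(8, 3).astype(np.float32)
err = np.max(np.abs(int8_matmul(a, b) - a @ b))  # small quantization error
```

The design point here is that 8-bit multipliers are far cheaper in silicon area and energy than floating-point units, while the wide accumulator keeps the summation from overflowing.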
Various AI coprocessors for mobile platforms have been developed recently, and an arms race seems to be around the corner. The Qualcomm Snapdragon 845 mobile processor contains a Hexagon 685 DSP core that supports sophisticated on-device AI processing in camera, voice and gaming applications [147]. Imagination Technologies [148] develops a series of neural network accelerators (NNAs), such as the PowerVR Series 2NX/3NX NNA; these are intellectual property (IP) cores designed to deliver high-performance computation and low power consumption for embedded and mobile devices. The new "neural engine" by Apple [149], incorporated in the Apple A11/A12 Bionic SoC, is a pair of processing cores dedicated to specific machine learning algorithms, including Face ID, augmented reality, etc. HiSilicon Kirin 970 is the first mobile AI platform developed by HUAWEI [150]. With a dedicated neural processing unit (NPU), its heterogeneous computing architecture improves the throughput and energy efficiency by up to 25x and 50x, respectively, over a quad-core Cortex-A73 CPU cluster.
4. Conclusions

As a scientific discipline, computer vision has long been a challenging research area and has received significant attention. With the emergence of big data, advanced deep learning algorithms and powerful hardware accelerators, modern computer vision systems have evolved dramatically. In this paper, we have conducted a comprehensive survey of computer vision techniques. Specifically, we have highlighted the recent accomplishments in both the algorithms for a variety of computer vision tasks, such as image classification, object detection and image segmentation, and the promising hardware platforms that implement DNNs efficiently for practical applications, such as GPUs, FPGAs and other new generations of hardware accelerators.

In the future, increasingly compact and efficient DNNs will be needed for real-time and embedded applications. In addition, weakly supervised or unsupervised DNN schemes must be investigated so that all object categories in open-world scenes can be perceived. Furthermore, highly energy-efficient hardware engines are required to extend the existing accelerators to a broad spectrum of challenging scenarios. To address these grand challenges, massive innovations in computer vision systems, in terms of both algorithm development and hardware design, are expected over the next five or even ten years.

References

[1] James Kobielus, Powering AI: the Explosion of New AI Hardware Accelerators, InfoWorld, 2018 [Online]. Available: https://www.infoworld.com/article/3290104/powering-ai-the-explosion-of-new-ai-hardware-accelerators.html.
[2] Computer vision, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Computer_vision.
[3] Richard Szeliski, Computer Vision: Algorithms and Applications, Springer Science & Business Media, 2010.
[4] Wilson Geisler, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Psyccritiques, 1983, pp. 581–582.
[5] Laura McClure, Building an AI with the Intelligence of a Toddler: Fei-Fei Li at TED2015, TED, 2015 [Online]. Available: https://blog.ted.com/building-an-ai-with-the-intelligence-of-a-toddler-fei-fei-li-at-ted2015/.
[6] Jia Deng, et al., ImageNet: a large-scale hierarchical image database, in: Conference on Computer Vision and Pattern Recognition, 2009.
[7] M. Sornam, Kavitha Muthusubash, V. Vanitha, A survey on image classification and activity recognition using deep convolutional neural network architecture, in: International Conference on Advanced Computing, 2017.
[8] Li Liu, et al., Deep Learning for Generic Object Detection: a Survey, arXiv:1809.02165, 2018.
[9] Alberto Garcia-Garcia, et al., A Review on Deep Learning Techniques Applied to Semantic Segmentation, arXiv:1704.06857, 2017.
[10] Sparsh Mittal, A survey of FPGA-based accelerators for convolutional neural networks, Neural Comput. Appl. (2018) 1–31.
[11] Alex Krizhevsky, Geoffrey Hinton, Learning Multiple Layers of Features from Tiny Images, Technical Report, vol. 1, no. 4, University of Toronto, 2009.
[12] Yann LeCun, et al., Gradient-based learning applied to document recognition, Proc. IEEE (1998) 2278–2324.
[13] David Lowe, Object recognition from local scale-invariant features, in: International Conference on Computer Vision, vol. 2, 1999.
[14] Navneet Dalal, Bill Triggs, Histograms of oriented gradients for human detection, in: Conference on Computer Vision and Pattern Recognition, vol. 1, 2005.
[15] Aude Oliva, Antonio Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. (2001) 145–175.
[16] Fei-Fei Li, Pietro Perona, A Bayesian hierarchical model for learning natural scene categories, in: Conference on Computer Vision and Pattern Recognition, 2005.
[17] Kostas Daniilidis, Petros Maragos, Nikos Paragios, Improving the Fisher kernel for large-scale image classification, in: European Conference on Computer Vision, 2010, pp. 143–156.
[18] Corinna Cortes, Vladimir Vapnik, Support-vector networks, Machine Learning, 1995, pp. 273–297.
[19] Jianchao Yang, et al., Linear spatial pyramid matching using sparse coding for image classification, in: Conference on Computer Vision and Pattern Recognition, 2009.
[20] Jorge Sanchez, Florent Perronnin, High-dimensional signature compression for large-scale image classification, in: Conference on Computer Vision and Pattern Recognition, 2011.
[21] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1097–1105.
[22] Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556, 2014.
[23] Christian Szegedy, et al., Going deeper with convolutions, in: Conference on Computer Vision and Pattern Recognition, 2015.
[24] Kaiming He, et al., Deep residual learning for image recognition, in: Conference on Computer Vision and Pattern Recognition, 2016.
[25] Gao Huang, et al., Densely connected convolutional networks, in: Conference on Computer Vision and Pattern Recognition, 2017.
[26] Jie Hu, Li Shen, Gang Sun, Squeeze-and-excitation networks, in: Conference on Computer Vision and Pattern Recognition, 2017.
[27] Barret Zoph, et al., Learning transferable architectures for scalable image recognition, in: Conference on Computer Vision and Pattern Recognition, 2018.
[28] B. Zoph, Q.V. Le, Neural architecture search with reinforcement learning, in: International Conference on Learning Representations, 2017.
[29] Forrest Iandola, et al., SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size, arXiv:1602.07360, 2016.
[30] Andrew Howard, et al., MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv:1704.04861, 2017.
[31] Xiangyu Zhang, et al., ShuffleNet: an extremely efficient convolutional neural network for mobile devices, in: Conference on Computer Vision and Pattern Recognition, 2018.
[32] Bichen Wu, et al., Shift: a zero FLOP, zero parameter alternative to spatial convolutions, in: Conference on Computer Vision and Pattern Recognition, 2018.
[33] Weijie Chen, et al., All you need is a few shifts: designing efficient convolutional neural networks for image classification, in: Conference on Computer Vision and Pattern Recognition, 2019.
[34] Song Han, Huizi Mao, William J. Dally, Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding, in: International Conference on Learning Representations, 2016.
[35] Shuo Yang, et al., From facial parts responses to face detection: a deep learning approach, in: International Conference on Computer Vision, 2015.
[36] Bart Thomee, et al., YFCC100M: the New Data in Multimedia Research, arXiv:1503.01817, 2015.
[37] Aditya Khosla, et al., Novel dataset for fine-grained image categorization: Stanford dogs, in: Conference on Computer Vision and Pattern Recognition Workshop on Fine-Grained Visual Categorization, 2011.
[38] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, arXiv:1212.0402, 2012.
[39] TREC Video Retrieval Evaluation: Multimedia event detection. [Online]. Available: https://trecvid.nist.gov/.
[40] Herve Jegou, et al., Aggregating local descriptors into a compact image representation, in: Conference on Computer Vision and Pattern Recognition, 2010.


[41] Zhongwen Xu, Yi Yang, G. Alex Hauptmann, A discriminative CNN video representation for event detection, in: Conference on Computer Vision and Pattern Recognition, 2015.
[42] Hongteng Xu, Y. Zhen, H. Zha, Trailer generation via a point process-based visual attractiveness model, in: International Joint Conference on Artificial Intelligence, 2015.
[43] Du Tran, et al., Learning spatiotemporal features with 3D convolutional networks, in: International Conference on Computer Vision, 2015.
[44] Karen Simonyan, Andrew Zisserman, Two-stream convolutional networks for action recognition in videos, Adv. Neural Inf. Process. Syst. (2014) 568–576.
[45] Jeff Donahue, et al., Long-term recurrent convolutional networks for visual recognition and description, Trans. Pattern Anal. Mach. Intell. (2014) 677–691.
[46] Zuxuan Wu, et al., Modeling spatial-temporal clues in a hybrid deep learning framework for video classification, in: International Conference on Multimedia, 2015.
[47] Shikhar Sharma, Ryan Kiros, Ruslan Salakhutdinov, Action Recognition Using Visual Attention, arXiv:1511.04119, 2015.
[48] Z. Li, et al., VideoLSTM convolves, attends and flows for action recognition, Comput. Vis. Image Understand. 166 (2018) 41–50.
[49] Pedro F. Felzenszwalb, et al., Object detection with discriminatively trained part-based models, Trans. Pattern Anal. Mach. Intell. (2010) 1627–1645.
[50] Paul Viola, Michael J. Jones, Robust real-time face detection, Int. J. Comput. Vis. (2004) 137–154.
[51] Timo Ahonen, Abdenour Hadid, Matti Pietikainen, Face description with local binary patterns: application to face recognition, Trans. Pattern Anal. Mach. Intell. (2006) 2037–2041.
[52] Mark Everingham, et al., The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. (2010) 303–338.
[53] Tsung-Yi Lin, et al., Microsoft COCO: common objects in context, in: European Conference on Computer Vision, 2014.
[54] Ross Girshick, et al., Rich feature hierarchies for accurate object detection and semantic segmentation, in: Conference on Computer Vision and Pattern Recognition, 2014.
[55] Jasper Uijlings, et al., Selective search for object recognition, Int. J. Comput. Vis. (2013) 154–171.
[56] Kaiming He, et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, Trans. Pattern Anal. Mach. Intell. (2015) 1904–1916.
[57] Ross Girshick, Fast R-CNN, in: International Conference on Computer Vision, 2015.
[58] Shaoqing Ren, Kaiming He, Ross Girshick, et al., Faster R-CNN: towards real-time object detection with region proposal networks, Trans. Pattern Anal. Mach. Intell. (2017) 1137–1149.
[59] Kaiming He, Georgia Gkioxari, Piotr Dollar, et al., Mask R-CNN, in: International Conference on Computer Vision, 2017.
[60] Joseph Redmon, Santosh Divvala, Ross Girshick, et al., You only look once: unified, real-time object detection, in: Conference on Computer Vision and Pattern Recognition, 2016.
[61] Wei Liu, Dragomir Anguelov, Dumitru Erhan, et al., SSD: single shot multibox detector, in: European Conference on Computer Vision, 2016.
[62] Joseph Redmon, Ali Farhadi, YOLO9000: better, faster, stronger, in: Conference on Computer Vision and Pattern Recognition, 2017.
[63] Joseph Redmon, Ali Farhadi, YOLOv3: an Incremental Improvement, arXiv:1804.02767, 2018.
[64] Han Hu, et al., Relation networks for object detection, in: Conference on Computer Vision and Pattern Recognition, 2018.
[65] Yancheng Bai, et al., SOD-MTGAN: small object detection via multi-task generative adversarial network, in: European Conference on Computer Vision, 2018, pp. 8–14.
[66] Yancheng Bai, et al., Finding tiny faces in the wild with generative adversarial network, in: Conference on Computer Vision and Pattern Recognition, 2018.
[67] Alex Bewley, et al., Simple online and realtime tracking with a deep association metric, in: International Conference on Image Processing, 2017.
[68] Alex Bewley, et al., Simple online and realtime tracking, in: International Conference on Image Processing, 2016.
[69] Xizhou Zhu, et al., Flow-guided feature aggregation for video object detection, in: International Conference on Computer Vision, 2017.
[70] Xizhou Zhu, et al., Deep feature flow for video recognition, in: Conference on Computer Vision and Pattern Recognition, 2017.
[71] Jonathan Huang, et al., Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors, 2017 [Online]. Available: http://image-net.org/challenges/talks_2017/Imagenet2017VID.pdf.
[72] Ron Ohlander, Keith Price, D. Raj Reddy, Picture segmentation using a recursive region splitting method, Comput. Graph. Image Process. (1978) 313–333.
[73] Pedro F. Felzenszwalb, Daniel P. Huttenlocher, Efficient graph-based image segmentation, Int. J. Comput. Vis. (2004) 167–181.
[74] Wenxian Yang, et al., User-friendly interactive image segmentation through unified combinatorial user inputs, Trans. Image Process. (2010) 2470–2479.
[75] Alberto Garcia-Garcia, et al., A Review on Deep Learning Techniques Applied to Semantic Segmentation, arXiv:1704.06857, 2017.
[76] Mark Everingham, et al., The PASCAL visual object classes challenge: a retrospective, Int. J. Comput. Vis. (2015) 98–136.
[77] Marius Cordts, et al., The Cityscapes dataset for semantic urban scene understanding, in: Conference on Computer Vision and Pattern Recognition, 2016.
[78] Gerhard Neuhold, et al., The Mapillary Vistas dataset for semantic understanding of street scenes, in: International Conference on Computer Vision, 2017.
[79] R. Giuly, M. Martone, M. Ellisman, Method: automatic segmentation of mitochondria utilizing patch classification, contour pair classification, and automatically seeded level sets, BMC Bioinf. 13 (1) (2012) 29.
[80] Holger Roth, et al., DeepOrgan: multi-level deep convolutional networks for automated pancreas segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[81] Jonathan Long, Evan Shelhamer, Trevor Darrell, Fully convolutional networks for semantic segmentation, in: Conference on Computer Vision and Pattern Recognition, 2015.
[82] Vijay Badrinarayanan, Alex Kendall, Roberto Cipolla, SegNet: a deep convolutional encoder-decoder architecture for image segmentation, Trans. Pattern Anal. Mach. Intell. (2017) 2481–2495.
[83] Olaf Ronneberger, Philipp Fischer, Thomas Brox, U-Net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
[84] Guosheng Lin, et al., RefineNet: multi-path refinement networks for high-resolution semantic segmentation, in: Conference on Computer Vision and Pattern Recognition, 2017.
[85] Liang-Chieh Chen, et al., Semantic image segmentation with deep convolutional nets and fully connected CRFs, Comput. Sci. (2014) 357–361.
[86] Shuai Zheng, et al., Conditional random fields as recurrent neural networks, in: International Conference on Computer Vision, 2015.
[87] Liang-Chieh Chen, et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, Trans. Pattern Anal. Mach. Intell. (2018) 834–848.
[88] Fisher Yu, Vladlen Koltun, Multi-scale Context Aggregation by Dilated Convolutions, arXiv:1511.07122, 2015.
[89] Hengshuang Zhao, et al., Pyramid scene parsing network, in: Conference on Computer Vision and Pattern Recognition, 2017.
[90] Liang-Chieh Chen, et al., Encoder-decoder with atrous separable convolution for semantic image segmentation, in: European Conference on Computer Vision, 2018.
[91] François Chollet, Xception: deep learning with depthwise separable convolutions, in: Conference on Computer Vision and Pattern Recognition, 2017.
[92] P.O. Pinheiro, R. Collobert, P. Dollar, Learning to segment object candidates, Adv. Neural Inf. Process. Syst. (2015) 1990–1998.
[93] Jifeng Dai, et al., Instance-sensitive fully convolutional networks, in: European Conference on Computer Vision, 2016.
[94] J. Dai, et al., R-FCN: object detection via region-based fully convolutional networks, Adv. Neural Inf. Process. Syst. (2016) 379–387.
[95] Shu Liu, et al., Path aggregation network for instance segmentation, in: Conference on Computer Vision and Pattern Recognition, 2018.
[96] Jeff Dean, Keynote: Recent Advances in Artificial Intelligence via Machine Learning and the Implications for Computer System Design, HotChips, 2017.
[97] Shaaban, Advanced computer architecture: digital signal processing (DSP): architecture & processors, Lecture EECC722. [Online]. Available: http://meseec.ce.rit.edu/eecc722-fall2012/722-10-10-2012.pdf.
[98] Hardware acceleration, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Hardware_acceleration.
[99] S. Mittal, J.S. Vetter, A survey of CPU-GPU heterogeneous computing techniques, ACM Comput. Surv. 47 (4) (2015) 69.
[100] Graphics processing unit, Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Graphics_processing_unit.
[101] Michael Garland, et al., Parallel computing experiences with CUDA, IEEE Micro (2008) 13–27.
[102] John E. Stone, D. Gohara, G. Shi, OpenCL: a parallel programming standard for heterogeneous computing systems, Comput. Sci. Eng. (2010) 66–73.
[103] Magnus Halvorsen, Hardware Acceleration of Convolutional Neural Networks, MS thesis, Norwegian University of Science and Technology, 2015.
[104] Sharan Chetlur, et al., cuDNN: Efficient Primitives for Deep Learning, arXiv:1410.0759, 2014.
[105] Alex Krizhevsky, cuda-convnet2. [Online]. Available: https://code.google.com/archive/p/cuda-convnet2/.
[106] Yangqing Jia, et al., Caffe: convolutional architecture for fast feature embedding, in: International Conference on Multimedia, 2014.
[107] Ronan Collobert, Koray Kavukcuoglu, Clement Farabet, Torch7: a Matlab-like environment for machine learning, in: Conference on Neural Information Processing Systems Workshop, 2011.
[108] TensorFlow. [Online]. Available: https://www.tensorflow.org/.
[109] Grigory Sapunov, Hardware for Deep Learning, Intento, 2018 [Online]. Available: https://blog.inten.to/hardware-for-deep-learning-current-state-and-trends-51c01ebbb6dc.
[110] Pascal GPU Architecture. [Online]. Available: https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/.
[111] Eugenio Culurciello, Computation and Memory Bandwidth in Deep Neural Networks, A Medium Corporation, 2017 [Online]. Available: https://medium.com/@culurciello/computation-and-memory-bandwidth-in-deep-neural-networks-16cbac63ebd5.
[112] NVIDIA Collective Communications Library (NCCL). [Online]. Available: https://developer.nvidia.com/nccl.
[113] Kalin Ovtcharov, et al., Accelerating Deep Convolutional Neural Networks Using Specialized Hardware, Microsoft Research Whitepaper, 2015, pp. 1–4.
[114] Jiantao Qiu, et al., Going deeper with embedded FPGA platform for convolutional neural network, in: International Symposium on Field-Programmable Gate Arrays, 2016.


[115] Eriko Nurvitadhi, et al., Can FPGAs beat GPUs in accelerating next-generation deep neural networks?, in: International Symposium on Field-Programmable Gate Arrays, 2017.
[116] Ritchie Zhao, et al., Accelerating binarized convolutional neural networks with software-programmable FPGAs, in: International Symposium on Field-Programmable Gate Arrays, 2017.
[117] Jeremy Bottleson, et al., clCaffe: OpenCL accelerated Caffe for convolutional neural networks, in: International Parallel and Distributed Processing Symposium Workshops, 2016.
[118] Andrew Lavin, Scott Gray, Fast algorithms for convolutional neural networks, in: Conference on Computer Vision and Pattern Recognition, 2016.
[119] Shmuel Winograd, Arithmetic Complexity of Computations, vol. 33, Society for Industrial and Applied Mathematics, 1980.
[120] Roberto DiCecco, et al., Caffeinated FPGAs: FPGA framework for convolutional neural networks, in: Field-Programmable Technology, 2016.
[121] Murugan Sankaradas, et al., A massively parallel coprocessor for convolutional neural networks, in: Application-Specific Systems, Architectures and Processors, 2009.
[122] Srimat Chakradhar, et al., A dynamically configurable coprocessor for convolutional neural networks, Comput. Architect. News (2010) 247–257.
[123] Clement Farabet, et al., NeuFlow: a runtime reconfigurable dataflow processor for vision, in: Conference on Computer Vision and Pattern Recognition, 2011.
[124] Chen Zhang, et al., Optimizing FPGA-based accelerator design for deep convolutional neural networks, in: International Symposium on Field-Programmable Gate Arrays, 2015.
[125] Atul Rahman, et al., Design space exploration of FPGA accelerators for convolutional neural networks, in: Design, Automation & Test in Europe, 2017.
[126] Y. Li, et al., A GPU-outperforming FPGA accelerator architecture for binary convolutional neural networks, J. Emerg. Technol. Comput. Syst. 14 (2) (2018) 18.
[127] Steven Derrien, Sanjay Rajopadhye, Loop tiling for reconfigurable accelerators, in: Conference on Field Programmable Logic and Applications, 2001.
[128] Baoyuan Liu, et al., Sparse convolutional neural networks, in: Conference on Computer Vision and Pattern Recognition, 2015.
[129] Xiaofan Zhang, et al., High-performance video content recognition with long-term recurrent convolutional network for FPGA, in: Conference on Field Programmable Logic and Applications, 2017.
[130] Tien-Ju Yang, Yu-Hsin Chen, Vivienne Sze, Designing energy-efficient convolutional neural networks using energy-aware pruning, in: Conference on Computer Vision and Pattern Recognition, 2017.
[131] P. Adam, et al., SPARCNet: a hardware accelerator for efficient deployment of sparse convolutional networks, J. Emerg. Technol. Comput. Syst. 13 (3) (2017) 31.
[132] Roberto Rigamonti, et al., Learning separable filters, in: Conference on Computer Vision and Pattern Recognition, 2013.
[133] Matthieu Courbariaux, Yoshua Bengio, Jean-Pierre David, Training Deep Neural Networks with Low Precision Multiplications, arXiv:1412.7024, 2014.
[134] Yufei Ma, et al., Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks, in: International Symposium on Field-Programmable Gate Arrays, 2017.
[135] Naveen Suda, et al., Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks, in: International Symposium on Field-Programmable Gate Arrays, 2016.
[136] Matthieu Courbariaux, et al., Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1, arXiv:1602.02830, 2016.
[137] Shuang Liang, et al., FP-BNN: binarized neural network on FPGA, Neurocomputing (2018) 1072–1086.
[138] Fengfu Li, B. Zhang, B. Liu, Ternary Weight Networks, arXiv:1605.04711v2, 2016.
[139] Jialiang Zhang, Jing Li, Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network, in: International Symposium on Field-Programmable Gate Arrays, 2017.
[140] Intel FPGA, Intel® Stratix® 10 Variable Precision DSP Blocks User Guide, Technical Report, Intel FPGA Group, 2017.
[141] Teng Wang, et al., A Survey of FPGA Based Deep Learning Accelerators: Challenges and Opportunities, arXiv:1901.04988, 2018.
[142] Tong Geng, et al., A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing, in: Conference on Field Programmable Logic and Applications, 2018.
[143] Joe Osborne, Google's Tensor Processing Unit Explained: This Is What the Future of Computing Looks Like, TechRadar, 2016 [Online]. Available: https://www.techradar.com/news/computing-components/processors/google-s-tensor-processing-unit-explained-this-is-what-the-future-of-computing-looks-like-1326915.
[144] Norm Jouppi, Google Supercharges Machine Learning Tasks with TPU Custom Chip, Google Cloud, 2017 [Online]. Available: https://cloud.google.com/blog/products/gcp/google-supercharges-machine-learning-tasks-with-custom-chip.
[145] Naveen Rao, Intel Nervana Neural Network Processors (NNP) Redefine AI Silicon, Intel website, 2017 [Online]. Available: https://www.intel.ai/intel-nervana-neural-network-processors-nnp-redefine-ai-silicon/#gs.1s250i.
[146] Mobileye, The Evolution of EyeQ, Mobileye website, 2018 [Online]. Available: https://www.mobileye.com/our-technology/evolution-eyeq-chip/.
[147] Qualcomm, Snapdragon 845 Mobile Platform. [Online]. Available: https://www.qualcomm.com/products/snapdragon-845-mobile-platform.
[148] Imagination Technologies, PowerVR Vision & AI. [Online]. Available: https://www.imgtec.com/vision-ai/.
[149] James Vincent, The iPhone X's New Neural Engine Exemplifies Apple's Approach to AI, The Verge, 2017 [Online]. Available: https://www.theverge.com/2017/9/13/16300464/apple-iphone-x-ai-neural-engine.
[150] HUAWEI, HUAWEI Reveals the Future of Mobile AI at IFA 2017, HUAWEI website, 2017 [Online]. Available: https://www.huawei.com/en/press-events/news/2017/9/mobile-ai-ifa-2017.
