Deep Learning Processors For Intelligent IoT Devices

In just a few short years, AI/DL/RL/ML have become important tools for many industries, and we’re now in a rapid innovation cycle.


The Growing Demand For Deep Learning Processors

In the past few years, the Artificial Intelligence field has entered a high-growth phase, driven largely by advances in Machine Learning methodologies like Deep Learning (DL) and Reinforcement Learning (RL). Combinations of those techniques have demonstrated unprecedented performance on a wide range of problems, from playing Go at a superhuman level to diagnosing cancer like a specialist.

In our previous blogs, Intelligent IoT and Fog Computing Trends and The Rise of Ubiquitous Computer Vision In IoT, we talked about some interesting use cases of DL in IoT. The applications will be both broad and deep. They are going to fuel the demand for new breeds of processors in coming decades.

Deep Learning Workflow Overview

DL/RL innovations are happening at an astonishing pace: thousands of papers with new algorithms are presented at numerous AI-related conferences every year. Though it is premature to predict the final winning solutions, hardware companies are racing to build processors, tools, and frameworks. Drawing on researchers’ years of experience, they are trying to identify pain points and bottlenecks in DL workflows (Fig. 1).

Fig. 1: Basic Deep Learning Workflow

Platforms For Training DL Models

Let’s start with training platforms. Systems based on Graphics Processing Units (GPUs) are usually the choice for training advanced DL models. Nvidia realized early on the advantages of using GPUs for general-purpose high-performance computing.

A GPU has hundreds of compute cores that support a large number of hardware threads and high-throughput floating-point computation. Nvidia developed the Compute Unified Device Architecture (CUDA) programming framework to make GPUs friendly for scientists and machine learning experts to use.

The CUDA toolchain has improved over time, giving researchers a flexible and friendly way to implement highly complex algorithms. A few years ago, Nvidia aptly identified the DL opportunity and persistently developed CUDA support for most DL operations. Standard frameworks like Caffe, Torch, and TensorFlow all support CUDA.

In cloud services like AWS, developers can choose between CPUs and GPUs (more specifically, Nvidia GPUs). The choice of platform depends on the complexity of the neural networks, the budget, and the time available. GPU-based systems can usually cut training time severalfold compared with CPUs, but they are more expensive (Fig. 2).

Fig. 2: AWS EC2 GPU Instances
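
The cost tradeoff between CPU and GPU training can be worked through with simple arithmetic. The sketch below is only an illustration: the hours, speedup, and hourly rates are hypothetical numbers chosen for the example, not actual AWS prices or measured benchmarks, and the helper name training_cost is made up here.

```python
def training_cost(hours: float, hourly_rate: float) -> float:
    """Total cost of a training run that is billed by the hour."""
    return hours * hourly_rate


# Hypothetical figures for illustration only (not real prices or benchmarks).
cpu_hours = 40.0    # assumed wall-clock training time on a CPU instance
gpu_speedup = 8.0   # assumed speedup when moving to a GPU instance
cpu_rate = 0.50     # assumed USD per hour for the CPU instance
gpu_rate = 3.00     # assumed USD per hour for the GPU instance

gpu_hours = cpu_hours / gpu_speedup
cpu_cost = training_cost(cpu_hours, cpu_rate)
gpu_cost = training_cost(gpu_hours, gpu_rate)

# The GPU run costs less overall only if the speedup exceeds the price ratio.
break_even_speedup = gpu_rate / cpu_rate

print(f"CPU: {cpu_hours:.1f} h, ${cpu_cost:.2f}")
print(f"GPU: {gpu_hours:.1f} h, ${gpu_cost:.2f}")
print(f"Break-even speedup: {break_even_speedup:.1f}x")
```

With these assumed numbers, the GPU instance costs six times more per hour but finishes eight times faster, so the whole run ends up cheaper as well as quicker. With a smaller speedup, the higher hourly rate would dominate.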

Alternatives to GPU / CPU

Alternatives are coming. Khronos proposed OpenCL in 2009, an open standard for parallel computing on a wide range of hardware such as CPUs, GPUs, DSPs, and FPGAs. It will enable other processors, like AMD GPUs, to enter the DL training market, giving developers more choices.

However, OpenCL still lags behind CUDA in DL library support. Hopefully, that situation will improve in the next few years. Intel is also developing processors customized for DL training through its Nervana acquisition.

Competitive Landscape of DL Inference

DL inference is a very competitive market. Applications can be deployed at multiple levels, usually depending on the requirements of the use cases:

  • Cloud / Enterprise: Image classification, Cybersecurity, Text Analytics, NLP, etc.
  • Smart Gateways: Biometrics, Speech Recognition, Smart Agent, etc.
  • Edge endpoints: Mobile devices, Smart cameras, etc.

Cloud Inference

The cloud inference market will see tremendous growth, with a strong push from internet giants like Google, Facebook, Baidu, and Alibaba. For example, Google Cloud and Microsoft Azure offer very strong image classification, natural language processing, and face recognition APIs that developers can easily integrate into their cloud applications.

Cloud inference platforms will need to support millions of simultaneous users reliably, so the ability to scale throughput is critical. Cutting energy consumption is another top priority, because it keeps the operating cost of these services under control.

In the cloud inference space, data centers are using FPGAs and customized processors in addition to GPUs to make cloud inference applications more cost-effective and power-efficient. For example, Microsoft’s Project Brainwave uses Intel FPGAs to demonstrate strong performance and flexibility in running DL algorithms like CNNs and LSTMs.

Fig. 3: Intel 14nm Stratix FPGA

FPGAs have advantages. Their hardware logic, compute kernels, and memory configurations can be customized for a specific type of neural network, making them more efficient at running a pre-trained model. However, one drawback is that they are harder to program than CPUs or CUDA-based GPUs. As mentioned in the previous section, OpenCL will help make FPGAs friendlier to software developers.

Besides FPGAs, Google is also making a customized processor called the TPU. It is an ASIC that focuses on highly efficient matrix calculations. However, it is only supported within Google’s own services.
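
To make “matrix calculations” concrete: a fully connected neural network layer boils down to a long series of multiply-accumulate (MAC) operations, which is the kind of work these accelerators are built to speed up. The plain-Python sketch below only illustrates that workload. The function name dense_layer and the tiny weights are made up for this example, and it says nothing about how a TPU is implemented internally.

```python
def dense_layer(weights, inputs):
    """Compute outputs[i] = sum_j weights[i][j] * inputs[j].

    Also counts the multiply-accumulate (MAC) operations performed.
    """
    outputs = []
    macs = 0
    for row in weights:
        acc = 0
        for w, x in zip(row, inputs):
            acc += w * x  # one multiply-accumulate
            macs += 1
        outputs.append(acc)
    return outputs, macs


# Illustrative values only: a layer with 3 inputs and 2 outputs.
weights = [[1, 2, 3],
           [4, 5, 6]]
inputs = [1, 0, -1]

outputs, macs = dense_layer(weights, inputs)
print(outputs, macs)
```

Even this toy layer needs 2 x 3 = 6 MACs. Real networks need millions or billions per inference, which is why hardware that performs many MACs in parallel pays off.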

Here are some of the players in DL cloud inference.

  • Customized DL processors: Google TPU, Intel Nervana
  • GPU with DL accelerator: Nvidia Volta (V100), which adds Tensor Cores supporting common matrix operations
  • FPGA: Xilinx, Intel


Embedded DL Inference For Intelligent Edge Computing

On the edge, DL inference solutions need to address a diverse set of requirements for different use cases and markets.

Autonomous Driving Platforms

Autonomous vehicle platforms are currently the hottest market, where state-of-the-art DL and RL methods are being applied to reach the highest level of autonomous driving. Nvidia has been leading the market with several classes of DL SoCs, from Tegra to Xavier. For example, the Xavier SoC is built into Nvidia’s Drive PX platforms, which can achieve up to 320 TOPS (trillion operations per second) and target Level 5 autonomous driving.

Mobile Processors

Another rapidly growing area is mobile application processors. DL enables new smartphone features that were not possible before. One example is the neural engine Apple integrated into its A11 Bionic chip, which enables high-accuracy face recognition for unlocking the iPhone X.

Chinese chipmaker HiSilicon has also released its Kirin 970 processor, which features a Neural Processing Unit (NPU). Some of Huawei’s latest smartphones (Fig. 4) are already built around the new DL processor. For example, using the NPU, the smartphone camera “knows” what it is looking at and adjusts its settings automatically depending on the subject of the scene (e.g. people, plants, landscapes, etc.).

Fig. 4: Huawei Mate 10 Pro – Subject Aware Camera

The following list shows some of the processors for DL inference applications.

  • Nvidia Tegra: Jetson TX1, TX2
  • Nvidia Xavier: Volta architecture with Tensor Cores specifically for DL operations; used in the Drive PX Xavier and PX Pegasus platforms
  • Intel Movidius Myriad: Vision Processing Unit (VPU) targeting computer vision for drones, robotics, etc.
  • Intel Mobileye: Mobileye EyeQ is built specifically for the autonomous driving market
  • Qualcomm Snapdragon 600/800: Neural Network Engine SDK uses the Hexagon DSP and Adreno GPU to build efficient DL inference for edge devices
  • Samsung Exynos 9 Series 9810: targets smartphones, e.g. Galaxy S9
  • HiSilicon/Huawei Kirin 970: targets smartphones, e.g. Huawei Mate and Honor
  • Rockchip RK3399Pro: targets security monitoring, drones, etc.
  • MediaTek Helio P and X series: targets smartphones, e.g. Oppo and Meizu

New Architectures

It is worth mentioning a new category of processors, called neuromorphic processors, which closely mimic the mechanisms of neurons and synapses in the human brain. They can implement a type of neural network called a Spiking Neural Network (SNN), which learns in both the spatial and temporal domains.

In principle, they are much more power-efficient than existing DL architectures and have advantages in tackling online machine learning problems.
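
To give a feel for what “spiking” means, here is a minimal sketch of a leaky integrate-and-fire (LIF) neuron, a simplified model commonly used to explain SNNs. This is an assumption made for illustration: the function name simulate_lif, the threshold, and the leak factor are all invented here, and the code is not meant to describe how TrueNorth or Loihi work internally.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    At each step the membrane potential decays by the `leak` factor, then
    adds the incoming current. When the potential reaches `threshold`, the
    neuron emits a spike (1) and resets to zero; otherwise it emits 0.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset after firing
        else:
            spikes.append(0)
    return spikes


# Illustrative input: three weak pulses, a pause, then one strong pulse.
spikes = simulate_lif([0.5, 0.5, 0.5, 0.0, 0.0, 1.2])
print(spikes)  # [0, 0, 1, 0, 0, 1]
```

The neuron stays silent until enough input has accumulated over time, so the timing of the inputs matters as well as their size. Because nothing is computed or transmitted when there are no spikes, hardware built around this idea can, in principle, save power.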

IBM’s TrueNorth and Intel’s Loihi are based on neuromorphic architectures. Researchers are still exploring what these chips can do, and early results show some potential, but it is unclear when these new types of processors will be ready for broad commercial use. A number of startups, like Applied Brain Research and Brainchip, are also focusing on this area, developing tools and IP.

Fig. 5: Intel Loihi

It’s an Interesting Time

In just a few short years, AI/DL/RL/ML have become important tools for many industries. The underlying ecosystem, from IP, processors, and system designs to toolchains and software methodologies, has entered a rapid innovation cycle. New processors will enable many new IoT use cases that were not feasible before.

However, IoT and Machine Learning use cases are still evolving. It will take generations of processors for chip designers and developers to arrive at the right mix of architectures to address the needs of various markets. We will take a deeper look at compute platforms for various verticals in future articles.