Deep Learning Processors For Intelligent IoT Devices

- Last Updated: December 2, 2024

Frank Lee



In the past few years, the field of artificial intelligence (AI) has entered a high-growth phase, driven largely by advances in machine learning methods such as Deep Learning (DL) and Reinforcement Learning (RL). Combinations of these techniques have demonstrated unprecedented performance on a wide range of problems, from playing Go at a superhuman level to diagnosing cancer like a specialist.
In our previous posts, "Intelligent IoT and Fog Computing Trends" and "The Rise of Ubiquitous Computer Vision In IoT", we discussed some interesting applications of DL in IoT. These applications will be both broad and deep, and they will fuel demand for new breeds of processors in the coming decades.
DL/RL innovations are happening at an astonishing pace: thousands of papers with new algorithms are presented at AI-related conferences every year. Although it is premature to predict which solutions will win, hardware companies are racing to build processors, tools, and frameworks. Drawing on researchers' years of experience, they are trying to identify the pain points and bottlenecks in DL workflows (Fig. 1).

Fig. 1: Basic Deep Learning Workflow
Let’s start with training platforms. Systems based on Graphics Processing Units (GPUs) are usually the choice for training advanced DL models. Nvidia realized long ago the advantages of using GPUs for general-purpose high-performance computing.
A GPU has hundreds or thousands of compute cores that support a large number of hardware threads and high-throughput floating-point computation. Nvidia developed the Compute Unified Device Architecture (CUDA) programming framework to make GPUs approachable for scientists and machine learning experts.
The CUDA toolchain has improved over time, giving researchers a flexible and friendly way to implement highly complex algorithms. A few years ago, Nvidia identified the DL opportunity and has since persistently developed CUDA support for most DL operations. Standard frameworks like Caffe, Torch, and TensorFlow all support CUDA.
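To make the "large number of hardware threads" idea more concrete, here is a minimal sketch of the indexing pattern CUDA programs typically use, where each logical thread derives a global index from its block and thread position and handles one element. This is a plain-Python emulation that runs sequentially, not real GPU code; the function names `vector_add_kernel` and `launch` are hypothetical and chosen only for illustration.

```python
# Plain-Python emulation of the CUDA thread-indexing pattern (not real GPU code).
# On a GPU, the (block, thread) pairs below would run in parallel in hardware.

def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Body executed by one logical thread: handles at most one element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):  # guard: the last block may have surplus threads
        out[i] = a[i] + b[i]


def launch(kernel, n, block_dim, *args):
    """Sequentially emulate a 1-D grid launch covering n elements."""
    grid_dim = (n + block_dim - 1) // block_dim  # ceil(n / block_dim) blocks
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)
    return grid_dim


a = [float(i) for i in range(10)]
b = [2.0 * i for i in range(10)]
out = [0.0] * len(a)
blocks = launch(vector_add_kernel, len(a), 4, a, b, out)
print(blocks, out)
```

With 10 elements and 4 threads per block, the launch needs 3 blocks, and the bounds check keeps the 2 surplus threads in the last block from writing out of range.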
In cloud services like AWS, developers can choose between CPU and GPU (more specifically, Nvidia GPU) instances. The choice of platform depends on the complexity of the neural network, the budget, and the time available. GPU-based systems usually train models several times faster than CPU-based ones, but they are more expensive (Fig. 2).

Fig. 2: AWS EC2 GPU Instances
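The CPU versus GPU trade-off described above comes down to simple arithmetic: a GPU instance costs more per hour but finishes sooner. The sketch below works through one case. All prices, training times, and the speedup factor are hypothetical assumptions, not real AWS figures.

```python
# Hypothetical numbers for illustration only; they are not real AWS prices
# or measured training times.

def training_cost(hours, price_per_hour):
    """Total cost of one training run in dollars."""
    return hours * price_per_hour


cpu_hours = 40.0    # assumed CPU training time
gpu_speedup = 8.0   # assumed "several times" speedup on a GPU
cpu_price = 0.5     # assumed $/hour for a CPU instance
gpu_price = 3.0     # assumed $/hour for a GPU instance

gpu_hours = cpu_hours / gpu_speedup
cpu_cost = training_cost(cpu_hours, cpu_price)
gpu_cost = training_cost(gpu_hours, gpu_price)

# The GPU run is cheaper overall only if its speedup exceeds the price ratio.
break_even_speedup = gpu_price / cpu_price

print(f"CPU: {cpu_hours:.1f} h, ${cpu_cost:.2f}")
print(f"GPU: {gpu_hours:.1f} h, ${gpu_cost:.2f}")
print(f"Break-even speedup: {break_even_speedup:.1f}x")
```

Under these assumptions the GPU run is both faster (5 hours instead of 40) and cheaper ($15 instead of $20), because the assumed 8x speedup is larger than the 6x price ratio. With a smaller speedup, the GPU would still save time but cost more.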
Alternatives are coming. The Khronos Group introduced OpenCL in 2009 as an open standard for parallel computing on a wide range of hardware, including CPUs, GPUs, DSPs, and FPGAs. It will enable other processors, such as AMD GPUs, to enter the DL training market, giving developers more choices.
However, OpenCL still lags behind CUDA in DL library support. Hopefully, that situation will improve in the next few years. Intel is also developing processors customized for DL training through its Nervana acquisition.
DL inference is a very competitive market. Applications can be deployed at multiple levels, from the cloud to the edge, depending on their requirements.
The cloud inference market will see tremendous growth, with a strong push from internet giants like Google, Facebook, Baidu, and Alibaba. For example, Google Cloud and Microsoft Azure offer very strong image classification, natural language processing, and face recognition APIs that developers can easily integrate into their cloud applications.
Cloud inference platforms need to support millions of simultaneous users reliably, so the ability to scale throughput is critical. Cutting energy consumption is another top priority, because it directly affects the operating cost of these services.
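A rough capacity estimate shows how throughput and energy use translate into hardware count and operating cost. The sketch below is a back-of-the-envelope calculation only; the request rate, per-device throughput, power draw, and electricity price are all hypothetical assumptions, and the helper names are made up for this example.

```python
import math

# All figures below are hypothetical and chosen only to keep the arithmetic easy.

def accelerators_needed(peak_requests_per_s, per_device_requests_per_s, headroom):
    """Devices required so that each one runs at no more than `headroom` utilization."""
    usable = per_device_requests_per_s * headroom
    return math.ceil(peak_requests_per_s / usable)


def yearly_energy_cost(devices, watts_per_device, dollars_per_kwh):
    """Electricity cost of running `devices` around the clock for one year."""
    kwh = devices * watts_per_device / 1000 * 24 * 365
    return kwh * dollars_per_kwh


devices = accelerators_needed(
    peak_requests_per_s=50_000,
    per_device_requests_per_s=2_000,
    headroom=0.5,
)
cost = yearly_energy_cost(devices, watts_per_device=250, dollars_per_kwh=0.10)
print(devices, round(cost))
```

In this model, doubling per-device throughput at the same power draw halves both the device count and the energy bill, which is why efficiency per watt matters so much at data center scale.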
In the cloud inference space, data centers are using FPGAs and customized processors in addition to GPUs to make inference applications more cost-effective and power-efficient. For example, Microsoft's Project Brainwave uses Intel FPGAs and has demonstrated strong performance and flexibility in running DL models such as CNNs and LSTMs.

Fig. 3: Intel 14nm Stratix FPGA
FPGAs have distinct advantages: their hardware logic, compute kernels, and memory configuration can be customized for a specific type of neural network, which makes them more efficient at running a pre-trained model. One drawback, however, is that they are harder to program than CPUs or CUDA-based GPUs. As mentioned in the previous section, OpenCL should help make FPGAs friendlier to software developers.
Besides FPGAs, Google has built a custom processor called the Tensor Processing Unit (TPU). It is an ASIC that focuses on highly efficient matrix calculations. However, it is available only through Google’s own services.
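To see why a chip dedicated to matrix calculations is attractive for DL inference, consider a single fully connected layer. The minimal sketch below writes the layer as explicit loops and counts the multiply-accumulate (MAC) operations it performs. It is an illustration of the workload, not a description of how a TPU is implemented; the function name `dense_forward` and the tiny weights are assumptions made for this example.

```python
def dense_forward(x, weights, bias):
    """One fully connected layer: y = x . W + b, written as explicit loops."""
    n_in, n_out = len(weights), len(weights[0])
    y = list(bias)
    macs = 0  # multiply-accumulate operations performed
    for j in range(n_out):
        for i in range(n_in):
            y[j] += x[i] * weights[i][j]
            macs += 1
    return y, macs


x = [1.0, 2.0, 3.0]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
bias = [0.5, -0.5]

y, macs = dense_forward(x, weights, bias)
print(y, macs)
```

The MAC count is always the number of inputs times the number of outputs, so a layer with 4096 inputs and 4096 outputs already needs about 16.8 million MACs per input sample. Hardware that performs many of these operations in parallel can therefore speed up most of the inference workload.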
Here are some of the players in DL cloud inference.
| Categories | Processors | Remarks |
| Customized DL processors | Google TPU, Intel Nervana | |
| GPU with DL accelerator | Nvidia Volta (V100) | |
| FPGA | Xilinx, Intel | |
At the edge, DL inference solutions need to address a diverse set of requirements for different applications and markets.
Autonomous vehicle platforms are currently the hottest market, where state-of-the-art DL and RL methods are being applied to achieve the highest level of autonomous driving. Nvidia has been leading this market with several classes of DL SoCs, from Tegra to Xavier. For example, the Xavier SoC is built into Nvidia's Drive PX platforms, the most powerful of which, Drive PX Pegasus, is rated at up to 320 trillion operations per second (TOPS) and targets Level 5 autonomous driving.
Another rapidly growing area is mobile application processors. DL enables smartphone features that were not possible before. One example is the Neural Engine that Apple integrated into its A11 Bionic chip, which enables high-accuracy face recognition for unlocking the iPhone X.
Chinese chipmaker HiSilicon has also released its Kirin 970 processor, which features a Neural Processing Unit (NPU). Some of Huawei’s latest smartphones (Fig. 4) are already built around the new DL processor. For example, using the NPU, the smartphone camera "knows" what it is looking at and adjusts its settings automatically depending on the subject of the scene (e.g. people, plants, landscapes, etc.).

Fig. 4: Huawei Mate 10 Pro - Subject Aware Camera
The following table lists some of the processors for DL inference applications at the edge.
| Company | Chip | Remarks |
| Nvidia | Tegra | Jetson TX1, TX2 |
| Nvidia | Xavier | Volta architecture with Tensor Cores specifically for DL operations. Drive PX Xavier and Drive PX Pegasus platforms |
| Intel | Movidius Myriad | Vision Processing Unit (VPU) targeting computer vision for drones, robotics, etc. |
| Intel | Mobileye EyeQ | Built specifically for the autonomous driving market |
| Qualcomm | Snapdragon 600/800 | Neural Network Engine SDK uses the Hexagon DSP and Adreno GPU to build efficient DL inference for edge devices |
| Samsung | Exynos 9 Series 9810 | Targets smartphones, e.g. Galaxy S9 |
| HiSilicon/Huawei | Kirin 970 | Targets smartphones, e.g. Huawei Mate and Honor |
| Rockchip | RK3399Pro | Targets security monitoring, drones, etc. |
| MediaTek | Helio P and X series | Targets smartphones, e.g. Oppo and Meizu |
It is worth mentioning a new category of processors, called neuromorphic processors, which closely mimic the behavior of neurons and synapses in the human brain. They implement a type of neural network called a Spiking Neural Network (SNN), which learns in both the spatial and temporal domains.
In principle, they are much more power-efficient than existing DL architectures and have advantages in tackling online machine learning problems.
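The temporal behavior of a spiking neuron can be illustrated with a leaky integrate-and-fire (LIF) model, a common textbook simplification. The sketch below is only that: it does not describe how any particular neuromorphic chip works, and the threshold, leak factor, and input values are arbitrary assumptions.

```python
def simulate_lif(input_current, threshold=1.0, leak=0.9, reset=0.0):
    """Leaky integrate-and-fire neuron, one step per input sample.

    Returns a list of 0/1 spikes with the same length as the input.
    """
    potential = 0.0
    spikes = []
    for current in input_current:
        potential = leak * potential + current  # leak, then integrate
        if potential >= threshold:
            spikes.append(1)
            potential = reset  # fire and reset
        else:
            spikes.append(0)
    return spikes


weak = simulate_lif([0.3] * 10)
strong = simulate_lif([0.6] * 10)
print(weak)    # spikes on every 4th step
print(strong)  # spikes on every 2nd step
```

The neuron encodes input strength in the timing and rate of its spikes rather than in a single output value, and it produces no output at all on most steps. That sparse, event-driven activity is the intuition behind the power-efficiency argument for neuromorphic hardware.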
IBM’s TrueNorth and Intel’s Loihi are based on neuromorphic architectures. Researchers are still exploring the capabilities of these chips, with some promising early results, but it is unclear when they will be ready for broad commercial use. A number of startups, such as Applied Brain Research and BrainChip, are also focusing on this area, developing tools and IP.

Fig. 5: Intel Loihi
In just a few short years, AI, ML, DL, and RL have become important tools for many industries. The underlying ecosystem, from IP, processors, and system designs to toolchains and software methodologies, has entered a rapid innovation cycle. New processors will enable many IoT applications that were not feasible before.
However, IoT and machine learning applications are still evolving. It will take generations of processors for chip designers and developers to arrive at the right mix of architectures to address the needs of various markets. We will take a deeper look into compute platforms for various verticals in future articles.
