Choosing a 3D Vision Camera

Cameras that can map the world are useful for the next generation of IoT devices.



Demand for computer vision systems that can perceive and analyze three-dimensional scenes will grow rapidly, driven by the need for machines to interact with the 3D world in real time. Autonomous vehicles (Fig. 1), augmented reality, mixed reality, and face recognition are just a few examples. In this article, we survey the main types of 3D vision camera systems, which capture three-dimensional visual information for use by computer vision algorithms.

Fig. 1: Waymo Autonomous Vehicle Technology

Types of 3D Vision Cameras

Recording 3D scenes requires camera systems that can sense depth information for each pixel of an image, in addition to texture information (e.g. RGB). These cameras are also known as range or depth cameras. There are many types, each with its own pros and cons.

Passive Stereo Camera

Stereo cameras have been around for over 150 years [1]. Originally, they were mainly used for photography and movies to deliver scene depth to human viewers. Stereo cameras have two or more lenses. They mimic the binocular vision of humans.

The depth range depends on the distance between the lenses (interocular distance). Around 2010, the popularity of stereo cameras rose and fell with the 3D TV and movie markets. However, the rapid development of the virtual reality (VR) market over the last two years has driven demand for 360-degree immersive content. The prices of stereoscopic cameras, like LucidCam, have dropped rapidly, making it more affordable for the mass market to generate 3D content.

Besides content generation, stereo cameras can generate a depth map by using the difference in location of an observed object between the left and right camera views (disparity) to measure the depth of the object from the viewer (Fig. 2). To achieve this, the computer vision algorithm needs to accurately identify the corresponding points in the two images that are associated with the same physical point on the object. This matching is a compute-intensive process.

Fig. 2: Disparity of X’s image points (X and X’) on left and right image planes
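
As a rough illustration of the two steps described above, the following Python sketch first searches along one image row for the best-matching patch and then converts the resulting disparity to depth with the standard rectified pinhole relation Z = f * B / d (focal length in pixels, lens separation in meters, disparity in pixels). The matching cost used here, a sum of absolute differences, is just one simple choice among many. The function names, the toy 1D rows, and the focal length and baseline values are assumptions made for illustration and do not come from any particular camera or library.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (zero means infinitely far)")
    return focal_px * baseline_m / disparity_px


def match_along_scanline(left_row, right_row, x_left, half_window, max_disparity):
    """Return the disparity whose right-image patch best matches the left patch.

    The cost is the sum of absolute differences (SAD) over a small window.
    Both rows are assumed to have the same length and to be rectified, so the
    matching point lies on the same row, shifted left in the right image.
    """
    if x_left - half_window < 0 or x_left + half_window >= len(left_row):
        raise ValueError("matching window does not fit inside the left row")
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        x_right = x_left - d
        if x_right - half_window < 0:
            break  # the window would fall off the left edge of the right row
        cost = sum(
            abs(left_row[x_left + k] - right_row[x_right + k])
            for k in range(-half_window, half_window + 1)
        )
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Toy example: the bright feature is shifted three pixels between the views.
left_row = [0, 0, 0, 0, 0, 10, 50, 10, 0, 0, 0, 0]
right_row = [0, 0, 10, 50, 10, 0, 0, 0, 0, 0, 0, 0]

disparity = match_along_scanline(left_row, right_row, x_left=6,
                                 half_window=1, max_disparity=5)
# Hypothetical camera: 800 px focal length, 0.1 m between the lenses.
depth_m = depth_from_disparity(focal_px=800.0, baseline_m=0.1,
                               disparity_px=disparity)
print(disparity, round(depth_m, 2))
```

Even in this toy form, the search runs once per pixel and once per candidate disparity, which hints at why dense matching on full-resolution images is compute-intensive.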

Light Detection and Ranging (LiDAR) Scanner/Pulsed Time of Flight (ToF)

LiDARs use active sensors, which emit energy to illuminate the target object. They send laser pulses, invisible to the human eye, to the object and receive the reflected pulses. They then derive the distance from the round-trip time of each pulse and the speed of light. By rapidly scanning the target area point by point with laser pulses (e.g. using mirrors), a depth map of the scene can be built.
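
The pulsed time-of-flight calculation can be sketched in a few lines of Python. The distance is half the round-trip time multiplied by the speed of light, because the pulse travels to the object and back. The function names and the sample timings below are hypothetical and chosen only for illustration.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact by definition of the meter


def pulsed_tof_distance(round_trip_time_s):
    """Distance to the reflecting surface from a pulse's round-trip time."""
    if round_trip_time_s < 0:
        raise ValueError("round-trip time cannot be negative")
    # Divide by two: the pulse covers the distance twice (out and back).
    return SPEED_OF_LIGHT_M_PER_S * round_trip_time_s / 2.0


def scan_to_depth_map(round_trip_times_s):
    """Convert a 2D grid of per-point round-trip times into depths in meters."""
    return [[pulsed_tof_distance(t) for t in row] for row in round_trip_times_s]


# Hypothetical 2 x 2 scan with round-trip times in seconds.
times = [[100e-9, 200e-9],
         [150e-9, 400e-9]]
depth_map = scan_to_depth_map(times)
print(depth_map)
```

A pulse that returns after 200 nanoseconds corresponds to a surface roughly 30 meters away, which shows how fine the timing resolution must be to resolve small depth differences.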

LiDARs can generate high-precision, high-resolution depth maps. In the past, they were mainly used for applications like terrestrial mapping. Mass-market adoption of LiDAR is now being driven by autonomous vehicle technology, which requires fast and accurate depth mapping of the surrounding area.

However, LiDAR sensors are typically more expensive and bulkier than other depth sensors (Fig. 3). Companies like Velodyne are working on driving down the cost for the autonomous vehicle industry. Next-generation solid-state LiDAR sensors, which promise lower cost and better performance, are being developed.

Fig. 3: Velodyne LiDAR

Continuous Wave Time of Flight (ToF) Camera

The LiDAR scanners described in the previous section are too expensive for the consumer market. Continuous Wave Time-of-Flight (ToF) cameras are another type of ranging camera. They illuminate the full scene with continuous-wave modulated light and receive the reflected light using standard CCD or CMOS sensors. By measuring the phase shift of the received light wave (Fig. 4), the distance between the camera and the reflecting surface can be derived.

Fig. 4: ToF phase shifting of received light wave
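
To make the phase-shift measurement concrete, here is a minimal Python sketch. It assumes a simplified sensor model in which each pixel takes four samples Q_k = B + A * cos(phi - k * pi/2), for k = 0..3; real sensors differ in sign conventions and calibration, so treat the model and the function names as assumptions. The distance follows from d = c * phi / (4 * pi * f), where f is the modulation frequency. Because the phase wraps around every 2 * pi, distances are only unambiguous up to c / (2 * f).

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def phase_from_four_samples(q0, q1, q2, q3):
    """Recover the phase shift, assuming Q_k = B + A*cos(phi - k*pi/2)."""
    # Under that model, q0 - q2 = 2A*cos(phi) and q1 - q3 = 2A*sin(phi).
    return math.atan2(q1 - q3, q0 - q2) % (2.0 * math.pi)


def cw_tof_distance(phase_rad, modulation_hz):
    """Distance from phase shift: d = c * phi / (4 * pi * f)."""
    return SPEED_OF_LIGHT_M_PER_S * phase_rad / (4.0 * math.pi * modulation_hz)


def unambiguous_range(modulation_hz):
    """Largest distance measurable before the phase wraps: c / (2 * f)."""
    return SPEED_OF_LIGHT_M_PER_S / (2.0 * modulation_hz)


# Simulate one pixel with a hypothetical phase shift, amplitude, and offset.
true_phase, amplitude, offset = 1.0, 2.0, 5.0
samples = [offset + amplitude * math.cos(true_phase - k * math.pi / 2.0)
           for k in range(4)]

phase = phase_from_four_samples(*samples)
distance_m = cw_tof_distance(phase, modulation_hz=20e6)
print(round(phase, 6), round(distance_m, 3), round(unambiguous_range(20e6), 3))
```

The ambient offset B and the amplitude A cancel out of the phase estimate, which is one reason this kind of measurement needs little processing per pixel.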

ToF cameras have no moving parts and use relatively low-cost semiconductor components. However, the resolution of ToF cameras tends to be lower. The Kinect V2 motion capture camera is one consumer application of this kind of camera.

Structured Light Camera

A structured light camera uses an active stereovision technique. Instead of measuring disparity between views of two observing cameras, the disparity between a projector and an observing camera is measured. Known light patterns in infrared (IR) are sequentially projected onto an object. The patterns are deformed by the geometric shape of the object.

An IR camera then observes the deformed pattern from a different direction. By analyzing the distortion of the observed pattern, i.e. its disparity from the original projected pattern, depth information can be extracted. The Kinect V1 and the iPhone X TrueDepth camera (Fig. 5a, Fig. 5b) belong to this type of camera.

Fig. 5a: Apple iPhone X dot projection patterns (Tech Insider Video)
Fig. 5b: Apple TrueDepth Camera
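
One well-known family of sequentially projected patterns is binary Gray-code stripes. The article does not say which patterns any particular product uses, so the following Python sketch is only an illustrative assumption, and all function names are hypothetical. Each projector column is assigned a Gray code, one stripe image is projected per bit, and the on/off sequence that a camera pixel observes identifies which projector column lit it. The offset between that projector column and the camera column plays the role of the disparity.

```python
def binary_to_gray(value):
    """Convert a non-negative integer to its reflected Gray code."""
    return value ^ (value >> 1)


def gray_to_binary(gray):
    """Invert binary_to_gray by folding in successively shifted copies."""
    binary = gray
    shift = gray >> 1
    while shift:
        binary ^= shift
        shift >>= 1
    return binary


def patterns_for_column(column, num_patterns):
    """On/off value projected at this column in each stripe image (MSB first)."""
    gray = binary_to_gray(column)
    return [(gray >> bit) & 1 for bit in range(num_patterns - 1, -1, -1)]


def decode_projector_column(observed_bits):
    """Recover the projector column from the bits seen at one camera pixel."""
    gray = 0
    for bit in observed_bits:
        gray = (gray << 1) | (1 if bit else 0)
    return gray_to_binary(gray)


# A camera pixel that saw this on/off sequence was lit by projector column 5.
bits = patterns_for_column(5, num_patterns=3)
print(bits, decode_projector_column(bits))
```

Gray codes are a common choice because neighboring columns differ in only one bit, so a single misread stripe boundary shifts the decoded column by at most one.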

Other Variants

Each of the above camera systems has its own strengths and shortcomings.

LiDAR (Pulsed ToF)
Strengths:
  • No need for ambient light
  • Wide field of view
Weaknesses:
  • High cost
  • Relatively bulky
  • Adversely affected by reflective properties of materials (e.g. translucent, water)
  • Lower refresh rate

Passive Stereo
Strengths:
  • Uses traditional, simple low cost camera
  • Rich visual data for computer vision analytics
  • Works in both indoor and outdoor settings
Weaknesses:
  • Poor low light performance
  • Does not work well with textureless surfaces
  • Requires high processing power to derive depth map

Continuous Wave ToF
Strengths:
  • Simple and compact hardware
  • Requires low processing power
  • High refresh rate
  • No need for ambient light
Weaknesses:
  • Poor outdoor performance (under sunlight)
  • Adversely affected by reflective properties of materials (e.g. translucent, water)
  • Interference by the presence of other ToF cameras

Structured Light
Strengths:
  • No need for ambient light
  • Higher resolution and accuracy than ToF
Weaknesses:
  • Relatively shorter range than ToF
  • Poor outdoor performance (under sunlight)
  • Adversely affected by reflective properties of materials (e.g. translucent, water)
  • Laser speckle pattern on the target surface may not be desirable
Example applications:
  • Face recognition
  • 3D scanner
  • AR/VR, body tracking, industrial
  • Vendors (Structure, Orbbec, Intel RealSense F200, etc.)


Newer breeds of cameras are appearing on the market that combine the above technologies to achieve better performance, a better cost balance, and a wider range of use cases. For example, Intel RealSense cameras combine active IR pattern emitters with IR stereo cameras so that the depth cameras work well in low-light conditions.



Monocular Methods

Besides using range cameras, conventional monocular cameras can be used in combination with multi-view photogrammetry methods to achieve 3D capture and reconstruction of objects and scenes.

One approach is Structure from Motion (SfM). It is similar to how humans and animals perceive the 3D structure of their environment by viewing it from different poses. By moving around a scene or object and capturing numerous 2D images from multiple camera views, the SfM algorithm can reconstruct a detailed 3D representation of that scene or object.

This method has high computation requirements. It is also better suited to use cases where the target under observation is stationary, like 3D scanning of merchandise with a conventional cell phone camera.
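
One building block of multi-view reconstruction is triangulation: once the same point has been identified in two views with known camera positions, its 3D location can be estimated from the two viewing rays. The Python sketch below uses the midpoint method, which returns the point halfway between the closest points on the two rays. It is a simplified illustration, not a full SfM pipeline. It assumes the camera centers and ray directions are already known, whereas a real system must also estimate the camera poses. All names and values are hypothetical.

```python
def dot(u, v):
    """Dot product of two 3D vectors."""
    return sum(a * b for a, b in zip(u, v))


def triangulate_midpoint(center1, dir1, center2, dir2):
    """Estimate a 3D point from two viewing rays (camera center + direction).

    Finds the closest point on each ray to the other ray and returns
    the midpoint between them.
    """
    w = [a - b for a, b in zip(center1, center2)]
    a, b, c = dot(dir1, dir1), dot(dir1, dir2), dot(dir2, dir2)
    d, e = dot(dir1, w), dot(dir2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    s = (b * e - c * d) / denom  # parameter along ray 1
    t = (a * e - b * d) / denom  # parameter along ray 2
    p1 = [ci + s * di for ci, di in zip(center1, dir1)]
    p2 = [ci + t * di for ci, di in zip(center2, dir2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two hypothetical camera positions, one meter apart, observing the same point.
target = [1.0, 2.0, 5.0]
cam1, cam2 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]
ray1 = [t - c for t, c in zip(target, cam1)]
ray2 = [t - c for t, c in zip(target, cam2)]

point = triangulate_midpoint(cam1, ray1, cam2, ray2)
print(point)
```

With noisy image measurements the two rays generally do not intersect, which is why the midpoint (or a more refined estimate) is used rather than an exact intersection.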

The Visual Simultaneous Localization and Mapping (Visual SLAM) method, a real-time variation of SfM, is widely used in robotics. Visual SLAM is used mainly for navigation rather than for 3D rendering of the environment.

Using Visual SLAM, a robot or drone can closely correlate its movement, location, and orientation with the image sequence captured by its camera to develop a 3D map of its environment. This allows it to navigate the space effectively and efficiently.
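
The bookkeeping behind localization and mapping can be sketched in Python, simplified to 2D. The sketch assumes that some visual front end, not shown here, has already estimated the robot's frame-to-frame motion and the position of a landmark relative to the robot. It then accumulates the motions into a global pose and places the landmark in the map. A full Visual SLAM system also involves feature tracking, loop closure, and optimization, none of which appear here. All names and values are hypothetical.

```python
import math


def compose_pose(pose, motion):
    """Apply a motion (dx, dy, dtheta), expressed in the robot frame, to a pose."""
    x, y, theta = pose
    dx, dy, dtheta = motion
    return (
        x + math.cos(theta) * dx - math.sin(theta) * dy,
        y + math.sin(theta) * dx + math.cos(theta) * dy,
        theta + dtheta,
    )


def landmark_to_map(pose, landmark_in_robot_frame):
    """Convert a landmark seen relative to the robot into map coordinates."""
    x, y, theta = pose
    lx, ly = landmark_in_robot_frame
    return (
        x + math.cos(theta) * lx - math.sin(theta) * ly,
        y + math.sin(theta) * lx + math.cos(theta) * ly,
    )


# Hypothetical motion estimates: move 1 m forward while turning left 90 degrees,
# then move 1 m forward again.
pose = (0.0, 0.0, 0.0)
for motion in [(1.0, 0.0, math.pi / 2.0), (1.0, 0.0, 0.0)]:
    pose = compose_pose(pose, motion)

# A landmark observed 2 m straight ahead of the robot at its final pose.
landmark = landmark_to_map(pose, (2.0, 0.0))
print(pose, landmark)
```

Because each pose is built on the previous one, small errors in the motion estimates accumulate over time, which is why practical systems continually correct the trajectory against the map.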

A growing number of cameras can map the world and detect and track objects in three dimensions. These capabilities are useful for the next generation of IoT devices, helping them interact with humans more naturally and handle complex use cases.