The world is three-dimensional, but cameras capture it in two dimensions. Recovering the lost depth dimension -- understanding how far away every object is and reconstructing 3D structure from flat images -- is one of the most important challenges in computer vision. From autonomous vehicles needing to know the distance to obstacles, to augmented reality apps placing virtual objects in real spaces, to robots navigating physical environments, 3D computer vision is essential for AI systems that interact with the physical world.

Depth Estimation: Recovering the Third Dimension

Depth estimation predicts the distance from the camera to every point in a scene. There are several approaches, each with different hardware requirements and accuracy characteristics.

Stereo Depth Estimation

Stereo vision uses two cameras separated by a known distance (the baseline) to estimate depth through triangulation, mimicking how human binocular vision works. By finding corresponding pixels in the left and right images and measuring their displacement (disparity), the depth can be calculated geometrically. Deep learning models like RAFT-Stereo and CREStereo have dramatically improved stereo matching accuracy, handling textureless regions and occlusions that challenged traditional methods.
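The triangulation step reduces to a simple similar-triangles formula for a rectified stereo pair: depth = focal_length × baseline / disparity. A minimal sketch (the function name `disparity_to_depth` is illustrative, not from any particular library):

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m):
    """Convert a disparity map (in pixels) to depth (in meters).

    depth = focal_length * baseline / disparity, from similar
    triangles in a rectified stereo pair. Zero disparity means
    the point is at infinity, so those pixels are masked out.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# A point with 8 px of disparity, seen by a camera with a 720 px
# focal length and a 12 cm baseline: 720 * 0.12 / 8 = 10.8 m
depth = disparity_to_depth(np.array([[8.0]]), focal_px=720.0, baseline_m=0.12)
```

Note the inverse relationship: disparity shrinks as depth grows, which is why stereo accuracy degrades with distance.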

Monocular Depth Estimation

Monocular depth estimation predicts depth from a single image -- a task that seems impossible, since infinitely many 3D scenes can project to the same 2D image. Yet deep learning models learn statistical depth cues from large datasets: perspective, texture gradients, relative object sizes, atmospheric haze, and occlusion patterns.

The breakthrough model MiDaS and more recent models like Depth Anything produce remarkably accurate relative depth maps from single images, enabling applications that previously required expensive depth sensors. Most monocular methods predict relative rather than absolute (metric) depth, but they work with any standard camera.
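A relative depth map can be converted to metric depth by fitting a scale and shift against a handful of known measurements (e.g. a few sparse LiDAR points), a common alignment step in monocular depth evaluation. A minimal least-squares sketch (the helper name `align_scale_shift` is hypothetical; real pipelines such as MiDaS evaluation often align in inverse-depth space):

```python
import numpy as np

def align_scale_shift(relative_depth, metric_samples, sample_idx):
    """Fit metric ~= s * relative + t by least squares against a few
    known metric measurements, then apply the fit to the whole map."""
    r = np.asarray(relative_depth).ravel()[sample_idx]
    A = np.stack([r, np.ones_like(r)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, np.asarray(metric_samples), rcond=None)
    return s * relative_depth + t

relative = np.array([1.0, 2.0, 3.0, 4.0])  # model output (unitless)
# Two sparse metric measurements (meters) pin down scale and shift:
aligned = align_scale_shift(relative, np.array([2.5, 8.5]), [0, 3])
```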

Active Depth Sensors

Active sensors emit their own illumination -- projected patterns (structured light) or light pulses (LiDAR, ToF cameras) -- and measure the return signal. LiDAR produces sparse but highly accurate depth measurements used in autonomous vehicles and surveying. Time-of-Flight (ToF) cameras provide dense depth maps at shorter ranges, powering features like Face ID. Structured light sensors (like the original Kinect) project known patterns and infer depth from their distortion.

The trend in 3D vision is toward making depth accessible everywhere. Where once you needed expensive LiDAR or specialized stereo rigs, today's monocular depth models can estimate depth from any photograph taken with any camera.

Key Takeaway

Depth estimation methods range from hardware-dependent (LiDAR, stereo) to purely computational (monocular). Modern AI has made monocular depth estimation practical, democratizing 3D understanding for applications that cannot use specialized sensors.

Point Clouds: The Language of 3D

A point cloud is a set of 3D points, each with coordinates (x, y, z) and optionally color and other attributes, representing the surface of objects or scenes. Point clouds are produced by LiDAR scanners, depth cameras, and 3D reconstruction algorithms, and they are the most common representation for 3D data in computer vision.
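Depth maps and point clouds are closely linked: given the camera intrinsics of a pinhole model, each depth pixel back-projects to a 3D point via X = (u - cx) · Z / fx and Y = (v - cy) · Z / fy. A minimal sketch (the function name `depth_to_point_cloud` is illustrative):

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map (H x W, meters) into an N x 3 point
    cloud using the pinhole camera model. Pixels with zero or
    negative depth are treated as invalid and dropped."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]

# A flat 2x2 depth map one meter away becomes four 3D points:
pts = depth_to_point_cloud(np.ones((2, 2)), fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```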

Processing Point Clouds with Deep Learning

PointNet (2017) was the pioneering architecture for deep learning on point clouds. It processes each point independently through shared MLPs and uses a symmetric function (max pooling) to aggregate global features, elegantly handling the unordered nature of point sets. PointNet++ added hierarchical processing to capture local structure at multiple scales.
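The key idea -- a shared per-point MLP followed by a symmetric max pool -- can be demonstrated in a few lines. This toy NumPy sketch (random weights, no training) shows why the resulting global feature is invariant to the order of the input points:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "shared MLP": the same weights are applied to every point.
W1 = rng.normal(size=(3, 32))
W2 = rng.normal(size=(32, 64))

def pointnet_global_feature(points):
    """points: (N, 3) array. Per-point MLP, then max pool over N.

    Max pooling is a symmetric function, so the output does not
    depend on point order -- PointNet's answer to unordered sets.
    """
    h = np.maximum(points @ W1, 0)   # shared layer 1 + ReLU
    h = np.maximum(h @ W2, 0)        # shared layer 2 + ReLU
    return h.max(axis=0)             # symmetric aggregation -> (64,)

cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(128)]
feat_a = pointnet_global_feature(cloud)
feat_b = pointnet_global_feature(shuffled)
```

Shuffling the points leaves the global feature unchanged, which is exactly the permutation invariance property the paper exploits.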

More recent approaches include point transformers that apply self-attention to point neighborhoods, sparse convolution networks like MinkowskiNet that efficiently process voxelized point clouds, and KPConv that defines convolution kernels directly in continuous 3D space.

3D Reconstruction and Novel View Synthesis

Neural Radiance Fields (NeRF)

Introduced in 2020, NeRF represents a scene as a continuous volumetric function that maps 3D coordinates and viewing directions to color and density values. A neural network learns this function from a set of 2D images with known camera positions, enabling photorealistic rendering of the scene from novel viewpoints. NeRF produced stunning results but was slow to train and render.
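The rendering step that turns NeRF's learned densities and colors into a pixel is classical volume rendering: each sample along a ray contributes alpha = 1 - exp(-sigma · delta), weighted by the transmittance accumulated in front of it. A minimal NumPy sketch of compositing along a single ray (the function name `render_ray` is illustrative):

```python
import numpy as np

def render_ray(colors, densities, deltas):
    """Composite samples along one ray, as in NeRF's volume renderer.

    colors:    (N, 3) RGB per sample
    densities: (N,)  volume density sigma per sample
    deltas:    (N,)  distance between adjacent samples

    alpha_i = 1 - exp(-sigma_i * delta_i)
    T_i     = prod_{j<i} (1 - alpha_j)   # transmittance
    C       = sum_i T_i * alpha_i * c_i
    """
    alphas = 1.0 - np.exp(-densities * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

# A single, nearly opaque red sample dominates the ray:
c = render_ray(np.array([[1.0, 0.0, 0.0]]), np.array([50.0]), np.array([1.0]))
```

Because this compositing is differentiable, gradients flow from pixel errors back to the network's predicted densities and colors, which is what makes training from 2D images possible.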

3D Gaussian Splatting

3D Gaussian Splatting (2023) has rapidly become the preferred method for 3D scene reconstruction and real-time novel view synthesis. Instead of a neural network, it represents scenes using millions of 3D Gaussian primitives, each with a position, covariance (shape), color, and opacity. These Gaussians are rendered using a differentiable rasterizer that achieves real-time frame rates -- often 100+ FPS -- while matching or exceeding NeRF's visual quality.
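At a single pixel, the rasterizer blends the Gaussians front-to-back in depth order, with each one contributing an alpha that falls off with its Mahalanobis distance from the pixel. A heavily simplified per-pixel sketch (no tiling or projection; `splat_pixel` is an illustrative name, and the Gaussians are assumed already projected to 2D and depth-sorted):

```python
import numpy as np

def splat_pixel(px, means, inv_covs, colors, opacities):
    """Alpha-composite depth-sorted 2D Gaussians at one pixel.

    Each Gaussian contributes alpha = opacity * exp(-0.5 d^T S^-1 d),
    blended front-to-back, as in the 3D Gaussian Splatting rasterizer
    (simplified: one pixel, no tile culling).
    """
    color = np.zeros(3)
    transmittance = 1.0
    for mu, s_inv, c, o in zip(means, inv_covs, colors, opacities):
        d = px - mu
        alpha = o * np.exp(-0.5 * d @ s_inv @ d)
        color += transmittance * alpha * c
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:   # early termination once opaque
            break
    return color

# One red Gaussian centered exactly on the pixel, opacity 0.8:
color = splat_pixel(np.zeros(2), [np.zeros(2)], [np.eye(2)],
                    [np.array([1.0, 0.0, 0.0])], [0.8])
```

The whole pipeline is differentiable, so the positions, covariances, colors, and opacities of the Gaussians are optimized directly by gradient descent against the training images.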

Applications of 3D Vision

Autonomous Driving: 3D perception from LiDAR and cameras enables vehicles to understand the spatial layout of roads, detect and track 3D objects, and predict trajectories of other road users.

Augmented Reality: AR systems must understand the 3D geometry of the real world to accurately place virtual objects, handle occlusion, and maintain stable positioning as the user moves.

Robotics: Robots use 3D vision for navigation, grasping, manipulation, and collision avoidance. Understanding the 3D geometry of objects is essential for a robot to interact with them physically.

Architecture and Construction: 3D scanning and reconstruction create accurate digital twins of buildings and construction sites, enabling progress monitoring, quality verification, and renovation planning.

Cultural Heritage: 3D reconstruction preserves cultural artifacts and historic sites as detailed digital models, enabling virtual tourism and protecting against loss from natural disasters or conflict.

Key Takeaway

3D computer vision is undergoing rapid transformation with techniques like 3D Gaussian Splatting enabling real-time, photorealistic 3D scene reconstruction. Combined with improving monocular depth estimation, 3D understanding is becoming accessible to a broad range of applications beyond specialized industries.