Depth Estimation in Autonomous Driving using Stereo Camera — Computer Vision (Part-1)

Sai Shiva Ellendula
6 min read · Mar 17, 2021


Introduction

Self-driving cars require accurate depth perception for safe operation. If you don't know how far away the cars in front of you are, how can you avoid them while driving? LIDAR and Radar sensors are usually thought of as the primary 3D sensors available for perception tasks. However, we can also obtain depth information from two or more cameras using multi-view geometry.

Measuring the distance of an object from a camera is challenging but useful for interesting use cases such as Autonomous Driving, Augmented Reality, and 3D Scene Reconstruction. Depth is a key parameter for Perception, Navigation and Trajectory Planning. In Computer Vision and Robotics, depth estimation is commonly done via Stereo Vision, using a special type of camera called a Stereo Camera.

Each eye views the visual world from a slightly different horizontal position, so each eye's image differs slightly from the other. Objects at different distances from the eyes project images onto the two retinas that differ in their horizontal positions, giving depth cues of horizontal disparity, also known as binocular disparity.

What is Stereo Camera?

A Stereo Camera is a type of Camera with two or more Image Sensors. This allows the Camera to simulate Human Binocular Vision and therefore gives it the ability to perceive depth.

Source: e-con Systems

Human binocular vision perceives depth using Stereo Disparity, which refers to the difference in the image location of an object seen by the left and right eyes, resulting from the eyes' horizontal separation. A Stereo Camera uses a similar approach by capturing the same scene from two different views. Depth perception is then carried out with a geometric approach called Triangulation.

Geometry of Stereo Sensor

A stereo sensor is usually created by two cameras with parallel optical axes. To simplify the problem even more, most manufacturers align the cameras in 3D space so that the two image planes are aligned with only an offset in the x-axis.

There are two important parameters of the Stereo Sensor:

  1. Focal Length f: Distance between Camera Center and Image Plane
  2. Baseline b: Distance along the shared x-axis between left and right Camera centers.

By defining a baseline to represent the transformation between the two camera coordinate frames, we are assuming that the Rotation matrix is identity and there is only a non-zero x component in the Translation vector. The R and T transformation therefore boils down to a single baseline parameter b.
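For instance, a minimal sketch of this idealised extrinsic setup in NumPy (the baseline value here is hypothetical):

import numpy as np

b = 0.54                      # baseline in metres (hypothetical value)
R = np.eye(3)                 # rotation between the cameras: identity
t = np.array([b, 0.0, 0.0])   # translation: only a non-zero x component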

Computing 3D Point Co-ordinates

Given the baseline b, the focal length f, and the coordinates of the projections of a point O onto the left and right image planes, we can see two similar triangles formed by the left camera measurement as follows.

The triangle formed by the depth z and the position x is similar to the triangle formed by the focal length f and the x-component of the left measurement, xl. From this similarity we can write z / f = x / xl. The same can be done for the right measurement, but with the baseline offset included: the two triangles are now defined by z and the distance x − b, and by the focal length f and the x-component of the right measurement, xr. This gives a second equation, z / f = (x − b) / xr, relating z to x via the right camera measurement.

From these two equations, we can now derive the 3D coordinates of the point O. We define the disparity d to be the difference between the image coordinates of the same point in the left and right images, d = xl − xr. We can easily transform between image and pixel coordinates using the x and y offsets of the principal point, u0 and v0.

We then use the two similar-triangle equations to solve for z: subtracting the second from the first gives z = f · b / d. From there we use z to compute x = z · xl / f, and repeating the same derivation in the y direction gives y = z · yl / f. The three components of the point's position are now explicitly available from the two sets of pixel measurements.
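Putting the derivation together, here is a minimal sketch in Python (the function and variable names are hypothetical; ul, vl and ur are pixel coordinates, and u0, v0 are the principal-point offsets):

def point_from_stereo(ul, vl, ur, f, b, u0, v0):
    """Recover the 3D position of a point from its left pixel (ul, vl)
    and the right pixel column ur, given focal length f (in pixels),
    baseline b (in metres) and principal point (u0, v0)."""
    # Convert pixel coordinates to image coordinates using the offsets.
    xl, yl = ul - u0, vl - v0
    xr = ur - u0
    # Disparity: difference of the image x-coordinates.
    d = xl - xr
    # Similar triangles: z / f = x / xl and z / f = (x - b) / xr
    # together give z = f * b / d, and then x and y follow.
    z = f * b / d
    x = z * xl / f
    y = z * yl / f
    return x, y, z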

Computing the Disparity

As said, disparity is the difference in the image location of the same 3D point as observed by two different cameras. To compute the disparity, we need to be able to find the same point in the left and right stereo camera images. This problem is known as the stereo correspondence problem. The most naive solution is an exhaustive search, where we search the whole right image for every pixel in the left image. Such a solution is extremely inefficient and will usually not run in real time, which rules it out for self-driving cars. It's also unlikely to succeed, as many pixels have similar local characteristics, making it difficult to match them correctly.

Luckily for us, we can use stereo geometry to constrain our search problem from 2D over the entire image space to a 1D line. We've already determined how a single point is projected onto both cameras. Now, let's move our 3D point along the ray connecting it with the left camera's center. Its projection on the left image plane does not change. The projection on the right image plane, however, moves along a horizontal line. This line is called an epipolar line and follows directly from the fixed lateral offset and the image-plane alignment of the two cameras in a stereo pair. We can therefore constrain our correspondence search to lie along the epipolar line, reducing the search from 2D to 1D.

Stereo Algorithm

Given a frame from logged video, make sure you have a calibrated Stereo Camera and have rectified the images.

Source: "Learning OpenCV 3: Computer Vision in C++ with the OpenCV Library" by Adrian Kaehler and Gary Bradski, published by O'Reilly Media, Inc., p. 735
  1. Take each pixel on the epipolar line in the left image.
  2. Compare these left image pixels to every pixel in the right image on the same epipolar line.
  3. Pick the pixel with the minimum cost. For example, a very simple cost is the squared difference in pixel intensities.
  4. Compute the disparity d = xl − xr.

This relation, z = f · b / d, says that the depth of a point in the scene is inversely proportional to the disparity between its corresponding image points. A brute-force sketch of the matching steps above is shown below.
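Here is a minimal sketch of those four steps in Python, assuming rectified grayscale images given as NumPy arrays (the function name and parameters are illustrative, not production code):

import numpy as np

def naive_disparity(left, right, max_disparity=64, block=5):
    """Brute-force disparity: for every pixel in the left image, slide a
    small block along the same row of the right image and keep the shift
    with the lowest sum-of-squared-differences cost.
    Illustrative only -- far too slow for real-time use."""
    h, w = left.shape
    half = block // 2
    disparity = np.zeros((h, w), dtype=np.float32)
    left = left.astype(np.float32)
    right = right.astype(np.float32)
    for y in range(half, h - half):
        for x in range(half + max_disparity, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            costs = []
            for d in range(max_disparity):
                cand = right[y - half:y + half + 1,
                             x - d - half:x - d + half + 1]
                costs.append(np.sum((patch - cand) ** 2))
            disparity[y, x] = np.argmin(costs)  # shift with minimum cost
    return disparity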

Input: Here is the image generated from CARLA Simulator

Output (Disparity Map):

Disparity Map
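In practice, instead of the brute-force search above, the disparity map is usually computed with one of OpenCV's built-in matchers. A minimal sketch, assuming rectified grayscale frames saved as left.png and right.png (hypothetical file names):

import cv2

# Load the rectified left/right frames as grayscale.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities must be a multiple of 16.
matcher = cv2.StereoSGBM_create(minDisparity=0,
                                numDisparities=64,
                                blockSize=11)

# compute() returns fixed-point disparities scaled by 16, so rescale.
disparity = matcher.compute(left, right).astype("float32") / 16.0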

Generate Depth Map

Now, we will derive the depth from a pair of images taken with a stereo camera setup. The sequence of this procedure is:

  1. Get the focal length f from the K (Intrinsic Parameter) matrix
  2. Compute the baseline b using corresponding values from the Translation vectors t
  3. Compute the depth map of the image (a sketch of these steps follows below)
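A minimal sketch of these three steps, assuming a 3×3 intrinsic matrix k_left, translation vectors t_left and t_right, and the disparity map from the previous section (all names are hypothetical):

import numpy as np

def depth_from_disparity(disparity, k_left, t_left, t_right):
    """Steps 1-3: read f from the intrinsic matrix, derive the baseline
    from the two translation vectors, then invert disparity into depth."""
    f = k_left[0, 0]                      # focal length in pixels
    b = abs(t_left[0] - t_right[0])       # baseline along the shared x-axis
    d = disparity.astype(np.float32).copy()
    d[d <= 0] = 0.1                       # avoid division by zero
    return f * b / d                      # depth = f * b / d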

Here the depth at each pixel is z = f · b / d, where 𝑑 is the disparity map we already computed in one of the previous steps. Here is our final depth map:

Depth Map

If you have any questions or ideas — feel free to write to me or leave a comment in the comments section!

In the next part, we will write the code for depth estimation!


