Perceptual monocular depth estimation

Pan, Janice S.
Journal Title
Journal ISSN
Volume Title

Monocular depth estimation (MDE), which is the task of using a single image to predict scene depths, has gained considerable interest, in large part owing to the popularity of applying deep learning methods to solve “computer vision problems”. Monocular cues provide sufficient data for humans to instantaneously extract an understanding of scene geometries and relative depths, which is evidence of both the processing power of the human visual system and the predictive power of the monocular data. However, developing computational models to predict depth from monocular images remains challenging. Hand-designed MDE features do not perform particularly well, and even current “deep” models are still evolving. Here we propose a novel approach that uses perceptually-relevant natural scene statistics (NSS) features to predict depths from monocular images in a simple, scale-agnostic way that is competitive with state-of-the-art systems. While the statistics of natural photographic images have been successfully used in a variety of image and video processing, analysis, and quality assessment tasks, they have never been applied in a predictive end-to-end deep-learning model for monocular depth. Here we accomplish this by developing a new closed-form bivariate model of image luminances and use features extracted from this model and from other NSS models to drive a novel deep learning framework for predicting depth given a single image. We then extend our perceptually-based MDE model to fisheye images, which suffer from severe spatial distortions, and we show that our method that uses monocular cues performs comparably to our best fisheye stereo matching approach. Fisheye cameras have become increasingly popular in automotive applications, because they provide a wider (approximately 180 degrees) field-of-view (FoV), thereby giving drivers and driver assistance systems more visibility with minimal hardware. We explore fisheye stereo as it pertains to the problem of automotive surround-view (SV), specifically, which is a system comprising four fisheye cameras positioned on the front, right, rear, and left sides of a vehicle. The SV system perspectively transforms the images captured by these four cameras and stitches them together in a birdseye-view representation of the scene centered around the ego vehicle to display to the driver. With the camera axes oriented orthogonally away from each other and with each camera capturing approximately 180 degrees laterally, there exists an overlap in FoVs between adjacent cameras. It is within these regions where we have stereo vision, and can thus triangulate depths with an appropriate correspondence matching method. Each stereo system within the SV configuration has a wide baseline and two orthogonally-divergent camera axes, both of which make traditional methods for estimating stereo correspondences perform poorly. Our stereo pipeline, which relies on a neural network trained for predicting stereo correspondences, performs well even when the stereo system has limited overlap in FoVs and two dissimilar views. Our monocular approach, however, can be applied to entire fisheye images and does not rely on the underlying geometry of the stereo configuration. We compare these two depth-prediction methods in both performance and application. To explore stereo correspondence matching using fisheye images and MDE in non-fisheye images, we also generated a large-scale photorealistic synthetic database containing co-registered RGB images and depth maps using a simulated SV camera configuration. The database was first captured using fisheye cameras with known intrinsic parameters, and the fisheye distortions were then removed to create the non-fisheye portion of the database. We detail the process of creating the synthetic-but-realistic city scene in which we captured the images and depth maps along with the methodology for generating such a large, varied, and generalizable dataset