Browsing by Subject "Depth estimation"

Applied statistical modeling of three-dimensional natural scene data
(2014-05) Su, Che-Chun; Bovik, Alan C. (Alan Conrad), 1958-; Cormack, Lawrence K.
Natural scene statistics (NSS) have played an increasingly important role in both our understanding of the function and evolution of the human vision system, and in the development of modern image processing applications. Because depth/range, i.e., egocentric distance, is arguably the most important thing a visual system must compute (from an evolutionary perspective), the joint statistics between natural image and depth/range information are of particular interest. However, while there exist regular and reliable statistical models of two-dimensional (2D) natural images, there has been little work done on statistical modeling of natural luminance/chrominance and depth/disparity, and of their mutual relationships. One major reason is the dearth of high-quality three-dimensional (3D) image and depth/range databases. To facilitate research progress on 3D natural scene statistics, this dissertation first presents a high-quality database of color images and accurately co-registered depth/range maps using an advanced laser range scanner mounted with a high-end digital single-lens reflex camera. By utilizing this high-resolution, high-quality database, this dissertation performs reliable and robust statistical modeling of natural image and depth/disparity information, including new bivariate and spatial oriented correlation models. In particular, these new statistical models capture higher-order dependencies embedded in spatially adjacent bandpass responses projected from natural environments, which have not yet been well understood or explored in the literature. To demonstrate the efficacy of the advanced NSS models, this dissertation addresses two challenging yet very important problems: depth estimation from monocular images and no-reference stereoscopic/3D (S3D) image quality assessment.
A Bayesian depth estimation framework is proposed to consider the canonical depth/range patterns in natural scenes, and it forms priors and likelihoods using both univariate and bivariate NSS features. The no-reference S3D image quality index proposed in this dissertation exploits new bivariate and correlation NSS features to quantify different types of stereoscopic distortions. Experimental results show that the proposed framework and index achieve superior performance to state-of-the-art algorithms in both disciplines.
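As a rough illustration of the Bayesian idea described above (a prior over canonical depth patterns combined with a likelihood of observed NSS features), the following minimal sketch picks the maximum a posteriori pattern for one image patch. It is not the dissertation's method: the diagonal Gaussian likelihood, the function name `map_depth_pattern`, and all parameter names are assumptions made only for this example.

```python
import numpy as np

def map_depth_pattern(feature, means, variances, priors):
    """Pick the most probable canonical depth pattern for one patch.

    Hypothetical setup: each of K canonical patterns has a diagonal
    Gaussian model (means, variances) over a D-dimensional feature
    vector, plus a prior probability. Shapes: feature (D,),
    means and variances (K, D), priors (K,).
    """
    feature = np.asarray(feature, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    priors = np.asarray(priors, dtype=float)

    # Log-likelihood of the feature under each pattern's Gaussian.
    log_lik = -0.5 * np.sum(
        (feature - means) ** 2 / variances + np.log(2.0 * np.pi * variances),
        axis=1,
    )
    # Bayes' rule in the log domain: posterior is likelihood times prior.
    log_post = log_lik + np.log(priors)
    # Normalize so the posterior sums to one.
    log_post -= np.logaddexp.reduce(log_post)
    return int(np.argmax(log_post)), np.exp(log_post)
```

Working in the log domain avoids numerical underflow when the feature vector has many dimensions.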
From active to passive spatial acoustic sensing and applications
(2022-12-21) Sun, Wei (Ph. D. in computer science); Qiu, Lili, Ph. D.; Mok, Aloysius K.; Harwath, David; Yun, Sangki
The active acoustic sensing system emits modulated acoustic waves and analyzes reflection signals. It is dominant in acoustic spatial sensing. On the other hand, the passive acoustic sensing system receives and investigates natural sounds directly. It is good at semantic tasks but has weak performance on spatial sensing. In this dissertation, we bridge three gaps in existing systems. They are the gap between the assumptions of signal processing algorithms and the real acoustic environment, the gap between powerful active spatial sensing and limited passive spatial sensing, and the gap between the semantic features and spatial information. We evolve the acoustic sensing system design and extend its functionality with three novel systems. First, we develop a fully active spatial sensing system, DeepRange, which can adapt to the real environment easily. We develop an effective mechanism to generate synthetic training data that captures noise, speaker/mic distortion, and interference in the signals. This removes the need to collect a large volume of data. We then design a deep range neural network (DRNet) to estimate the distance from raw acoustic signals. It is inspired by the signal processing insight that an ultra-long convolution kernel helps to combat noise and interference. The model is trained entirely on synthetic data, but it robustly achieves sub-centimeter error on real data despite various environments, background noise, interference, and mobile phone models. Second, we develop a fused active and passive spatial sensing system for speech separation, called Spatial Aware Multi-task learning-based Separation (SAMS). We leverage both active sensing and passive sensing to improve AoA estimation and jointly optimize the semantic task and the spatial task. SAMS simultaneously estimates the spatial location and extracts speech for the target user during teleconferencing.
We first generate fine-grained spatial embeddings from the user’s voice and inaudible tracking sound, which contain the user’s position and rich multipath information. Furthermore, we develop a deep neural network with multi-task learning to jointly optimize source separation and localization. We significantly speed up inference to provide a real-time guarantee. Finally, we deeply fuse the semantic features and spatial cues to combat the interference and noise in the real environment as well as enable depth sensing in a fully passive setup. Inspired by the “flash-to-bang” phenomenon (i.e., hearing the thunder after seeing the lightning), we propose FBDepth to measure the depth of the sound source. We formulate the problem as an audio-visual event localization task for collision events. Specifically, FBDepth first aligns correspondence between the video track and audio track to locate the target object and target sound at a coarse granularity. Based on the observation of moving objects’ trajectories, it proposes to estimate the intersection of optical flow before and after the collision to locate video events in time. It feeds the estimated timestamp of the video event and the other modalities for the final depth estimation. We use a mobile phone to collect 3.6K+ video clips involving 24 different objects at up to 60m. FBDepth shows superior performance, especially at long range, compared to monocular and stereo methods.
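The two ranging principles mentioned in this abstract reduce to simple time-of-flight arithmetic. The sketch below is only an illustration of that arithmetic, not of DeepRange or FBDepth themselves: the function names are hypothetical, the speed of sound is assumed to be 343 m/s (dry air at about 20 °C), and light travel time is treated as negligible.

```python
SPEED_OF_SOUND_M_S = 343.0  # assumed value: dry air at about 20 degrees C

def active_echo_range(round_trip_s, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Active sensing: the emitted wave travels to the reflector and back,
    so the one-way distance is half the round-trip path."""
    if round_trip_s < 0:
        raise ValueError("round-trip time must be non-negative")
    return speed_of_sound * round_trip_s / 2.0

def flash_to_bang_depth(t_video_s, t_audio_s, speed_of_sound=SPEED_OF_SOUND_M_S):
    """Passive sensing: the event is seen (almost) instantly and heard later,
    so the delay between the two timestamps gives the one-way distance."""
    delay = t_audio_s - t_video_s
    if delay < 0:
        raise ValueError("audio event cannot precede the video event")
    return speed_of_sound * delay

if __name__ == "__main__":
    # A collision seen at 1.000 s and heard at 1.100 s is roughly 34 m away.
    print(flash_to_bang_depth(1.0, 1.1))
    # A 10 ms echo corresponds to a reflector under 2 m away.
    print(active_echo_range(0.01))
```

At these sound speeds a 1 ms timing error corresponds to about 0.34 m of depth error in the passive case, which is why locating the video and audio events precisely in time matters.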
Perceptual monocular depth estimation
(2020-03-24) Pan, Janice S.; Bovik, Alan C. (Alan Conrad), 1958-; Ghosh, Joydeep; Vikalo, Haris; Huang, Qixing; Mueller, Martin
Monocular depth estimation (MDE), which is the task of using a single image to predict scene depths, has gained considerable interest, in large part owing to the popularity of applying deep learning methods to solve “computer vision problems”. Monocular cues provide sufficient data for humans to instantaneously extract an understanding of scene geometries and relative depths, which is evidence of both the processing power of the human visual system and the predictive power of the monocular data. However, developing computational models to predict depth from monocular images remains challenging. Hand-designed MDE features do not perform particularly well, and even current “deep” models are still evolving. Here we propose a novel approach that uses perceptually-relevant natural scene statistics (NSS) features to predict depths from monocular images in a simple, scale-agnostic way that is competitive with state-of-the-art systems. While the statistics of natural photographic images have been successfully used in a variety of image and video processing, analysis, and quality assessment tasks, they have never been applied in a predictive end-to-end deep-learning model for monocular depth. Here we accomplish this by developing a new closed-form bivariate model of image luminances and using features extracted from this model and from other NSS models to drive a novel deep learning framework for predicting depth given a single image. We then extend our perceptually-based MDE model to fisheye images, which suffer from severe spatial distortions, and we show that our method, which uses monocular cues, performs comparably to our best fisheye stereo matching approach. Fisheye cameras have become increasingly popular in automotive applications, because they provide a wider (approximately 180 degrees) field-of-view (FoV), thereby giving drivers and driver assistance systems more visibility with minimal hardware.
We explore fisheye stereo as it pertains specifically to the problem of automotive surround-view (SV), a system comprising four fisheye cameras positioned on the front, right, rear, and left sides of a vehicle. The SV system perspectively transforms the images captured by these four cameras and stitches them together in a bird's-eye-view representation of the scene centered around the ego vehicle to display to the driver. With the camera axes oriented orthogonally away from each other and with each camera capturing approximately 180 degrees laterally, there exists an overlap in FoVs between adjacent cameras. It is within these regions that we have stereo vision and can thus triangulate depths with an appropriate correspondence matching method. Each stereo system within the SV configuration has a wide baseline and two orthogonally-divergent camera axes, both of which make traditional methods for estimating stereo correspondences perform poorly. Our stereo pipeline, which relies on a neural network trained for predicting stereo correspondences, performs well even when the stereo system has limited overlap in FoVs and two dissimilar views. Our monocular approach, however, can be applied to entire fisheye images and does not rely on the underlying geometry of the stereo configuration. We compare these two depth-prediction methods in both performance and application. To explore stereo correspondence matching using fisheye images and MDE in non-fisheye images, we also generated a large-scale photorealistic synthetic database containing co-registered RGB images and depth maps using a simulated SV camera configuration. The database was first captured using fisheye cameras with known intrinsic parameters, and the fisheye distortions were then removed to create the non-fisheye portion of the database.
We detail the process of creating the synthetic-but-realistic city scene in which we captured the images and depth maps along with the methodology for generating such a large, varied, and generalizable dataset.
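Once a correspondence has been matched in the overlapping FoV of two adjacent cameras, depth can be triangulated from the two viewing rays. The sketch below shows generic midpoint triangulation, which does not require rectified or parallel cameras and so is one plausible fit for wide-baseline, divergent-axis setups. It is not the dissertation's pipeline: it assumes each matched pixel has already been converted to a 3D bearing direction using the known camera intrinsics, and the function name `triangulate_midpoint` is hypothetical.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Triangulate a 3D point from two viewing rays.

    c1, c2: camera centers in a common world frame, shape (3,).
    d1, d2: ray directions toward the matched point, shape (3,).
    Returns the estimated point and its range along the first ray.
    Note: near-parallel rays make the problem ill-conditioned.
    """
    c1, d1, c2, d2 = (np.asarray(v, dtype=float) for v in (c1, d1, c2, d2))
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)

    # Find ray parameters s, t minimizing |(c1 + s*d1) - (c2 + t*d2)|.
    A = np.stack([d1, -d2], axis=1)  # shape (3, 2)
    (s, t), *_ = np.linalg.lstsq(A, c2 - c1, rcond=None)

    p1 = c1 + s * d1  # closest point on ray 1
    p2 = c2 + t * d2  # closest point on ray 2
    return 0.5 * (p1 + p2), float(s)

if __name__ == "__main__":
    # Two cameras 2 m apart observing a point 5 m ahead, midway between them.
    point, rng = triangulate_midpoint([0, 0, 0], [1, 0, 5], [2, 0, 0], [-1, 0, 5])
    print(point, rng)
```

The midpoint of the two closest points is used because noisy rays rarely intersect exactly; with perfect correspondences both closest points coincide with the true 3D point.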