Perceptual quality prediction of social pictures, social videos, and telepresence videos

Date

2022-07-01

Authors

Ying, Zhenqiang

Abstract

The unprecedented growth of online social-media venues and rapid advances in camera and mobile-device technology have led to the creation and consumption of a seemingly limitless supply of images/videos. Given the tremendous prevalence of Internet images/videos, monitoring their perceptual quality is a high-stakes problem. This dissertation focuses on predicting the perceptual quality of social pictures, social videos, and telepresence videos, both by constructing datasets of images/videos labeled with perceptual quality and by designing algorithms that accurately predict the perceptual quality of images/videos.

While considerable effort has been devoted to predicting the perceptual quality of synthetically distorted images/videos, real-world images/videos contain complex, composite mixtures of multiple distortions that are distributed non-uniformly across space/time. The primary goal of my research is to design automatic image/video quality predictors that can effectively tackle these widely diverse authentic distortions. To develop effective quality predictors, we trained deep neural networks on large-scale databases of authentically distorted images/videos. To improve quality prediction by exploiting the non-uniformity of distortions, we collected quality labels both for whole images/videos and for patches/clips cropped from them.
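To make the labeling strategy concrete, here is a minimal PyTorch-style sketch of a training objective that regresses both the whole-picture score and the patch scores. The L1 losses, the weighting term, and the function name are assumptions made for illustration, not the dissertation's actual training recipe.

    import torch.nn.functional as F

    def combined_quality_loss(pred_global, pred_patch, mos_global, mos_patch, patch_weight=1.0):
        # Regress the whole-image/video score and the patch/clip scores jointly,
        # so the network is supervised by both global and local quality labels.
        # (Illustrative objective; not the dissertation's exact loss.)
        loss_global = F.l1_loss(pred_global, mos_global)
        loss_patch = F.l1_loss(pred_patch, mos_patch)
        return loss_global + patch_weight * loss_patch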

For social images, we built the LIVE-FB Large-Scale Social Picture Quality Database, containing about 40K real-world distorted pictures and 120K patches, on which we collected about 4M human judgments of picture quality. Using these picture and patch quality labels, we built deep region-based models that learn to produce state-of-the-art global picture quality predictions as well as useful local picture quality maps. Our innovations include picture quality prediction architectures that produce global-to-local inferences as well as local-to-global inferences (via feedback).
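As a rough illustration of the global-to-local idea (a shared backbone feeding both a whole-picture quality head and a head that scores RoI-pooled patch regions), consider the PyTorch-style sketch below. The class name, backbone choice, and layer sizes are assumptions for this sketch, not the architecture proposed in the dissertation.

    import torch.nn as nn
    from torchvision.models import resnet18
    from torchvision.ops import roi_pool

    class RegionQualityNet(nn.Module):
        # Shared convolutional features; one head predicts whole-picture quality,
        # another predicts the quality of labeled patch regions via RoI pooling.
        def __init__(self):
            super().__init__()
            backbone = resnet18(weights=None)
            self.features = nn.Sequential(*list(backbone.children())[:-2])  # 512-channel feature map
            self.global_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 1))
            self.patch_head = nn.Sequential(nn.Flatten(), nn.Linear(512 * 2 * 2, 1))

        def forward(self, images, patch_boxes):
            # images: (N, 3, H, W); patch_boxes: list of (K_i, 4) boxes in image coordinates
            fmap = self.features(images)
            global_score = self.global_head(fmap)                # (N, 1) picture-level quality
            rois = roi_pool(fmap, patch_boxes, output_size=(2, 2),
                            spatial_scale=fmap.shape[-1] / images.shape[-1])
            patch_scores = self.patch_head(rois)                 # (sum K_i, 1) local quality
            return global_score, patch_scores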

For social videos, we built the Large-Scale Social Video Quality Database, containing 39K real-world distorted videos, 117K space-time localized video patches, and 5.5M human perceptual quality annotations. Using this database, we created two unique blind video quality assessment (VQA) models: (a) a local-to-global region-based blind VQA architecture (called PVQ) that learns to predict global video quality and achieves state-of-the-art performance on three video quality datasets, and (b) a first-of-a-kind space-time video quality mapping engine (called PVQ Mapper).
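The local-to-global direction can be pictured as pooling evidence from space-time patches into one video-level score. The sketch below is a generic stand-in, not the PVQ architecture: it assumes per-clip features have already been extracted, and the layer names and sizes are invented for illustration.

    import torch.nn as nn

    class LocalToGlobalVQA(nn.Module):
        # Scores each space-time patch, then summarizes the local evidence over
        # time into a single video-level quality prediction.
        def __init__(self, feat_dim=512, hidden=128):
            super().__init__()
            self.clip_scorer = nn.Linear(feat_dim, 1)        # local (patch/clip) quality
            self.temporal_pool = nn.GRU(feat_dim, hidden, batch_first=True)
            self.global_head = nn.Linear(hidden, 1)          # global video quality

        def forward(self, clip_feats):
            # clip_feats: (N, T, feat_dim) features of T space-time patches per video
            local_scores = self.clip_scorer(clip_feats).squeeze(-1)   # (N, T) quality map values
            _, h = self.temporal_pool(clip_feats)                     # temporal summary of local evidence
            global_score = self.global_head(h[-1]).squeeze(-1)        # (N,) video-level quality
            return global_score, local_scores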

For telepresence videos, we mitigated the dearth of subjectively labeled telepresence data by collecting 2K telepresence videos from different countries, on which we crowdsourced 80K subjective quality labels. Using this new resource, we created a first-of-a-kind online video quality prediction framework for live streaming, which uses multi-modal learning with separate pathways that compute visual and audio quality predictions. Our all-in-one model is able to provide accurate quality predictions at the patch, frame, clip, and audiovisual levels. Our model achieves state-of-the-art performance on both existing quality databases and our new database, at a considerably lower computational expense, making it an attractive solution for mobile and embedded systems.
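A minimal sketch of the separate-pathway idea follows: one pathway scores visual features, another scores audio features, and a small fusion layer combines them into an audiovisual prediction. The feature dimensions, layer sizes, and names are assumptions; this is not the model described in the dissertation.

    import torch
    import torch.nn as nn

    class AudioVisualQualityNet(nn.Module):
        # Separate visual and audio pathways whose scores are fused into a single
        # audiovisual quality prediction.
        def __init__(self, vis_dim=512, aud_dim=128):
            super().__init__()
            self.visual_head = nn.Sequential(nn.Linear(vis_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            self.audio_head = nn.Sequential(nn.Linear(aud_dim, 64), nn.ReLU(), nn.Linear(64, 1))
            self.fusion = nn.Linear(2, 1)                    # learned weighting of the two modality scores

        def forward(self, visual_feats, audio_feats):
            v = self.visual_head(visual_feats)               # visual-only quality
            a = self.audio_head(audio_feats)                 # audio-only quality
            av = self.fusion(torch.cat([v, a], dim=-1))      # fused audiovisual quality
            return av, v, a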
