Perceiving pixels and bits: perceptual optimization of image and video encoding pipelines




Chen, Li-heng, 1989-




The use of ℓp norms has largely dominated the measurement of distortion in video encoding and of loss in neural networks, owing to their simplicity and convenient analytical properties. However, when used to assess the loss of visual information, these simple norms correlate poorly with human perception. Given the continuously growing demand for online video, improving the performance of video compression in perceptually meaningful ways has become an important yet challenging problem, since humans are the ultimate receivers of visual signals. The main contribution of this thesis is to provide new directions for the optimization of components in video workflows, covering hybrid video codecs, resizers, and learned image compression models.

The first part of this thesis studies chroma distortions in conventional video compression standards. It is empirically known that human perception is less sensitive to distortions of the chroma components, yet this property has not been studied as thoroughly in the context of video compression. To this end, we carried out a subjective experiment to understand the interplay between luma and chroma distortions. We also found that there is room for reducing bitrate consumption in modern video codecs by creatively increasing the compression factor on the chroma channels.

Video downsampling is another crucial module in adaptive streaming scenarios. This thesis introduces a new data-driven downsampling model realized using deep neural networks. Since convolutional layers can only alter the resolutions of their inputs by integer scale factors, we seek new ways to achieve fractional scaling, which is crucial in many video processing applications.

The second part of this thesis explores the perceptual aspects of optimizing learning-based lossy image compression models.
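To make the fractional-scaling problem concrete: the thesis's downsampler is learned and data-driven, but the resampling geometry it must support can be sketched with a fixed bilinear kernel. The following numpy function (an illustrative stand-in, not the thesis's model) resizes an image by an arbitrary fractional factor, which integer-stride convolutional layers cannot do on their own.

```python
import numpy as np

def bilinear_resize(img, factor):
    """Resize a 2-D array by an arbitrary (fractional) scale factor
    using bilinear interpolation. Illustrative only: the thesis uses
    a learned, data-driven downsampler, not this fixed kernel."""
    h, w = img.shape
    out_h, out_w = max(1, round(h * factor)), max(1, round(w * factor))
    # Map each output pixel center back to input coordinates.
    ys = np.clip((np.arange(out_h) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    # Blend the four nearest input samples for every output pixel.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

Because every operation above is differentiable in the pixel values, a layer like this can sit inside a network and pass gradients through, which is the property the downsampling model relies on.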
Although numerous powerful perceptual models have been proposed to predict the perceived quality of a distorted picture, most of these image quality indexes have never been adopted as loss functions for deep networks, because they are generally non-differentiable. To address this problem, we propose a new "proximal" approach, called ProxIQA training, to optimize image compression networks against quantitative perceptual models. We also describe a search-free resizing framework that can further improve the rate-distortion tradeoff of recent learned image compression models. Our approach is simple: a pair of differentiable downsampling/upsampling layers sandwich a neural compression model. To determine the resize factors for different inputs, we utilize another neural network jointly trained with the compression model, with the end goal of minimizing the rate-distortion objective. In both cases, quantitative simulations and subjective quality studies show that the proposed methods yield significant improvements in coding efficiency. The thesis concludes with remarks on future directions and open problems.
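The "proximal" idea can be caricatured in a few lines: when the target score is non-differentiable, fit a differentiable surrogate to it and optimize through the surrogate instead. The toy numpy sketch below does this with a quantized (hence zero-gradient) stand-in metric and a locally fitted linear surrogate; the metric, the surrogate, and all names here are illustrative assumptions, not the ProxIQA network described in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def metric(x, ref):
    # Stand-in for a non-differentiable quality index: quantization
    # makes its true gradient zero almost everywhere.
    return -np.round(np.sum((x - ref) ** 2), 1)

def surrogate_gradient(x, ref, n=64, eps=0.1):
    # Proximal idea in miniature: fit a differentiable surrogate
    # (a local linear model) to the non-differentiable metric by
    # least squares over random perturbations, then use its slope.
    perturbs = rng.normal(scale=eps, size=(n, x.size))
    scores = np.array([metric(x + p, ref) for p in perturbs])
    A = np.hstack([perturbs, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(A, scores, rcond=None)
    return coef[:-1]                # surrogate's gradient w.r.t. x

ref = np.zeros(8)                   # "pristine" signal
x = rng.normal(size=8)              # "distorted" signal to optimize
for _ in range(200):
    x = x + 0.05 * surrogate_gradient(x, ref)   # ascend quality
```

In the thesis the surrogate is itself a neural network trained to mimic a full-reference perceptual model, so that the compression network can be optimized end-to-end against it; the alternation between fitting the surrogate and optimizing through it is the same as in this sketch.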

