Deep learning solutions for video encoding and streaming




Paul, Somdyuti




Video data has emerged as the top contributor to global internet traffic, and video compression is the key technology that enables its efficient storage, transmission, and retrieval. As video compression technology advances to keep pace with the proliferation of video data, state-of-the-art video codecs that rely on block-based hybrid coding are becoming increasingly complex and computationally intensive. Moreover, it currently appears challenging to significantly improve video compression efficiency by relying solely on traditional approaches. Consequently, deep learning techniques are being extensively explored for the design of video compression technologies. My research addresses the problem of making the benefits of data-driven deep learning accessible to some key areas of video coding and compression-based video streaming.

First, this dissertation introduces a deep learning framework to speed up intra-mode encoding in the VP9 video codec. In VP9, block sizes are decided by a computationally intensive rate-distortion optimization (RDO) process that evaluates the combinatorially complex search space of possible partitions of 64 × 64 superblocks. We devised a learning-based alternative framework that predicts the intra-mode superblock partitions using a hierarchical fully convolutional network (H-FCN), which was experimentally shown to speed up the intra-mode encoding of the reference VP9 encoder. Subsequently, our work on deep learning based block motion estimation is presented. Block-based motion estimation is essential for performing inter-prediction in hybrid codecs, the mechanism responsible for the bulk of the compression capability they achieve. However, the prevalent block matching procedures used to compute block motion vectors (MVs) are computationally intensive, are prone to detecting spurious motion (a problem that worsens at smaller block sizes), and are agnostic to the perceptual quality of the predicted frames. To address these issues, we developed a composite block translation network (CBT-Net) that jointly predicts the MVs of blocks of multiple sizes by using the MVs predicted for larger blocks to guide the motion estimation of smaller blocks. Our framework produces more coherent motion fields at smaller block sizes than traditional block matching based MV estimation, and is also computationally efficient. Its rate-distortion performance gains are demonstrated for AV1 encoding.
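To make the cost of conventional block matching concrete, the following is a minimal sketch of exhaustive (full-search) block matching using the sum of absolute differences (SAD). It illustrates only the traditional baseline that the paragraph above contrasts with CBT-Net, not the dissertation's method. The function names (`sad`, `full_search_mv`, `shift_frame`), the SAD criterion, the synthetic random frame, and the convention that an MV points from the current block to its match in the reference frame are all assumptions chosen for illustration.

```python
import random


def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between the block of `cur` at (cx, cy)
    and the candidate block of `ref` at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search_mv(ref, cur, cx, cy, size, search_range):
    """Exhaustive block matching: test every displacement within
    +/- search_range and return (mv_x, mv_y, cost) with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = (0, 0, sad(ref, cur, cx, cy, cx, cy, size))  # zero-motion candidate
    for my in range(-search_range, search_range + 1):
        for mx in range(-search_range, search_range + 1):
            rx, ry = cx + mx, cy + my
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            if cost < best[2]:
                best = (mx, my, cost)
    return best


def shift_frame(frame, sx, sy):
    """Move frame content right by sx and down by sy, padding with zeros."""
    height, width = len(frame), len(frame[0])
    return [
        [frame[y - sy][x - sx]
         if 0 <= y - sy < height and 0 <= x - sx < width else 0
         for x in range(width)]
        for y in range(height)
    ]


# Synthetic 32 x 32 reference frame (illustrative data, not real video).
rng = random.Random(0)
ref = [[rng.randrange(256) for _ in range(32)] for _ in range(32)]
cur = shift_frame(ref, 2, 1)  # content moved 2 px right, 1 px down

mv = full_search_mv(ref, cur, cx=8, cy=8, size=8, search_range=4)
print(mv)  # (-2, -1, 0): the matching block lies 2 px left and 1 px up in ref
```

With a search range of ±R, each block requires up to (2R + 1)² SAD evaluations, which is the computational burden referred to above. Because each block is matched independently, small blocks with little texture can also lock onto spurious matches.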

The last part of this dissertation focuses on learning-based approaches to the design of compression-based adaptive video streaming. Adaptive video streaming relies on the construction of efficient bitrate ladders to deliver the best possible visual quality to viewers under bandwidth constraints. The traditional method of content-dependent bitrate ladder selection requires a video shot to be pre-encoded with multiple encoding parameters to find the optimal operating points given by the convex hull of the resulting rate-quality curves. However, this pre-encoding step incurs significant overhead in both computation and time. To reduce this overhead, we employed a recurrent convolutional network (RCN) that implicitly analyzes the spatiotemporal complexity of video shots in order to predict their convex hulls. The proposed RCN-Hull model substantially reduced the pre-encoding time needed for convex hull generation while closely approximating the optimal convex hulls. The competitive advantage of our method over existing ones based on heuristics or feature-based machine learning was also demonstrated. The deep learning frameworks introduced in this dissertation thus attest to the compelling advantages offered by deep learning based tools and techniques in driving the development and deployment of future video coding and streaming technologies.
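The traditional convex hull construction described above can be sketched as follows. This is a simplified illustration of the exhaustive pre-encoding target that RCN-Hull aims to approximate, not the RCN-Hull model itself. The function names, the (bitrate, quality) tuples, and the sample numbers are hypothetical; a practical system might also work in the log-bitrate domain or with a specific quality metric, which this sketch does not assume.

```python
def cross(o, a, b):
    """2-D cross product of vectors o->a and o->b (positive = left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def rate_quality_hull(points):
    """Return the upper convex hull of (bitrate, quality) points, sorted by
    bitrate and truncated at the highest quality reached."""
    # Keep only the best quality observed at each bitrate.
    best = {}
    for rate, quality in points:
        if rate not in best or quality > best[rate]:
            best[rate] = quality

    # Monotone-chain scan: drop any point lying on or below the chord
    # joining its neighbours, so the remaining curve is concave.
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    if not hull:
        return []

    # Spending more bits for lower quality is never useful: cut at the peak.
    peak = max(range(len(hull)), key=lambda i: hull[i][1])
    return hull[:peak + 1]


# Hypothetical pre-encoding results: (bitrate in kbps, quality score) for
# three resolutions of one shot. Illustrative numbers only.
encodes = {
    "540p": [(300, 60), (600, 72), (1200, 78), (2400, 80)],
    "720p": [(450, 64), (900, 74), (1800, 84), (3600, 91)],
    "1080p": [(600, 65), (1200, 80), (2400, 90), (4800, 94)],
}
all_points = [p for curve in encodes.values() for p in curve]

ladder = rate_quality_hull(all_points)
print(ladder)
# [(300, 60), (600, 72), (1200, 80), (2400, 90), (4800, 94)]
```

Every point in `all_points` corresponds to one full encode of the shot, which is why the exhaustive approach is expensive. A predictor that estimates the hull directly from the source video avoids most of those encodes.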

