Deep perceptual optimization for video encoding

This blog post is based on a presentation given by Dr Yiannis Andreopoulos of iSIZE as part of the Stanford Compression Workshop 2021.

Video now makes up the vast bulk of internet traffic, in terms of bandwidth. Consequently, in the continuing search for efficiency and capacity, there is much interest in the perceptual optimization of video: the processing of digital video streams such that they deliver the quality users expect at the minimum bandwidth.

Digital video relies on compression, which is computationally intensive. To make a codec more efficient, that is, to deliver high-quality content at lower bandwidth, one must inevitably use more sophisticated and much more complex codecs. However, Moore’s Law and cloud-based scaling have both hit a wall. Even if more GPUs and CPUs are made available to encode video, so much content is being produced and watched that demand very quickly outstrips the compute cycles available.

Our approach is to reduce the bandwidth needed for high-quality video streaming through a process that uses machine learning to reduce the bits required for elements of the image that perceptual metrics tell us are not important to human viewers. It is clear that finding trade-offs between the various metrics, between bitrate and perception, and managing processing and encoding complexity is a challenge.

The iSIZE proposal is a server-side enhancement that applies across codecs. Because our technology sits before the encoder, it does not depend on a specific codec, and it optimizes both for low-level metrics like SSIM (the structural similarity index measure) and for higher-level, more perceptually oriented metrics like VMAF (Video Multimethod Assessment Fusion). Because it does not break coding standards, it can be used in existing distribution chains and with existing client devices.
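To make the idea of a low-level, reference-based metric concrete, here is a minimal sketch of the SSIM formula applied once over a whole block of luma samples. This is a simplification for illustration: practical SSIM implementations average the same expression over many small local windows, and the function name `global_ssim` and the flat-list frame representation are assumptions of this sketch, not part of the iSIZE system.

```python
from statistics import fmean


def global_ssim(reference, distorted, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of luma samples.

    Simplified on purpose: real SSIM averages this quantity over local windows.
    """
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")

    # Standard stabilizing constants from the SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    mu_x = fmean(reference)
    mu_y = fmean(distorted)
    var_x = fmean([(a - mu_x) ** 2 for a in reference])
    var_y = fmean([(b - mu_y) ** 2 for b in distorted])
    cov_xy = fmean([(a - mu_x) * (b - mu_y) for a, b in zip(reference, distorted)])

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    source = [52, 55, 61, 66, 70, 61, 64, 73]
    decoded = [50, 58, 60, 70, 68, 65, 60, 75]
    print(f"SSIM(source, source)  = {global_ssim(source, source):.4f}")
    print(f"SSIM(source, decoded) = {global_ssim(source, decoded):.4f}")
```

An identical pair scores 1, and any difference between the source and the decoded samples pulls the score below 1, which is the property that makes the metric usable as a fidelity measure against the original.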

We call our pre-processing a deep perceptual optimizer (DPO) because it applies a deep neural network to individual frames in order to optimize the perceptual quality of the subsequent encoding.

DPO is trained offline with large volumes of content and a virtualized model of an encoder that incorporates the effects of inter- or intra-frame prediction, transform and quantization, and entropy encoding in learnable functions. This emulation of a practical encoder means that we can ‘teach’ the pre-processing network how typical encoders will distort the incoming pixel stream at typical encoding bitrates. At the same time, we can get a rate estimate for a range of quality levels. This trains DPO to minimize the expected bitrate of an encoder when encoding the DPO-processed content, while at the same time maximizing the perceptual quality of the encoder’s output.
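The training objective described above has the general shape of a rate-distortion trade-off. The sketch below illustrates that shape only, under loud assumptions: the "virtual encoder" is reduced to plain uniform quantization, the rate estimate is the empirical entropy of the quantized symbols, and mean squared error stands in for the perceptual models. None of these stand-ins is the actual iSIZE encoder model or loss, and the names `proxy_encode`, `rd_loss` and the weight `lam` are hypothetical.

```python
import math
from collections import Counter


def proxy_encode(samples, step):
    """Toy stand-in for an encoder: uniform quantization with the given step."""
    symbols = [round(s / step) for s in samples]
    reconstruction = [q * step for q in symbols]
    return symbols, reconstruction


def bits_per_sample(symbols):
    """Empirical entropy of the symbols, used here as a crude rate estimate."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def rd_loss(original, processed, step, lam):
    """Rate + lam * distortion, with distortion measured against the ORIGINAL.

    `processed` is what the pre-processor hands to the (proxy) encoder.
    """
    symbols, reconstruction = proxy_encode(processed, step)
    rate = bits_per_sample(symbols)
    distortion = mse(original, reconstruction)
    return rate + lam * distortion, rate, distortion


if __name__ == "__main__":
    frame = [10, 12, 11, 50, 52, 51, 90, 91]
    for step in (1, 16):
        loss, rate, dist = rd_loss(frame, frame, step, lam=0.01)
        print(f"step={step:2d} rate={rate:.2f} b/sample mse={dist:.2f} loss={loss:.3f}")
```

In an actual training loop the quantization and rate estimate would need to be differentiable so that gradients can reach the pre-processing network; this sketch only shows how a coarser encode lowers the rate term while raising the distortion term.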

We estimate perceptual quality via a number of perceptual models that are based on established metrics that compare original and compressed frames. Using reference-based metrics ensures our pre-processing will not deviate from the source aesthetics. It also helps us understand where perceptual quality metrics activate, so we can develop our own updates to the metric methodology, thereby improving the system even further.

In a practical deployment, DPO sits just before a standard encoder with no change in the workflow of encoding, bitstream packaging, transport, decoding and playback on client devices. Importantly, there is no need for DPO to have access to (or change) the settings of the encoder, or even know what encoding standard is being used.
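The deployment pattern can be sketched as a frame-in, frame-out stage placed ahead of an unchanged encoder call. Everything below is a placeholder chosen for illustration: `toy_preprocess` is a simple smoothing filter standing in for the trained network (it is not what DPO does), `toy_encoder` is a dummy standing in for a standard encoder, and the flat-list `Frame` type is hypothetical.

```python
from typing import Callable, Iterable, Iterator, List

Frame = List[int]  # hypothetical: one frame as a flat list of 8-bit luma samples


def preprocess_stream(frames: Iterable[Frame],
                      preprocess: Callable[[Frame], Frame]) -> Iterator[Frame]:
    """Apply a single-frame pre-processor to each frame, one pass, in order."""
    for frame in frames:
        yield preprocess(frame)


def toy_preprocess(frame: Frame) -> Frame:
    """Placeholder for the trained network: 3-tap smoothing with edge clamping."""
    n = len(frame)
    return [
        (frame[max(i - 1, 0)] + 2 * frame[i] + frame[min(i + 1, n - 1)] + 2) // 4
        for i in range(n)
    ]


def toy_encoder(frames: Iterable[Frame], quantizer_step: int = 8) -> List[int]:
    """Placeholder for a standard encoder: distinct quantized levels per frame."""
    return [len({sample // quantizer_step for sample in frame}) for frame in frames]


if __name__ == "__main__":
    source = [[0, 255, 0, 255], [16, 18, 20, 22]]
    # The encoder call and its settings are identical in both cases;
    # only the frames handed to it differ.
    baseline = toy_encoder(source, quantizer_step=8)
    optimized = toy_encoder(preprocess_stream(source, toy_preprocess), quantizer_step=8)
    print(baseline, optimized)
```

The point of the design is visible in the last two calls: the pre-processor never reads or alters the encoder's settings, so it can be added to or removed from the chain without touching anything downstream.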

The single-pass nature and decoupling from specific coding standards allow for easy deployment on custom hardware or high-performance CPU/GPU clusters. For instance, DPO runs in real time (for 1080p/60 content) on mainstream CPUs in use within data centers, such as the Intel Xeon Platinum 8259CL with 12 cores under Intel’s OpenVINO framework. Alternatively, real-time operation can be obtained on NVIDIA Tesla T4 GPUs using OpenCV with cuDNN.
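It helps to spell out what real time means for 1080p/60 content: each frame must be processed in at most 1000 / 60, or roughly 16.7 ms, and the pre-processor must sustain about 124 million pixels per second. The short sketch below does that arithmetic; the function names are illustrative, and the measured per-frame times passed to `is_realtime` are made-up inputs, not benchmark results.

```python
def realtime_budget(width, height, fps):
    """Return (per-frame time budget in ms, pixels to process per second)."""
    frame_budget_ms = 1000.0 / fps
    pixels_per_second = width * height * fps
    return frame_budget_ms, pixels_per_second


def is_realtime(measured_ms_per_frame, fps):
    """True if the measured per-frame processing time fits the frame interval."""
    return measured_ms_per_frame <= 1000.0 / fps


if __name__ == "__main__":
    budget_ms, pixel_rate = realtime_budget(1920, 1080, 60)
    print(f"budget per frame: {budget_ms:.2f} ms")
    print(f"pixel throughput: {pixel_rate:,} pixels/s")
    print("15 ms/frame is real time:", is_realtime(15.0, 60))
    print("20 ms/frame is real time:", is_realtime(20.0, 60))
```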

What does the DPO achieve? It delivers significant savings in two directions. First, it reduces the bitrate required from a standard codec to deliver a certain quality level. Second, and perhaps more significant, it reduces the complexity – the number of processor cycles – of the encoder to deliver that quality.
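A bitrate saving "at a certain quality level" can be quantified by comparing two rate-quality curves at the same quality score. The sketch below shows one simple way to do that, assuming each curve is a list of (bitrate, quality) points with quality strictly increasing with bitrate, and interpolating in the log-bitrate domain. The helper names and the sample points are synthetic and purely illustrative; they are not iSIZE measurements.

```python
import math


def bitrate_at_quality(curve, target_quality):
    """Interpolate the bitrate needed to reach target_quality.

    curve: list of (bitrate_kbps, quality) points; quality must rise
    strictly with bitrate. Interpolation is linear in log(bitrate).
    """
    points = sorted(curve)
    for (r0, q0), (r1, q1) in zip(points, points[1:]):
        if q0 <= target_quality <= q1:
            t = (target_quality - q0) / (q1 - q0)
            return math.exp(math.log(r0) + t * (math.log(r1) - math.log(r0)))
    raise ValueError("target quality lies outside the measured range")


def bitrate_saving_percent(baseline, optimized, target_quality):
    """Percentage of bitrate saved by `optimized` at equal quality."""
    r_base = bitrate_at_quality(baseline, target_quality)
    r_opt = bitrate_at_quality(optimized, target_quality)
    return 100.0 * (r_base - r_opt) / r_base


if __name__ == "__main__":
    # Synthetic points for illustration only, not measured results.
    encoder_only = [(1000, 80.0), (2000, 90.0), (4000, 95.0)]
    with_preprocessing = [(800, 80.0), (1600, 90.0), (3200, 95.0)]
    saving = bitrate_saving_percent(encoder_only, with_preprocessing, 85.0)
    print(f"saving at quality 85: {saving:.1f}%")
```

Evaluating the saving at several target qualities, rather than at one, gives a fairer picture, since the gap between two curves usually varies along the quality range.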

Overall, DPO improves scores on multiple state-of-the-art quality metrics across multiple video encoding standards. We believe we can go further, since the gains from our approach compound with those of any encoder-specific perceptual quality optimization: a real, measurable, significant saving in bitrate without impacting visual quality.