Reverse-engineering visual quality through pixel-to-pixel deep neural networks

August 27, 2019 | Technical Articles

By Dr Yiannis Andreopoulos, CTO, iSize Technologies

The focus of iSize was originally on the intelligent live upscaling of video, and that’s still a very active area for us. We have done a lot of testing on live video upscaling from HD to 4K and beyond.

Our IP in this domain is now something you can put inside a system-on-chip (SoC) solution, with models and associated code that can be licensed to a third party, such as a device manufacturer. But, during our testing, we quickly hit upon something new: we saw that you can reverse-engineer the client-side video upscaling and intelligently process the input video content before it even hits the actual video encoder.

Before we discuss this, let’s take a quick step back to understand the context a bit better. When video is streamed over the internet today, a number of streaming and encoding recipes must be selected. While there are several technical terms and technologies for this, like DASH, HLS, CBR, VBR, QVBR, CABR, capped-CRF, etc., this process essentially boils down to the selection of a number of encoding resolutions, bitrates and encoding templates. The latter mainly control how the encoder allocates bits within the frames of each video segment over time. Therefore, the encoding and streaming process is bound to change the frequency content of the input video and introduce (ideally) imperceptible or (hopefully) controllable loss in return for bitrate savings.

This quality loss is measured with a range of quality metrics, from low-level signal-to-noise ratio metrics all the way to complex mixtures of expert metrics that capture higher-level elements of human visual attention and perception. One such metric that is now well recognised by the video community and the Video Quality Experts Group (VQEG) is Video Multi-method Assessment Fusion (VMAF), proposed by Netflix. There has been a lot of work in VMAF to make it a “self-interpretable” metric: values close to 100 (say 93 or higher) mean that the compressed content is visually indistinguishable from the original, while low values (say below 70) mean that the compressed content has significant loss of quality in comparison to the original. It has been reported that a difference of around 6 points in VMAF corresponds to the so-called Just-Noticeable Difference (JND), i.e. a quality difference that will be noticed by the viewer.
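To make the “self-interpretable” reading of VMAF concrete, here is a minimal Python sketch. The function names and the “intermediate” band are our own illustrative assumptions; the thresholds are simply the rough figures quoted above (93, 70 and a JND of about 6 points), not part of any official VMAF specification.

```python
# Rough cut-offs quoted in the text; illustrative only, not official VMAF values.
NEAR_TRANSPARENT = 93.0  # "say 93 or higher"
SIGNIFICANT_LOSS = 70.0  # "say below 70"
JND_POINTS = 6.0         # reported VMAF difference for one JND


def describe_vmaf(score: float) -> str:
    """Map a VMAF score (0 to 100) to the coarse bands described in the text."""
    if not 0.0 <= score <= 100.0:
        raise ValueError("VMAF scores lie in the range 0 to 100")
    if score >= NEAR_TRANSPARENT:
        return "visually indistinguishable"
    if score < SIGNIFICANT_LOSS:
        return "significant quality loss"
    # The text does not name this middle band; the label is an assumption.
    return "intermediate quality"


def is_noticeable(score_a: float, score_b: float) -> bool:
    """True when two encodes differ by at least one (approximate) JND."""
    return abs(score_a - score_b) >= JND_POINTS


print(describe_vmaf(95.0))
print(describe_vmaf(82.5))
print(is_noticeable(88.0, 81.0))
```

In practice, such a helper would sit on top of scores produced by an actual VMAF measurement tool; it only interprets numbers and does not compute VMAF itself.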

The process of encoding and decoding always requires the use of linear filters for the production of the decoded (and often upscaled) content that the viewer sees on their device. Alas, this tends to lead to uncontrolled quality fluctuation in video playback, or poor-quality video playback in general. We most often experience this when we happen to be in an area with poor 4G/WiFi signal strength, where the high-bitrate encoding of a 4K stream will quickly get switched to a much lower-bitrate/lower-resolution encoding, which our player will keep on upscaling to our monitor or TV resolution while we continue watching.

Therefore, if one wants to apply deep learning to this problem, the overall research question can be abstracted as: How can we optimally preprocess (or precode – in our nomenclature) the pixel stream of the video into a (typically) smaller pixel stream, in order to make standards-based encoders as efficient (and fast) as possible? We are especially interested in this question given that:

  1. the client device can upscale the content with its existing linear filters, and
  2. perceptual quality is now measured with the latest perceptual quality metrics from the literature, e.g., VMAF or similar metrics.
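A toy example helps to show why the choice of downscaled pixel stream matters when the client only has linear filters. The sketch below is our own illustration, not iSize code: it works on a single row of pixels, uses a simple box filter for downscaling and linear interpolation for upscaling, and uses PSNR as a stand-in for a perceptual metric such as VMAF. All function names are hypothetical.

```python
import math


def downscale_2x(row):
    """Halve a row of pixels by averaging neighbouring pairs (a box filter)."""
    return [(row[i] + row[i + 1]) / 2.0 for i in range(0, len(row) - 1, 2)]


def upscale_2x_linear(row):
    """Double a row of pixels with linear interpolation.

    A simple stand-in for the linear upscaling filter of a client device.
    """
    out = []
    for i, value in enumerate(row):
        nxt = row[i + 1] if i + 1 < len(row) else value  # repeat the last pixel
        out.append(value)
        out.append((value + nxt) / 2.0)
    return out


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally long pixel rows."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


smooth = [float(10 * i) for i in range(16)]  # gentle ramp
detailed = [0.0, 255.0] * 8                  # alternating fine detail

for name, row in (("smooth", smooth), ("detailed", detailed)):
    round_trip = upscale_2x_linear(downscale_2x(row))
    print(name, round(psnr(row, round_trip), 2), "dB")
```

The smooth ramp survives the round trip far better than the alternating pattern, whose detail is averaged away before the client ever sees it. Precoding asks which smaller pixel stream should be sent so that the content that matters perceptually survives this kind of round trip.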

At iSize, we took that on as a problem to solve.

First, we identified that, currently, one can approach this in three distinct ways.

The first type of approach consists of solutions attempting device-based enhancement, i.e. advancing the state-of-the-art in intelligent video upscaling at the video player when the content has been “crudely” downscaled using a linear filter like the bicubic or variants of the Lanczos or other polyphase filters. Quite a few of these products are already on the market, including SoC solutions embedded within the latest 8K televisions. While there are certain promising advances in this domain (including some that we’ve seen with our own upscaling solutions), this category of solutions is limited by the stringent complexity constraints and power consumption limitations of consumer electronics. In addition, since the content received at the client is already distorted by the compression (quite often severely so), there are theoretical limits to the level of picture detail that can be recovered by client-side upscaling.

A second family of approaches consists of companies developing their own bespoke image and video encoders, typically based on deep neural networks. This deviates from encoding, stream-packaging and stream-transport standards and creates bespoke formats, requiring bespoke transport mechanisms and bespoke decoders in the client devices. This is a risky proposition for mainstream video encoding services. In addition, the 50+ years of video encoding have already done their bit to squeeze enough coding gain out of most situations, thereby making the current state-of-the-art in spatio-temporal prediction and encoding very difficult to outperform with neural-network solutions that are designed from scratch and learn from data.

The third family of methods comprises perceptual optimisation of existing standards-based encoders by using perceptual metrics during encoding, and there are quite a few companies already doing that. Here, the challenges are that:

  1. the required tuning is severely constrained by the need for compliance with the utilised standard
  2. many of the proposed solutions tend to be limited to focus-of-attention models or shallow learning methods, e.g., assuming that the human gaze is focusing on particular areas of the frame (for instance, in a conversational video we tend to look at the speaker(s), not the background)
  3. such methods tend to require multiple encoding passes, thereby increasing complexity.

Because of these issues, all such designs are very tightly coupled to the specific encoder implementation. Therefore, redesigning them for a new encoder and/or new standard, e.g., from HEVC to VP9 encoding, requires very substantial effort.

Of these three camps, iSize is perhaps closest to the third, but we deviate in some important ways. Perhaps most importantly, we apply our processing before the pixel content hits the video encoder, via a single pass through each video frame, in a framework that we call deep video precoding. Effectively, we use deep neural networks to optimally preprocess, and potentially shrink, the pixel stream of each frame into another pixel stream that, if encoded by a standards-based encoder, will lead to higher perceptual quality for the same bitrate, or significantly lower bitrate for the same perceptual quality. As shown in part in our first preprint, we achieve this by training our deep neural networks with a so-called “loss” function that minimises the quality loss estimated by perceptual metrics like VMAF or DeepQA, while taking into consideration the typical linear upscaling of client devices and approximations of the frame entropy, i.e., the number of bits one would need to encode the input to very high fidelity.

This is done without changing any global contrast, brightness, histogram or colour properties of the content, and it is not a generative approach. That is, there is no aesthetic change or addition of “new” content in the input video; the only aim is to ensure that the perceptually important aspects of each individual video frame are protected as well as possible from the subsequent signal losses imposed by the video encoder, device upscaling (if applicable) and perceptual appreciation by the viewer. That last part is important because human viewers also miss a lot of the fine-grained details when watching video, and these imperceptible details quite often waste massive amounts of bits that could have been spent on the perceptually important aspects of each video frame.
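The shape of such a training objective can be sketched in a few lines of Python. This is a deliberately simplified illustration under our own assumptions, not the loss used in our models: mean squared error stands in for a perceptual metric like VMAF or DeepQA, the entropy of rounded neighbour differences stands in for the frame-entropy approximation, linear interpolation stands in for the client upscaler, and `rate_weight` and all function names are hypothetical. The code only evaluates the objective for candidate precoded rows; it does not train a network.

```python
import math
from collections import Counter


def upscale_2x_linear(row):
    """Double a row of pixels with linear interpolation (client-side stand-in)."""
    out = []
    for i, value in enumerate(row):
        nxt = row[i + 1] if i + 1 < len(row) else value  # repeat the last pixel
        out.append(value)
        out.append((value + nxt) / 2.0)
    return out


def distortion(original, reconstructed):
    """Mean squared error; an assumed stand-in for a perceptual quality loss."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def entropy_bits_per_pixel(row):
    """Empirical entropy of rounded neighbour differences; a crude rate proxy."""
    diffs = [round(b - a) for a, b in zip(row, row[1:])]
    total = len(diffs)
    counts = Counter(diffs)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def precoding_loss(original, precoded, rate_weight=1.0):
    """Quality loss after client upscaling plus a weighted rate estimate."""
    reconstructed = upscale_2x_linear(precoded)
    return distortion(original, reconstructed) + rate_weight * entropy_bits_per_pixel(precoded)


original = [float(10 * i) for i in range(16)]
candidate_a = [5.0 + 20.0 * i for i in range(8)]  # regular, cheap to encode
candidate_b = [5.0, 30.0, 40.0, 70.0, 80.0, 110.0, 120.0, 145.0]  # irregular

loss_a = precoding_loss(original, candidate_a)
loss_b = precoding_loss(original, candidate_b)
print(loss_a, loss_b)
```

Here the regular candidate scores better on both terms: it reconstructs the original more faithfully after linear upscaling and its differences are cheaper to code. A trained precoder searches for this kind of trade-off automatically, with differentiable approximations in place of the toy terms used here.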

Once our models are trained and deployed, they are effectively pixel-to-pixel deep processing machines that ingest each frame and produce its precoded version in a single pass, at resolutions ranging from full resolution all the way down to significantly downscaled content. Some examples can be found on our portfolio page.

We believe this approach has several advantages. For a start, it introduces deep learning into the video encoding realm without breaking standards or video playback software and devices. This happens because our deep precoder’s output is always a video frame that gets processed in the usual way.

Secondly, our neural network training essentially leverages the vast amount of important work done in the last twenty-five years on perceptual video quality assessment. One can think of our deep precoding engines as reverse-engineering perceptual metrics in order to make the overall encoding work as effectively as possible.

Naturally, we are very much aware that human visual perception is not fully encapsulated by the current low-level and high-level perceptual metrics. Importantly, we are not claiming to understand human visual perception beyond what is currently accepted for our utilised visual quality metrics. However, our framework can accept future incarnations of these metrics as they mature even further.

In fact, we believe there is a mutual benefit between precoding and perceptual quality assessment as “adversarial” approaches in this domain: the more we can reverse-engineer perceptual metrics in our precoding, the more we hope to motivate the VQEG community to improve upon them, so that future versions of our precoders can take advantage of the improved metrics.

Similarly, our approach does not invalidate advances in video encoding, or indeed device-level upscaling; we are only making these processes better and faster to run (thereby saving power consumption and cost), while anticipating further advances in client-side upscaling in the next five to ten years.

Finally, we believe that precoding and future deep-learning-oriented video encoding may fuse into interesting mixtures in the next ten years. But we’d like to think that deep video precoding is a simple-yet-powerful entry point for deep learning to begin to seriously influence the state of affairs in video delivery today.

As a pre-processing engine, deep video precoding can even help in extracting extremely meaningful semantics about what is being streamed at any given moment. This by itself has many interesting applications. But that is another story.