Escaping The Complexity-Bitrate-Quality Barriers Of Video Encoders: Our Results With Versatile Video Coding (VVC)

iSize Technical Articles

By Dr. Vasileios Giotsas and Dr. Yiannis Andreopoulos, iSize Technologies


As discussed in our two previous articles, the mission behind our precoding technology is to develop deep perceptual preprocessors that leverage the power of data, loss function design in machine learning, and advanced deep neural network designs, in order to increase the perceptual quality of existing and upcoming video coding standards without requiring any change in the encoding or decoding process.

One may ask, why is this important when current and upcoming video coding standards (like AOMedia VP9/AV1/AV2 and MPEG/ITU-T HEVC/VVC) are poised to allow for increased visual quality at even lower bitrates than the ones we use today?

More broadly, why is bitrate in video still so important when wireless bandwidth is increasing with 5G and beyond?

To take the last question first: while bitrates under nominal device and wireless configurations are indeed increasing over time, so are the number of network users and their demand for content. This growth is increasingly asymmetric: industry experts expect more users and content bits per wireless transmit/receive channel and timeslot than ever before, especially as linear TV broadcast is being abandoned in favor of web-based streaming services. This is also part of the reason why, even with a fast broadband connection to the ISP, independent testing has shown that typical video streaming services will not use it to stream at higher bitrates. Another reason for lower-than-expected video streaming bitrates is that network caches and wireless transceivers on devices operate in conditions that are far from “nominal”. This occurs because of too many co-located devices, network connections that fluctuate very rapidly, or simply WiFi network and device misconfigurations, which happen more often than we think.

As for the first question, it is certainly true that better video encoders and decoders are being developed: the important work behind royalty-free and royalty-based standards is one of the pillars for balancing the asymmetric growth of demand versus supply of wireless bits. However, there are significant concerns about the complexity of current and upcoming video codecs and their energy and cost footprint in consumer devices and datacenters. This was predicted more than 15 years ago, e.g., see the coding gain vs. complexity graph at the top of the page, taken from the classic 2005 article of Sikora in the Proceedings of the IEEE.

Thus, wireless bandwidth will remain one of the most precious resources, and the delivery of high-quality video in livestreaming or video-on-demand scenarios will continue to clog datacenters and wireless connections. This impasse in complexity-bitrate-quality is a key motivation behind our developments at iSIZE.

To put our technology under another stress test, we decided to rerun our previously reported MTurk validation of our deep perceptual precoder framework, this time using the latest VVC Test Model (VTM). In a nutshell, we used the Amazon Mechanical Turk service to ask independent MTurk workers from around the world to evaluate full HD video encoded with VVC with and without our deep perceptual optimizer. We also compared an older and less-performing codec, MPEG/ITU-T HEVC/H.265, used in conjunction with our precoder, against stand-alone VVC. The aim of that last test is to see if current standards can be boosted to the quality-bitrate levels expected of VVC, without requiring the encoding and decoding complexity of the upcoming standard, or indeed a hardware upgrade on the server and client side. In the discussion that follows, we present our results under the same measurement setup as our previous test.

Recap Of Experimental Setup

To evaluate whether our approach brings perceptual quality improvement at the same bitrate, or comparable quality at 40% lower bitrate than the VTM encoder, we asked users to watch two VVC encodings of the same video in split screen and tell us which one (if any) they prefer: one of them is the VVC encoding after processing the original video with the iSize precoder, while the other is the original video encoded with the same VVC encoder (no iSize precoding). Each test corresponded to one of the following bitrate combinations:

  • iSize+VVC at 1.8mbps, VVC at 1.8mbps –> can we offer noticeable quality improvement for FHD video encoded at low bitrates?
  • iSize+VVC at 1.8mbps, VVC at 3.0mbps –> can we achieve comparable or superior quality when offering 40% saving for medium-bitrate encoding? Notice that 3mbps may be “low bitrate” by usual VoD standards, but VVC is expected to be able to achieve reasonably high quality there.

We also carried out an additional test:

  • iSize+HEVC at 3.0mbps, VVC at 3.0mbps –> can we achieve comparable or superior quality to VVC when using a less-performing encoder that is much faster and readily available?

The HEVC settings correspond to the “slow” preset of the x265 encoder and a video buffer verifier (VBV) encoding recipe allowing for content-adaptive variable bitrate (VBR) encoding (max rate set to 3.0mbps and CRF parameter set to 19). The VVC settings are as provided in the latest VTM software (at the time of this writing we used version 6.2rc1). We enabled the default rate control (with 1.8mbps and 3.0mbps as targets) and set IntraPeriod=64 frames.
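As a rough illustration of the HEVC recipe described above, the sketch below assembles an x265 command line with the “slow” preset, CRF 19 and a 3.0mbps VBV cap. The file names and the helper `x265_vbv_command` are hypothetical, and the VBV buffer size is an assumption: the article states only the max rate, so twice the max rate is used purely as a placeholder.

```python
import shlex


def x265_vbv_command(src, dst, max_rate_kbps=3000, crf=19,
                     preset="slow", bufsize_kbps=None):
    """Build an x265 argument list for CRF encoding capped by VBV."""
    if bufsize_kbps is None:
        # Assumption: the buffer size is not given in the article;
        # 2x the max rate is only a placeholder value.
        bufsize_kbps = 2 * max_rate_kbps
    return [
        "x265",
        "--preset", preset,
        "--crf", str(crf),
        "--vbv-maxrate", str(max_rate_kbps),   # in kbps
        "--vbv-bufsize", str(bufsize_kbps),    # in kbits
        "--input", src,
        "--output", dst,
    ]


# Hypothetical file names, for illustration only.
cmd = x265_vbv_command("input.y4m", "output.hevc")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run` on a machine where x265 is installed. This is not necessarily the exact command used for the tests reported here.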

Figure 1: User interface of visual quality comparison test


Figure 1 shows the user interface of the test. After watching the two videos playing in parallel, users were asked if they preferred the visual quality of the left side, the right side, or if they had no preference, using the buttons below the video player. The video playback started automatically in full-screen and users were able to pause and seek if they wanted to inspect individual video frames more carefully. Since the playback of both video encodings was concurrent, buffering time due to bitrate difference was not a factor in user preferences. Our previous article describes all the checks we use to ensure that our visual quality scoring setup is robust.


For each title, we collected 240 valid measurements from MTurk workers, i.e., 80 measurements for each of the three bitrate combinations. Figure 2 illustrates the results for each video, while Figure 3 shows the corresponding VMAF-bitrate plots. Table 1 lists the exact scores, including the VMAF difference for each bitrate combination. VMAF is reported as the mean value for each sequence at each bitrate, but very similar results (and similar ΔVMAF) have been obtained when using the harmonic mean for VMAF.
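To make the difference between the two pooling options concrete, here is a minimal sketch that pools per-frame VMAF scores with the arithmetic mean and with a harmonic mean. The per-frame values are hypothetical, and the offset of 1 in the harmonic mean (to avoid dividing by zero on a frame that scores 0) is an assumption about how such pooling is commonly implemented, not a statement of the exact formula used for our plots.

```python
from statistics import fmean


def pool_mean(scores):
    """Arithmetic-mean pooling of per-frame scores."""
    return fmean(scores)


def pool_harmonic(scores, offset=1.0):
    """Harmonic-mean pooling; the offset guards against zero scores."""
    return len(scores) / sum(1.0 / (s + offset) for s in scores) - offset


# Hypothetical per-frame VMAF scores with one poor-quality outlier frame.
frames = [92.0, 90.0, 91.0, 40.0, 93.0]

print(f"mean pooling:     {pool_mean(frames):.2f}")
print(f"harmonic pooling: {pool_harmonic(frames):.2f}")
```

The harmonic mean penalizes low-scoring outlier frames more heavily than the arithmetic mean, so similar ΔVMAF under both poolings suggests that the compared encodings have similar outlier behaviour.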

Figure 2: User preference (%) per video title. Note that VVC significantly exceeded the bitrate constraints of 1800kbps and 3000kbps for video 3 (crowd run).

Figure 3: Mean VMAF-bitrate plots per video title. Note that VVC significantly exceeded the bitrate constraints of 1800kbps and 3000kbps for video 3 (crowd run).

Table 1: User preference (%) per video title and bitrate combination, and VMAF difference (ΔVMAF = VMAF(iSize+HEVC/VVC) – VMAF(VVC))

Discussion and Conclusions

The results show that for every video and for every bitrate combination except one, viewers had a strong preference for the iSize+VVC results. In more detail, with the exception of the “crowd run” sequence, we observe from the results that:

  • For the 1.8mbps/1.8mbps (same bitrate comparison), over 60% of the viewers preferred the iSize+VVC version and less than 20% preferred the VVC version (i.e., more than 3:1 ratio).
  • For the 1.8mbps/3.0mbps (40% saving for iSize+VVC), over 58% of the viewers preferred the iSize+VVC version and less than 24% preferred the VVC version (more than 2.4:1 ratio).
  • For the iSize+HEVC at 3mbps versus VVC at 3mbps, over 56% preferred the former and 21% or less preferred the VVC version (more than 2.6:1 ratio).
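The ratios quoted in the three points above follow directly from the percentage bounds. A minimal sketch of the arithmetic is given below; the function name `preference_ratio` is ours, and the inputs are the bounds quoted in the bullets (not the exact Table 1 values), so the printed ratios are lower bounds.

```python
def preference_ratio(pct_precoded, pct_baseline):
    """Ratio of raters preferring the precoded version to those preferring the baseline."""
    if pct_baseline <= 0:
        raise ValueError("baseline preference must be positive")
    return pct_precoded / pct_baseline


# (percent preferring iSize-precoded, percent preferring stand-alone VVC),
# using the bounds quoted in the bullets above.
bounds = {
    "iSize+VVC 1.8mbps vs VVC 1.8mbps": (60, 20),
    "iSize+VVC 1.8mbps vs VVC 3.0mbps": (58, 24),
    "iSize+HEVC 3.0mbps vs VVC 3.0mbps": (56, 21),
}

for label, (precoded, baseline) in bounds.items():
    print(f"{label}: at least {preference_ratio(precoded, baseline):.2f}:1")
```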

The first two points show that the quality-bitrate benefits of our deep perceptual optimizer are applicable even when considering the frontier in video coding, i.e., even the latest version of the VVC test model can benefit from our approach in order to offer 40% additional bitrate saving, or to significantly improve perceptual quality at the same bitrate.

At the same time, the third point shows that our approach can be used in conjunction with an older and less-performing standard like HEVC and still outperform the latest version of the VVC test code in terms of A|B MOS testing and VMAF. This is important, especially given that VVC VTM-based encoding for the above examples was found to be 50000x slower than real time, while x265 encoding for HEVC was only 2x slower than real time (under the same CPU and cloud instance setup). Thus, even under the expectation that the VTM software can be accelerated by up to 100-fold versus its current runtime (through algorithmic accelerations that will have minimal impact on quality), this still leaves a slow-down factor of more than 200x versus x265’s HEVC.

On the other hand, for our test setup and the content under consideration, our approach was shown to perceptually outperform VVC even when using x265’s HEVC implementation as the underlying codec. The iSIZE precoder runtime was only 4x to 6x slower than real time, and this is an approach that is readily available today, without having to wait for (and fund!) the tremendous amount of time and effort needed to bring the potential VVC accelerations to practical hardware. This means that our framework can bridge the complexity-bitrate-quality barrier of video encoding and allow older (and faster) encoders (and older decoders in client devices!) to achieve the perceptual quality of the future incarnation of the VVC standard today, with no changes in the actual encoding, packaging, delivery and client infrastructures.
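The slow-down estimate above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the 100-fold acceleration is the optimistic expectation stated there, not a measured value.

```python
# Figures quoted in the text: slow-down relative to real time.
vtm_slowdown = 50_000    # VTM encoding, as measured on our setup
x265_slowdown = 2        # x265 HEVC encoding, same CPU and cloud instance
assumed_speedup = 100    # optimistic expectation for future VTM acceleration

# Slow-down of a hypothetically accelerated VTM versus real time.
accelerated_vtm = vtm_slowdown / assumed_speedup

# Remaining gap between the accelerated VTM and x265.
gap_vs_x265 = accelerated_vtm / x265_slowdown

print(f"accelerated VTM: {accelerated_vtm:.0f}x slower than real time")
print(f"gap versus x265: {gap_vs_x265:.0f}x")
```

The resulting gap of 250x is consistent with the “more than 200x” figure stated above.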

It is interesting to examine in more detail the one sequence that forms the exception to the above observations: “crowd run”. The VMAF-bitrate plot of Figure 3 shows that VVC encoding overshot the bitrate setting for this case by more than 65%. This means that the comparison of iSize+HEVC at 3.0mbps vs. VVC at 3.0mbps is actually not carried out at the same bitrate: the VVC result is at a substantially higher bitrate (around 5mbps, as shown by the rightmost point of the VVC line). Even so, as shown by Figure 3, VVC does offer higher VMAF for the point that is close to 3mbps, albeit only by approx. 2.1 points. The comparison under the same codec and bitrate (iSize+VVC at 1.8mbps vs. VVC at 1.8mbps) shows that our approach is still found to be superior. However, at 40% saving, our result is found to be inferior to VVC for this sequence. Importantly, the ΔVMAF agrees with the MOS ratings provided by the MTurk raters, which further illustrates that VMAF is a predictive metric for reference-based perceptual quality “in the wild”.
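For reference, the overshoot and saving percentages used in this article are simple relative differences. The sketch below reproduces them; the 5mbps figure is the approximate value read off Figure 3 as quoted above, so the overshoot is itself approximate, and the helper name `relative_change` is ours.

```python
def relative_change(actual, reference):
    """Fractional change of `actual` with respect to `reference`."""
    return (actual - reference) / reference


# VVC rate control target vs. approximate achieved bitrate for "crowd run" (mbps).
overshoot = relative_change(5.0, 3.0)

# Bitrate saving of 1.8mbps with respect to 3.0mbps.
saving = -relative_change(1.8, 3.0)

print(f"overshoot: {overshoot:.1%}")
print(f"saving:    {saving:.1%}")
```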

On average, the results of Table 1 show that

  • there is more than 5:1 preference for iSize+VVC vs. VVC in the equal-bitrate comparison (67% vs. 13% at 1.8mbps), which matches what is expected for an average ΔVMAF=10.6 that is above the Just Noticeable Difference (JND) threshold of VMAF, seen to be around 5 to 6 points.
  • despite the 40% reduction in bitrate, approximately 2.4 times more MTurk raters preferred iSize+VVC at 1.8mbps vs. VVC at 3mbps (58% vs. 24%, with another 18% saying they “look about the same”), with ΔVMAF=4.6 being close to the JND (that has been reported to be 6 points).
  • despite the use of an inferior encoder, approximately 2.2 times more MTurk raters preferred iSize+HEVC at 3mbps vs. VVC at 3mbps (58% vs. 26%), with ΔVMAF=5.6 being close to the JND.

The ΔVMAF values tend to be correlated with the preference ratio of human raters: as ΔVMAF increases, so does the ratio of raters that prefer iSize+VVC/HEVC over VVC. We have found that this also depends on the absolute VMAF values: when approaching VMAF=100 (as in the “aspen” and “old town” sequences), this correlation becomes less evident. We note that similar ΔVMAF results have been obtained when using the harmonic mean for VMAF, which shows that all encodings under consideration produce similar outlier behaviours.
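As a rough illustration of this trend, the sketch below computes the Pearson correlation between ΔVMAF and the preference ratio, using only the three average values quoted from Table 1 above. With just three data points this is an illustration of the calculation and not statistical evidence; the per-title values of Table 1 would be needed for a meaningful analysis.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# (average ΔVMAF, % preferring iSize-precoded, % preferring VVC), from the text.
averages = [(10.6, 67, 13), (4.6, 58, 24), (5.6, 58, 26)]

delta_vmaf = [d for d, _, _ in averages]
ratios = [precoded / baseline for _, precoded, baseline in averages]

r = pearson(delta_vmaf, ratios)
print(f"Pearson r over {len(averages)} averages: {r:.2f}")
```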

As with our previous experiments, we acknowledge that the number of crowdsourced scores and the number of videos used for this assessment cannot provide strong statistical significance. However, as noted in our previous article, we ensured visual scoring was protected from spurious raters or rating conditions, and that the selected sequences were diverse enough to encapsulate a wide variety of input scenes and motion/noise patterns.

To allow for further independent assessment, we have made the sequences and left-right viewing available on our portfolio page, and we encourage you to have a look. As before, we provide the full bitstreams of each case (iSize+VVC, VVC and iSize+HEVC).

In an effort to encourage further independent testing, our platform now allows for the use of the “Precoder” and “R&D” options (the latter is limited to 10min videos and produces a QP=0 encoding to avoid any quantization-induced distortion). These produce the iSize perceptual precoder pixel output (packaged as very-high-bitrate/lossless H.264/AVC in the mp4 container) that can then be encoded with any third-party encoder. We encourage you to have a look and send us your thoughts and comments.