The traditional means of optimising video streaming workflows have run their course. Future advances will be made in software automated by AI.
Online video providers have never been under so much pressure. Excess demand has caused Netflix, YouTube and Disney+ to tune down their bitrates and ease bandwidth consumption for everyone, in the process deliberately compromising the ultimate quality of their service.
Even once the crisis has subsided, operators will have to balance scaling growth against the cost of technical investment and bandwidth efficiency. Even in a world with universal 5G, bandwidth is not an infinite resource.
For example, how can an esports streaming operator grow from 100,000 to a million simultaneous live channels while also transitioning to UHD?
“Companies with planet-scale streaming services like YouTube and Netflix have started to talk about hitting the tech walls,” says Sergio Grce, CEO at codec developer iSize Technologies. “Their content is generating millions and millions of views but they cannot adopt a new codec or build new data centres fast enough to cope with such an increase in streaming demand.”
New codecs are meant to provide an answer to the needs of better quality and greater efficiency but the industry is coming to realise that traditional methods of development have reached the end of the line.
“Many of the basic concepts of video coding such as transcoding, motion estimation and compensation were developed in the 1970s and 1980s,” says Christian Timmerer, head of research and standardisation at Bitmovin. “MPEG-1, the base coding standard, was developed and implemented industry-wide in the early 1990s. Since then there have been incremental developments, optimisation with computing resources and power and memory, but the basic principles are still the same.”
Even Versatile Video Coding (VVC), which MPEG is targeting at ‘next-gen’ immersive applications like 8K virtual reality, is only an evolutionary step forward from HEVC.
“It still uses the block-based hybrid video coding approach, an underlying concept of all major video coding standards since H.261 (from 1988),” explains Christian Feldmann, video coding engineer at Bitmovin. “In this concept, each frame of a video is split into blocks and all blocks are then processed in sequence.”
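The block-splitting step Feldmann describes can be pictured with a short sketch. This is a minimal illustration, not how any real encoder is written: the function names `iter_blocks` and `block_means` are hypothetical, the frame is a plain list of sample rows, and the per-block "processing" is just an average standing in for prediction, transform and entropy coding.

```python
from typing import Iterator, List, Tuple

Frame = List[List[int]]  # rows of sample values (assumed layout for this sketch)


def iter_blocks(frame: Frame, block_size: int) -> Iterator[Tuple[int, int, Frame]]:
    """Yield (top, left, block) in raster order: left to right, top to bottom."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            # Blocks on the right and bottom edges may be smaller than block_size.
            block = [row[left:left + block_size] for row in frame[top:top + block_size]]
            yield top, left, block


def block_means(frame: Frame, block_size: int) -> List[Tuple[int, int, float]]:
    """Stand-in for per-block processing: compute each block's mean sample value."""
    results = []
    for top, left, block in iter_blocks(frame, block_size):
        samples = [s for row in block for s in row]
        results.append((top, left, sum(samples) / len(samples)))
    return results


# A tiny 4x6 "frame" whose sample value equals its column index.
frame = [[x for x in range(6)] for _ in range(4)]
for top, left, mean in block_means(frame, block_size=4):
    print(top, left, mean)
```

Real codecs use far more elaborate partitioning, with variable block sizes and dependencies between neighbouring blocks, but the sequential walk over blocks is the part the quote refers to.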
AI enters the frame
It’s not only the concept which has reached its limit. So too has physical capacity on a silicon chip. Applications increasingly require access to general-purpose silicon such as CPU and GPU cores, DSPs and FPGAs. At the same time, new types of data are rapidly emerging, such as volumetric video for 6-degree-of-freedom experiences.
“From a broadcaster and operator perspective the use of dedicated hardware for encoding streams to distribute to end users is rapidly disappearing as the benefits of pure software implementations that can be rapidly updated and deployed to lower-cost generic servers (or virtualised in cloud environments) have become increasingly apparent,” says Guido Meardi, CEO and co-founder, V-Nova. “However, there remains a huge number of professional and consumer devices from cameras to phones where dedicated hardware video encoding provides the small form factor and low battery power consumption that is critical for them.”
The R&D labs at the organisations whose patented technologies created MPEG standards are looking to machine learning and AI to crack the code.
According to Meardi: “AI/ML techniques differ fundamentally from traditional methods because they can solve multi-dimensional issues that are difficult to model mathematically.”
InterDigital helped develop telecoms standards like 4G and owns patents in HEVC and VVC.
“We think that you could use AI to retain essentially the same schema as currently but using some AI modules,” says Lionel Oisel, director, Imaging Science Lab, InterDigital. “This would be quite conservative and be pushed by the more cost-conscious manufacturers. We also think that we could throw the existing schema away and start again using a complete end-to-end chain for AI – a neural network design.”
InterDigital is working on both but it is not alone. There is a range of different ways that AI/ML techniques can be used within video codecs. Some vendors have used machine learning to optimise the selection of encoding parameters, whereas others have incorporated techniques at a much deeper level, for example, to assist with the prediction of elements of output frames.
First AI-driven solutions
V-Nova claims to be the first company to have standardised an AI-based codec. It teamed with Metaliquid, a video analysis provider, to build V-Nova’s codec Perseus Pro into an AI solution for contribution workflows now enshrined as VC-6 (SMPTE standard 2117).
In addition, during IBC2019, it demonstrated how VC-6 can speed up AI-based metadata content indexing championed by Al Jazeera, Associated Press, and RTÉ – all organisations with huge archives.
V-Nova explains, “Currently, broadcasters can only afford to analyse a small portion of their media archive or a limited sample of frames. They are often forced to reduce the resolution at which the analysis is performed because it’s faster and cheaper to process. However, lower resolutions lose details, which reduce the accuracy when recognising key features like faces or the OCR of small text.”
AP’s director of software engineering, Alan Winthroub, called V-Nova and Metaliquid’s proof-of-concept a “step-change in performance”, adding, “this means we can process more content, more quickly while generating richer data.”
Meardi says AI/ML will never be a complete replacement for the wealth of techniques and tools that make up existing video compression schemes.
“However, there are a large number of areas where AI/ML has the potential to add further optimisations to the existing tools and its use will only increase as the industry gathers greater knowledge and expertise.”
One of the important ways that AI can do this is by calculating bitrate to optimise bandwidth usage while maintaining an appropriate level of quality. This is something that simply cannot be done by hand; there is too much information to process in the time before network conditions change again.
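To make the decision being automated concrete, here is a deliberately simple, rule-based stand-in rather than an AI model: given a measured throughput, pick the highest bitrate that fits within a safety margin. The function name `pick_bitrate`, the ladder values and the 0.8 safety factor are all assumptions made for illustration. A learned system would replace this rule with a prediction drawn from many more signals.

```python
from typing import Sequence


def pick_bitrate(ladder_kbps: Sequence[int], throughput_kbps: float, safety: float = 0.8) -> int:
    """Return the highest rung that fits within a safety margin of measured throughput."""
    if not ladder_kbps:
        raise ValueError("ladder must not be empty")
    budget = throughput_kbps * safety
    affordable = [rate for rate in ladder_kbps if rate <= budget]
    # Fall back to the lowest rung when even that exceeds the budget.
    return max(affordable) if affordable else min(ladder_kbps)


# Hypothetical ladder and throughput samples, in kbps.
ladder = [800, 1800, 3500, 6000]
for measured in (500, 5000, 10000):
    print(measured, "->", pick_bitrate(ladder, measured))
```

The decision has to be repeated every time network conditions change, for every viewer, which is why it is handled in software rather than by hand.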
The video streaming world is also looking at content-aware encoding (CAE) in which an algorithm can understand what kind of content is being streamed, and optimise bitrate, latency and protocols accordingly.
Harmonic offers content-aware technology it brands EyeQ which aims to reduce OTT delivery costs and improve viewer experiences. It claims its CAE tests on 8K live streaming match the efficiency promised by VVC, “proving that we can use today’s technology to deliver tomorrow’s content, and without burning the budget,” says Thierry Fautier, VP of video strategy.
Also using AI-optimised CAE in its coding tools is US developer Haivision. Late last year it bought Lightflow Media Technologies from Epic Labs and subsequently launched Lightflow Encode which uses machine learning to analyse video content (per title or per scene), to determine the optimal bitrate ladder and encoding configuration for each video.
It uses a video quality metric called LQI which represents how well the human visual system perceives video content at different bitrates and resolutions. Haivision claims this results in “significant” bitrate reductions and “perceptual quality improvements, ensuring that an optimised cost-quality value is realised.”
Perceptual quality rather than ‘broadcast quality’ is increasingly being used to rate video codecs and automate bitrate tuning. Metrics like VMAF (Video Multi-method Assessment Fusion) combine human vision modelling with machine learning and seek to understand how viewers perceive content when streamed on a laptop, connected TV or smartphone.
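The general idea of a per-title bitrate ladder can be sketched as follows. This is not Haivision's algorithm and does not compute LQI: the function `build_ladder`, the `min_gain` threshold and every score in `trial_scores` are hypothetical, chosen only to show the selection logic. It assumes trial encodes have already been scored by some perceptual metric on a 0 to 100 scale.

```python
from typing import Dict, List, Tuple

# (height in pixels, bitrate in kbps) -> quality score on a 0-100 scale.
Measurements = Dict[Tuple[int, int], float]


def build_ladder(scores: Measurements, min_gain: float = 2.0) -> List[Tuple[int, int, float]]:
    """For each bitrate keep the best-scoring resolution, then drop rungs that add little quality."""
    best_per_bitrate: Dict[int, Tuple[int, float]] = {}
    for (height, bitrate), score in scores.items():
        current = best_per_bitrate.get(bitrate)
        if current is None or score > current[1]:
            best_per_bitrate[bitrate] = (height, score)

    ladder: List[Tuple[int, int, float]] = []
    for bitrate in sorted(best_per_bitrate):
        height, score = best_per_bitrate[bitrate]
        # Only spend more bits when the score improves by at least min_gain.
        if not ladder or score - ladder[-1][2] >= min_gain:
            ladder.append((bitrate, height, score))
    return ladder


# Hypothetical trial-encode scores for one title (illustrative numbers only).
trial_scores = {
    (540, 1000): 70.0, (720, 1000): 66.0,
    (540, 2500): 78.0, (720, 2500): 84.0, (1080, 2500): 80.0,
    (720, 5000): 88.0, (1080, 5000): 93.0,
    (1080, 8000): 94.0,
}
for bitrate, height, score in build_ladder(trial_scores):
    print(f"{bitrate} kbps at {height}p, score {score}")
```

With these made-up numbers the 8,000 kbps rung is dropped because it adds only one point of quality over the 5,000 kbps rung, which is the kind of saving a content-aware ladder is after.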
It was originated by Netflix and is now open sourced.
“VMAF can capture larger differences between codecs, as well as scaling artifacts, in a way that’s better correlated with perceptual quality,” Netflix explains on its blog. “It enables us to compare codecs in the regions which are truly relevant.”
London-based startup iSize Technologies is working on a novel approach to the compression bottleneck using deep learning as a precursor to the current encoding process. It has been expressly designed to capitalise on the growing trend for perceptual quality metrics such as VMAF.
iSize’s solution is to pass the original (mezzanine) file through a layer of perceptual optimisation before it is encoded as normal using existing encoding platforms.
This ‘precoder’ stage enhances details of the areas of each frame that affect the perceptual quality score of the content after encoding and dials down details that are less important.
“Our perceptual optimisation algorithm seeks to understand what part of the picture triggers our eyes and what we don’t notice at all,” explains Grce.
This not only keeps an organisation’s existing codec infrastructure and workflow unchanged but is also claimed to save 30 to 50 percent on bitrate at a latency cost of just one frame, making it suitable for live as well as VOD.
The company has tested its technology against AVC, HEVC and VVC with “substantial savings” in each case.
The system can be dialled to suit different use cases. Explains Grce: “Some directors of studio distributed VOD content will want to keep some grain in the picture for creative reasons and would not be happy to save 30% bitrate if all that noise was deleted. Gaming content on the other hand might opt for 40-50% savings because that type of content looks more pleasing to our eyes without ‘noise’. Live streaming is somewhere in between [those two applications].”
Grce says the tech is in the late stages of testing with a “global scale VOD platform” and a “large UK live sport streaming platform”, and is beginning last-stage evaluation with a “global social media platform.” A “large gaming hardware manufacturer” has also tested it and it has been demoed in use with AV1 at the invitation of Facebook and Google.
Compression and decompression mechanisms are the drivers behind the delivery of all VOD services from Amazon to Quibi. Adoption of new codecs is essential, and is likely to be quicker than the standard five-year norm because, along with less requirement for hardware encoding, more of the processing will run in the cloud.