We spoke to Sergio Grce, CEO of iSize Technologies (www.isize.co), about the challenges the industry faces with ever-higher-resolution content delivery, and his company's innovative use of AI pre-processing to deliver higher-quality images at lower bitrates, whatever the codec.
Tell us about iSize – when and where was it founded, by whom and with what business objectives?
We are a deep-tech company that uses artificial intelligence to address the growing demand for high-quality video streaming: we reduce bitrates while simultaneously improving the visual quality of streamed video.
I founded iSize in 2016, starting as a company that used AI-based super-resolution to upscale still images and video, creating 4K output from 1080p content.
Since then, we have expanded into solving much bigger problems, primarily how to reduce the data footprint (file size, in practical terms) and bitrate of video content delivered over the Internet or wireless links.
Please explain how you mathematically represent key elements of human visual perception, and what led you to this seemingly breakthrough approach: handling video through pre-processing prior to encoding and, optionally, post-decoding with iSize BitSave?
In the last few years, the video quality assessment community has moved from looking at the signal waveform, e.g., using signal-to-noise ratio and similar low-level metrics for video distortion, to quantifying visual quality using higher-level metrics that have been shown to correspond much better to what people actually see during Video-on-Demand or live playback on their devices. These metrics focus far more on structural aspects of the content, the appearance of textures, motion artifacts in the displayed video, and so on. Examples are multi-scale structural similarity (MS-SSIM) and more sophisticated measures like Video Multimethod Assessment Fusion (VMAF), which was proposed by Netflix and is now widely deployed for assessing VoD or live streams against the source material.
So we have seen a turning point in the quality assessment community: moving from low-level metrics quality assessment to mid- and high-level metrics. At iSize, we work on ways to mathematically represent such metrics as functions that can be minimised at the pixel level by an AI engine that pre-processes the content prior to the actual encoding. A conceptual example of what our AI engine does would be to think of it as a mechanism that enhances textures, geometric structures, foreground and background detail, human faces, and so on, while also attenuating details that the human eye will not focus on.
What this means is that we can keep the bitrate of subsequent encoding the same – or even lower – and when assessed with existing quality metrics the processed and encoded images will actually score better than the same encoding of the source material.
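To make the idea of such metrics concrete, here is a minimal, single-window structural-similarity computation in NumPy. This is a simplified sketch only: production metrics such as MS-SSIM and VMAF use local windows, multiple scales, and fusion of several measures, and this is not iSize's implementation.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Simplified structural similarity computed over the whole image.
    Real metrics (MS-SSIM, VMAF) use local windows and multiple scales;
    this single-window version only illustrates the principle."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
frame = np.tile(np.linspace(0, 255, 64), (64, 1))   # smooth gradient "frame"
degraded = frame + rng.normal(0, 20, frame.shape)   # noisy version
print(ssim_global(frame, frame))     # identical images score 1.0
print(ssim_global(frame, degraded))  # the degraded version scores lower
```

Unlike plain mean squared error, this score rewards preserved structure (means, variances, covariance) rather than pixel-exact reproduction, which is why such metrics track perceived quality better.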
Two factors guide our thinking.
The first is that it is widely recognised that encoding complexity and the associated bitrate savings are hitting a complexity-scaling barrier. Every new standard claims 20-40 percent average bitrate savings for the same video quality, but in reality the saving is much closer to 20 percent, because implementing all the encoding optimisations is extremely computationally intensive. This is true even for large companies with large amounts of live streaming content: even if they want to adopt the latest optimisations allowed by the new standards, they face this computational explosion for diminishing bitrate savings.
Secondly, the move to higher-level perceptual metrics is, in our view, inevitable. People have tried pre-processing video with focus-of-attention methods, custom-made enhancement filters and so on, but the results were never satisfactory, largely because these are hand-crafted methods built on rules and assumptions that do not hold across diverse types of content. In short, they were not learnable, data-driven solutions.
Machine learning has now matured to the point where it is possible to scale this type of learning: multiple loss functions, built from multiple mathematical representations of these metrics, can be used to train neural networks to perform those operations automatically for a wide variety of content.
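The multi-loss idea can be sketched as a weighted sum of objective terms. The weights, function names, and the total-variation term below are illustrative assumptions standing in for the differentiable perceptual-metric proxies a real training pipeline would use; this is not iSize's objective.

```python
import numpy as np

def total_variation(x):
    """Measures high-frequency detail; attenuating it stands in for
    suppressing detail the eye will not focus on."""
    return np.abs(np.diff(x, axis=0)).mean() + np.abs(np.diff(x, axis=1)).mean()

def combined_loss(processed, source, w_fid=1.0, w_tv=0.05):
    """Illustrative multi-term training objective: a fidelity term plus a
    detail-attenuation term. A real system would substitute differentiable
    forms of perceptual metrics such as MS-SSIM or a VMAF proxy."""
    fidelity = np.mean((processed - source) ** 2)
    return w_fid * fidelity + w_tv * total_variation(processed)

rng = np.random.default_rng(1)
source = np.tile(np.linspace(0.0, 1.0, 32), (32, 1))
noisy = source + rng.normal(0, 0.1, source.shape)
print(combined_loss(source, source))  # only the detail term remains
print(combined_loss(noisy, source))   # fidelity and detail terms both rise
```

During training, the pre-processing network's pixels are adjusted to minimise such a composite objective, trading off the competing terms rather than optimising any single metric in isolation.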
That’s what we believe is happening in that technical space, and why it’s important.
What kind of bandwidth savings are you achieving, and does this vary across different types of content? Also, what are the speed advantages?
We have shown on large public data sets, for example the YouTube UGC dataset for video compression research, that when quality is assessed with high-level metrics like VMAF as well as low-level metrics like the structural similarity index (SSIM), the average bitrate saving for the same quality using BitSave across 1,500+ clips is 30 percent. This has been validated with two generations of encoding standards: the older AVC/H.264 and the more recent HEVC/H.265.
(Links for validation available via iSize web site)
Have you got any real-world examples of use in the commercial world yet? If so, please share.
We are currently engaged in a number of commercial and technical discussions, and have also released comparison clips illustrating bitrate savings of 30-40 percent, as well as full bitstreams, all downloadable for independent inspection.
In terms of commercial examples, iSize is attracting considerable interest from industries outside broadcast and production. The fact that we are totally codec agnostic means we can talk to pretty much anyone who wants to reduce video bitrates without compromising quality, and therefore cut costs without sacrificing the viewer experience. Alternatively, we can offer improved visual quality at the usual streaming bitrates of existing standards.
Demand is increasing and a lot of new – and fairly large – streaming and VoD companies are entering the market, many of which are evaluating what we offer, including those involved in live events, gaming, and sport streaming as well as video within social media applications.
In addition, we're currently involved in client testing for use cases like video conferencing, as well as in the defence and security sectors.
How does machine/deep learning play a part in the iSize process? What is it learning from?
Basically, it's what is known in mathematical terms as 'semi-supervised learning'. You can think of it as content passing through a BitSave pipeline that first pre-processes it with iSize's machine-learning IP. Any distortion or loss introduced by whichever encoder sits at ingest is measured on egress, and the machine learning updates the relevant models accordingly.
We learn by emulating the combined process of machine-learning enhancement and encoding, comparing that output with the output of the encoder alone, and using the result to update our models during training. Once trained, our models can be deployed for live operation with no disruption to the usual video encoding and delivery pipeline.
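A toy version of that loop can be written down with stand-ins for each stage: a uniform quantiser playing the role of the lossy encoder, and a one-parameter smoothing filter playing the role of the pre-processing network. All of these names and the codec stand-in are illustrative assumptions, not iSize's pipeline.

```python
import numpy as np

def mock_codec(frame, step=32.0):
    """Stand-in for lossy encode + decode: uniform quantisation."""
    return np.round(frame / step) * step

def preprocess(frame, strength):
    """Toy one-parameter pre-processor: blend with a smoothed copy."""
    smoothed = 0.25 * (np.roll(frame, 1, 0) + np.roll(frame, -1, 0) +
                       np.roll(frame, 1, 1) + np.roll(frame, -1, 1))
    return (1.0 - strength) * frame + strength * smoothed

def quality(decoded, source):
    """Crude quality proxy: negative mean squared error (higher is better)."""
    return -np.mean((decoded - source) ** 2)

rng = np.random.default_rng(2)
source = rng.uniform(0, 255, (32, 32))

# Baseline: encode the source directly, with no pre-processing.
baseline = quality(mock_codec(source), source)

# "Training": sweep the single parameter and keep whichever setting
# scores best against the encoder-only baseline. A real system would
# instead update network weights by gradient descent on differentiable
# perceptual losses.
best_strength, best_quality = 0.0, baseline
for s in np.linspace(0.0, 1.0, 21):
    q = quality(mock_codec(preprocess(source, s)), source)
    if q > best_quality:
        best_strength, best_quality = s, q
```

The essential structure is the comparison: pre-process-then-encode is always scored against encode-alone, and only the pre-processor is updated, so deployment requires no change to the encoder or delivery chain.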
In your contribution to the Future Trends Theater at IBC, you indicated that you had been inspired by work in the audio side of the industry. Please expand.
The evolution of audio encoding reached a similar bottleneck: high-fidelity audio could not be compressed to bitrates below about 1 Mbps. What subsequent technologies did, starting with MPEG Audio Layer 3 (MP3), was to apply psychoacoustic thresholds to audio subbands, removing inaudible frequencies prior to the actual encoding of the stream. That effectively strips out tonal components the human ear cannot pick up in each audio clip, reducing the subsequent encoding bitrate all the way down to 64 kbps while retaining very high fidelity.
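The principle can be sketched with a very crude stand-in for a psychoacoustic model: a flat amplitude threshold applied in the frequency domain. Real codecs use frequency-dependent masking curves per subband, not a flat threshold, so this is an illustration of the idea only.

```python
import numpy as np

# A strong tone plus a much fainter one that a flat "audibility"
# threshold will discard -- a crude stand-in for psychoacoustic masking.
t = np.linspace(0.0, 1.0, 1024, endpoint=False)
signal = np.sin(2 * np.pi * 50 * t) + 0.01 * np.sin(2 * np.pi * 300 * t)

spectrum = np.fft.rfft(signal)
audible = np.abs(spectrum) >= 0.1 * np.abs(spectrum).max()  # flat threshold
pruned = np.where(audible, spectrum, 0.0)
reconstructed = np.fft.irfft(pruned, n=signal.size)

kept = int(np.count_nonzero(pruned))                   # only the strong tone survives
error = float(np.max(np.abs(reconstructed - signal)))  # near the faint tone's level
```

Almost all spectral coefficients are zeroed before any entropy coding would run, yet the reconstruction error stays at the level of the discarded, inaudible component; that is the mechanism that lets the subsequent encoder spend bits only on what the listener can hear.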
As we know, this enabled the web to carry voice, music, and, eventually, associated business transactions.
The video community has been talking about doing something similar for a long time, but video is much more difficult. The human visual cortex is far more complex as a biological neural network, and visual perception is still poorly understood in comparison to auditory perception in the human brain.
Every few months, it seems, there's another new video coding 'standard' announced…
The interesting thing about what we do is that as we test with newer codecs, our bitrate savings increase in comparison to older standards. So, the better the encoder, the higher the attained bitrate savings using our technology.
For example, with the current state of the ongoing MPEG/ITU-T VVC standardisation, we have shown that bitrate savings can exceed 30 percent, and we're still improving.
New standards take a long time to reach the market, so the savings they promise can take years to materialise. What we do delivers savings right now, and it is backward compatible without breaking anything. In that sense, we're codec agnostic.
Tell us about other products in the iSize offering. What are you planning next?
Our aim is to improve things on the server side with machine learning, pre-processing and so on, but further down the road there is no reason why we can’t also enhance the client/decoder side. We’re effectively working our way in from the two edges, the pre- and post-processing sides.
Once we’ve firmly established our commercial footprint with BitSave, we’ll move into discussions on how to improve the client side with post-decoder enhancement.
iSize has recently joined IABM. What are the most useful member benefits to your company?
Being connected to experts in the industry, understanding their current issues with streaming, and seeing what they are focussing on for the future gives us great insight into how we can further improve our technology so that it benefits everyone.
Being part of the IABM community is helping us a lot in understanding these things because we’re a deep-tech company rather than a broadcast manufacturer.
I like to think our membership also helps fellow IABM members understand what we’re doing, which means we all benefit from the cross-pollination of information and ideas.
And finally, membership gives us visibility that we would not otherwise have in a market that is very important to us, which is essential when you are introducing a new technology.