NVIDIA GTC 2023: Bridging the Gap Between “Codec” and “Avatar”

iSize News

iSIZE CTO Yiannis Andreopoulos gave an insightful presentation at the NVIDIA GTC Developer Conference on March 23rd 2023, where he talked about bridging the gap between “Codec” and “Avatar”.

During the presentation, Yiannis discussed current neural avatar solutions and three challenges they need to overcome: reliability at scale, minimal or no offline data-capture complexity, and minimal training/inference complexity. He presented some avenues to resolve these by merging neural avatar proposals with conventional video/3D encoding standards. Such a merger can bridge the gap between traditional “codecs” and photorealistic neural avatars, offering significant runtime and bit-rate efficiency versus existing work.

Registration is required to watch the presentation; registering is free.


Q&A from the session

The method crops the eyes and mouth as regions of interest sent to the encoder. Can this be generalized to other regions of the video frame, or to other forms of content beyond conversational video?
Yes, that can be done, and we have already done work on this. The key is to put the emphasis on parts of the content that would otherwise cause geometric or uncanny-valley artifacts, and to keep ROIs small and spatio-temporally coherent. This can work for any content, including gaming and other forms of media, as long as the data manifold can be learned by the warp engine.
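As a rough illustration of the ROI idea (the landmark coordinates and crop size below are invented; iSIZE's actual warp engine and ROI-selection logic are not public), small crops centred on the same facial landmarks frame after frame could be taken like this:

```python
import numpy as np

def crop_rois(frame: np.ndarray, landmarks: dict, size: int = 64) -> dict:
    """Cut small square regions of interest (ROIs) around facial landmarks.

    `landmarks` maps a region name (e.g. 'left_eye', 'mouth') to an (x, y)
    centre; in practice these would come from a face-landmark detector.
    Keeping the crops small and centred on the same landmarks in every
    frame keeps them spatio-temporally coherent for the encoder.
    """
    h, w = frame.shape[:2]
    half = size // 2
    rois = {}
    for name, (x, y) in landmarks.items():
        # Clamp the crop so it stays fully inside the frame.
        x0 = min(max(x - half, 0), w - size)
        y0 = min(max(y - half, 0), h - size)
        rois[name] = frame[y0:y0 + size, x0:x0 + size]
    return rois

# Hypothetical 720p frame and landmark positions.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
rois = crop_rois(frame, {"left_eye": (500, 300), "mouth": (640, 520)})
print({k: v.shape for k, v in rois.items()})
```

The small, fixed-size crops are what would be handed to the standard encoder, rather than the full frame.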

Can the method operate with other NVIDIA encoders beyond nvenc HEVC?
Yes, any standard encoder (for example nvenc AVC or AV1) – or even a neural encoder – can be used.
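For instance, with an ffmpeg build that has NVENC enabled and a supported NVIDIA GPU, swapping the standard encoder is just a matter of selecting a different codec (file names here are placeholders):

```shell
# Encode the same source with different NVENC codecs.
ffmpeg -i input.mp4 -c:v h264_nvenc -b:v 2M out_avc.mp4   # AVC/H.264
ffmpeg -i input.mp4 -c:v hevc_nvenc -b:v 2M out_hevc.mp4  # HEVC/H.265
ffmpeg -i input.mp4 -c:v av1_nvenc  -b:v 2M out_av1.mp4   # AV1 (recent GPUs)
```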

How do you ensure that the rendered output is interpretable and controllable when deployed at scale?
Good question! As with other generative AI frameworks, we have to put some ‘guardrails’ in place to ensure the warp engine does not generate uncanny-valley results. The first is quality scoring, using standard metrics tuned a priori against MOS scores we have collected. The second is ROI selection focusing on areas known to cause issues, e.g., the eyes and mouth. If these are handled well, any remaining artifact will look like an encoding or camera/lighting capture artifact.
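A minimal sketch of such a quality-scoring guardrail (PSNR stands in here for whichever metrics are actually used, and the 32 dB threshold is a made-up illustration; iSIZE's real scoring proxy is tuned against collected MOS data):

```python
import numpy as np

def psnr(ref: np.ndarray, test: np.ndarray) -> float:
    """Peak signal-to-noise ratio (dB) between two 8-bit images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(255.0 ** 2 / mse))

def passes_guardrail(ref_roi: np.ndarray, rendered_roi: np.ndarray,
                     threshold_db: float = 32.0) -> bool:
    """Accept the rendered ROI only if its quality clears the bar;
    otherwise the caller would fall back to conventionally coded pixels."""
    return psnr(ref_roi, rendered_roi) >= threshold_db

ref = np.full((64, 64), 128, dtype=np.uint8)
ok = passes_guardrail(ref, ref + 1)               # mild distortion
bad = passes_guardrail(ref, np.zeros_like(ref))   # severe distortion
print(ok, bad)  # -> True False
```

The point of the gate is that a failing rendering never reaches the viewer as a geometric or uncanny-valley artifact, only (at worst) as an ordinary coding artifact.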

In your experience, what are some of the most important factors to consider when evaluating the visual quality of neural avatars?
Good question. This depends on the type of avatar (photorealistic or not). For photorealistic avatars, we recommend P.910 testing where possible, which can also be done without a reference, e.g., P.910 ACR.
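As a toy illustration of absolute category rating (ACR, as defined in ITU-T P.910): each viewer rates a clip on a five-point scale without seeing a reference, and the mean opinion score (MOS) is the average of the ratings. The ratings below are invented:

```python
# ACR uses a 5-point scale: 1=Bad, 2=Poor, 3=Fair, 4=Good, 5=Excellent.
def mos(ratings: list) -> float:
    """Mean opinion score: the arithmetic mean of per-viewer ACR ratings."""
    assert all(1 <= r <= 5 for r in ratings), "ACR ratings are 1..5"
    return sum(ratings) / len(ratings)

print(mos([4, 5, 4, 3, 5]))  # -> 4.2
```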

Would this work for live applications?
Yes, we show a live demo during the talk.

What kind of feedback have you received from users?
Quite interestingly, users do not perceive this to be anything other than conventional video (in the 2D case), which is exactly the intent! However, as we move to versions that change non-essential aspects of the appearance (e.g., hair or clothes), the feedback is quite encouraging: users see how this form of generative AI can offer benefits beyond just bitrate reduction, such as changing clothes or hairstyle in a remote-presence framework.

Are you working only with avatars extracted from 2D videos, or also with objects generated from volumetric capture (volcap) or other 3D captures?
Both 2D and 3D, but at the moment the 3D is trained offline with a few captures. Going to volcap/live 3D capture would definitely be the next frontier!

For BitGen2D, does the training need to learn the characteristics of the specific real codec being used? Do you need to retrain for different codecs?
There is no need to train for the encoder, as the encoder only gets ‘the leftovers’. We do need to guardrail the model so that the presentation is a faithful rendering of the speaker, and this is done mainly via the quality-scoring proxy and the ROI selection.

What do you think are the next steps in the development of neural avatars, and where do you see the field heading in the coming years?
Generative AI will definitely have a massive impact in this domain. We see the biggest impact being in the intelligent fusion of conventional technologies (like 2D/3D encoders) with gen-AI frameworks, guardrailing them in order to save compute, ensure that deployment at scale is safe and controllable, and deliver a very enjoyable experience for users!

Beyond the presentation, will you be releasing any data and video samples for inspection, or other parts of the system? Where can interested parties find more information?
Thank you for the question. Yes, head to our solutions page for more information.