Video data now dominates internet traffic, accounting for over 80% of total bandwidth consumption, with growth driven by streaming platforms (YouTube, Netflix, Prime Video, etc.), social media (Facebook, Instagram, etc.), video conferencing (Zoom, Teams, RingCentral, etc.), and video surveillance systems. As demand rises, so does the need for more efficient video compression techniques that reduce bandwidth usage without sacrificing visual quality.
The Old Paradigm: Traditional Codecs
Traditional video codecs like H.264/AVC and H.265/HEVC rely on hand-engineered techniques such as:
- Block-based motion estimation
- Macroblock partitioning
- Transform coding using Discrete Cosine Transform (DCT)
- Entropy coding
These methods have been refined over decades, culminating in complex standards that offer high compression efficiency. However, each new generation, such as H.266/VVC (Versatile Video Coding), increases in complexity, making implementation, optimization, and hardware support more challenging. Adoption of new standards often lags years behind standardization, partly due to this growing computational burden.
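To make the transform-coding step concrete, here is a minimal 1-D DCT sketch. This is illustrative only: real codecs apply 2-D integer-approximated transforms to pixel blocks, and the sample values below are arbitrary.

```python
import math

def dct_1d(block):
    """Orthonormal Type-II DCT of a block (the transform family used by JPEG/H.264)."""
    n = len(block)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, x in enumerate(block))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct_1d(coeffs):
    """Inverse (Type-III) DCT, recovering the original samples."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            s += scale * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(s)
    return out

# A smooth 8-sample row of pixel values: after the DCT, most of the energy
# lands in the low-frequency coefficients, which is what makes coarse
# quantization of the high frequencies cheap in perceptual terms.
row = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct_1d(row)
```

Quantizing `coeffs` coarsely and entropy-coding the result is, in miniature, what the transform-coding stage of a traditional codec does.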
The New Paradigm: AI-Based Codecs
Recent advances in deep learning have paved the way for a new paradigm in video compression:
Input video → Neural Network → Compressed Data
Instead of relying on fixed transforms and hand-designed heuristics, AI-based codecs learn to compress video data using data-driven models, often trained end-to-end. These models can adapt to content-specific patterns and exploit spatial-temporal redundancies more effectively.
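The pipeline above can be made concrete with a toy sketch: an "analysis transform" maps pixels to a smaller set of integer latents (which an entropy coder would then pack into bits), and a "synthesis transform" maps them back. The single scale factor here is a stand-in for the learned networks and is purely illustrative:

```python
# Toy stand-in for a learned codec: one scale factor plays the role of the
# analysis/synthesis networks. Real codecs learn deep transforms instead.
def encode(pixels, scale=0.1):
    """'Analysis transform' + quantization: pixels -> integer latents."""
    return [round(p * scale) for p in pixels]

def decode(latents, scale=0.1):
    """'Synthesis transform': integer latents -> approximate pixels."""
    return [l / scale for l in latents]

frame = [52, 55, 61, 66]
latents = encode(frame)     # [5, 6, 6, 7] -- few distinct symbols, cheap to entropy-code
restored = decode(latents)  # [50.0, 60.0, 60.0, 70.0] -- lossy reconstruction
```

A real learned codec replaces the scale factor with convolutional networks trained end-to-end and handles quantization during training, but the encode → quantize → decode shape is the same.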
Key characteristics of AI-based codecs:
- Symmetric encoding and decoding: Unlike traditional codecs, where encoding is far more computationally intensive, AI models often have similar complexity for both operations.
- Content-adaptive compression: Neural models can be fine-tuned or dynamically optimized based on the content type (Example: animation, sports, surveillance).
- Potential for higher perceptual quality: Especially at low bitrates, AI codecs often outperform traditional ones in terms of subjective visual quality.
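These characteristics stem from how the models are trained: most learned codecs minimize a rate-distortion objective L = R + λ·D, where R is the estimated bitrate, D a distortion measure, and λ trades one against the other. A minimal sketch with illustrative numbers:

```python
def rate_distortion_loss(bits_per_pixel, mse, lam):
    """Standard learned-codec training objective: rate + lambda * distortion.
    A larger lambda favors quality; a smaller one favors bitrate."""
    return bits_per_pixel + lam * mse

# Two hypothetical encodings of the same frame, compared at lambda = 0.05:
sharp = rate_distortion_loss(bits_per_pixel=0.40, mse=10.0, lam=0.05)  # = 0.90
cheap = rate_distortion_loss(bits_per_pixel=0.10, mse=30.0, lam=0.05)  # = 1.60
# At this lambda, the sharper but costlier encoding has the lower loss and wins.
```

Content-adaptive codecs effectively tune λ (or the model itself) per content type, which is what makes the per-scenario behavior described above possible.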
Current Research and Industry Efforts
Although AI-based codecs show promise, many are still in the research or early deployment stage. Real-time performance, hardware compatibility, and generalization remain open challenges.
Here are some notable initiatives in this space:
- Microsoft DCVC-FM: Deep Contextual Video Compression with Feature Modulation. Delivers high compression efficiency, but not yet capable of real-time processing.
- Apple / WaveOne ELF-VC: An advanced learned video codec. Strong performance, but again limited by high computational demand and real-time constraints.
- Qualcomm NVC (Neural Video Codec): Designed for real-time use with lower resource consumption. However, it currently lags behind in compression efficiency compared to heavier models.
- Deep Render: A startup focused on deploying deep learning-based codecs in real-world applications, balancing compression gains with practical runtime constraints.
Challenges in Adopting AI-Based Video Codecs
While AI-powered video codecs promise major improvements in compression efficiency and perceptual quality, real-world adoption still faces several critical challenges. Unlike traditional codecs, AI-based approaches introduce a new set of complexities, ranging from model specialization to hardware dependencies and infrastructure costs.
1. One-Size-Fits-All Doesn’t Work: The Need for Application-Specific Models
AI codecs are not general-purpose by default. Each application domain has distinct characteristics, and a single model may not perform optimally across all use cases. This requires either fine-tuning or training models specifically for different scenarios:
- Video Conferencing
  - Background is usually static (unless users enable virtual backgrounds).
  - Foreground objects include faces, laptops, and other small gadgets.
  - Requires high face clarity; often benefits from face-aware compression.
  - Traditional codecs use long-term reference frames to exploit temporal redundancy; AI codecs must learn this implicitly.
- Video Surveillance
  - Static backgrounds, but lighting conditions vary across time.
  - Foreground includes people, vehicles, and animals.
  - Needs to preserve detail during event triggers (Example: motion).
  - Models must adapt to varying lighting and compress efficiently during inactivity.
- Video Streaming
  - Highly diverse content: every scene can have a new background and subject.
  - Requires models trained on large, diverse datasets to generalize well.
  - Compression must balance bitrate, visual quality, and latency.
As a result, training generalized models or maintaining multiple specialized models per use case significantly increases development complexity and deployment overhead.
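Maintaining multiple specialized models can be pictured as a simple registry keyed by use case. The model names and selection heuristic below are hypothetical, purely to illustrate the per-scenario dispatch described above:

```python
# Hypothetical model registry: the names and fallback heuristic are
# illustrative only, not a real product or API.
SPECIALIZED_MODELS = {
    "conferencing": "codec-face-aware-v1",  # face clarity, static background
    "surveillance": "codec-lowlight-v2",    # lighting changes, cheap when idle
    "streaming":    "codec-general-v3",     # trained on large, diverse data
}

def select_model(content_type):
    """Fall back to the general-purpose model for unknown content types."""
    return SPECIALIZED_MODELS.get(content_type, SPECIALIZED_MODELS["streaming"])
```

Every entry in such a registry is a model that must be trained, validated, versioned, and shipped separately, which is exactly where the development and deployment overhead comes from.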
2. Hardware Limitations: The Need for NPUs and AI Acceleration
AI codecs are compute-intensive, especially during encoding and decoding. Unlike traditional codecs, which can run on general-purpose CPUs with hardware acceleration and dedicated video processing units, AI models typically require Graphics Processing Units (GPUs), Neural Processing Units (NPUs), or similar dedicated AI accelerators.
- Most current consumer devices (smartphones, TVs, laptops) lack sufficient on-device AI compute for real-time encoding/decoding.
- NPUs are gradually being integrated into mobile SoCs, and widespread NPU availability is expected by 2030.
- Until then, AI codec deployment at scale will remain limited to cloud or high-end edge devices, increasing cost and latency.
Adopting AI codecs at scale would also require a complete re-architecture of existing video infrastructure, including hardware encoders, decoders, and content delivery pipelines, an effort that comes with significant cost.
3. New Quality Metrics Needed: Traditional Metrics Fall Short
Compression quality in AI codecs is tightly linked to the underlying model architecture and training data. However, traditional video quality metrics like PSNR (Peak Signal-to-Noise Ratio), MSE (Mean Squared Error), and SSIM (Structural Similarity Index) are not adequate to evaluate perceptual quality for AI-based compression.
AI codecs often optimize for human perception rather than exact pixel reconstruction, which means a lower PSNR might still look better visually. There’s a need for new perceptual quality metrics tailored to ML codecs, capable of assessing:
- Temporal consistency
- Perceived sharpness
- Scene integrity
- Task-aware quality (Example: face detection performance in video calls)
Future models may also need adaptive quality scoring, where metrics shift based on content and context.
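PSNR's limitation is easy to demonstrate: it is a pure function of mean squared error, so two distortions with very different visual character can score identically. A minimal sketch with arbitrary pixel values:

```python
import math

def psnr(original, reconstructed, max_value=255):
    """Peak Signal-to-Noise Ratio in dB; a pure function of mean squared error."""
    n = len(original)
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / n
    return float("inf") if mse == 0 else 10 * math.log10(max_value ** 2 / mse)

reference = [100, 100, 100, 100]
spread    = [ 98, 102,  98, 102]  # error spread evenly (often hard to notice)
blocky    = [100, 100, 100, 104]  # same error energy in one pixel (a visible block)

# Both distortions have MSE = 4.0, hence identical PSNR,
# despite very different perceptual impact.
```

This is precisely why perceptually-oriented metrics, and ultimately ML-codec-specific ones, are needed.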
4. Lack of Standardization
Traditional codecs follow strict standards (Example: H.264/AVC, H.265/HEVC, H.266/VVC), ensuring interoperability between encoders and decoders.
In contrast, AI codecs are not yet standardized.
Each vendor or research group uses custom model architectures and data pipelines. This leads to vendor lock-in and incompatibility: a video encoded by one AI codec cannot be decoded by another unless the exact same model is used.
Without a standardized AI codec framework, widespread adoption across platforms and devices remains a barrier.
5. High Infrastructure and Upgrade Costs
Deploying AI codecs requires major changes across the entire video delivery ecosystem:
- Smart TVs, mobile phones, and media players will need hardware upgrades to support real-time ML-based decoding.
- Typical device upgrade cycles vary by device type (smart TVs, for instance, are usually replaced every 2 to 6 years), meaning adoption will be slow, especially for embedded devices.
- Cloud-based encoding solutions could offer a stopgap, but increase operational cost and energy consumption.
This high cost of transitioning to AI codec infrastructure, both in terms of compute and compatibility, makes industry-wide rollout a long-term vision rather than an immediate reality.
Conclusion
AI-based video codecs hold the potential to revolutionize video compression by delivering better quality at lower bitrates and adapting to content intelligently. However, challenges around model specialization, hardware acceleration, quality assessment, standardization, and infrastructure cost must be addressed before these codecs can be adopted at scale.
As research continues and hardware evolves, AI codecs could become mainstream by 2030, but overcoming these obstacles will require collaboration between researchers, industry stakeholders, and standards bodies.
Disclaimer: The views and opinions expressed in this article are my own and based on personal research and understanding. This content is not affiliated with, endorsed by, or representative of any specific company, organization, or product.