Build Agentic Video RAG with Strands Agents and Containerized Infrastructure


🇻🇪🇨🇱 Dev.to Linkedin GitHub Twitter Instagram Youtube
Linktr

In the previous post, “Ask Your Video: Build a Containerized RAG Application for Visual and Audio Analysis”, we showed how to build a scalable containerized video processing pipeline using AWS Step Functions and Amazon ECS. This post shows how to transform that infrastructure into intelligent agent tools using the Strands Agents framework.

You’ll discover how your existing containerized video processing capabilities become autonomous AI tools that understand natural language requests and execute video analysis workflows.



Why Transform Your Pipeline into Agent Tools?

Your containerized video processing system handles workflows efficiently, but interacting with these capabilities requires technical knowledge of APIs, parameters, and workflows. By integrating with Strands Agents, you can:

  • Interact using natural language with your video processing pipeline
  • Automate multi-step workflows through intelligent decision-making
  • Combine video analysis with other business processes
  • Scale intelligent interactions across your organization



Architecture: From Pipeline to Agent Tools

The transformation builds upon your existing infrastructure:



Your Existing Containerized Pipeline

⚠️ Infrastructure Requirement: Deploy the container-video-embeddings CDK stacks before running this notebook.



🔄 Processing Workflow

The architecture processes videos through these automated steps:

  1. 📤 Video Upload: Video uploaded to S3 bucket triggers Lambda function
  2. ⚡ Workflow Orchestration: Step Functions initiates parallel processing streams
  3. 🎬 Visual Pipeline: ECS tasks extract frames using containerized FFmpeg → Bedrock generates image embeddings
  4. 🎤 Audio Pipeline: Transcribe converts speech to text → Semantic chunking → Bedrock generates text embeddings
  5. 📊 Vector Storage: Lambda functions store all embeddings in Aurora PostgreSQL with pgvector
  6. 🌐 API Access: API Gateway provides endpoints for search queries and status monitoring



New Agent Integration Layer

  • Strands Agents providing intelligent orchestration
  • Custom tools wrapping your existing APIs
  • Memory capabilities for context-aware interactions
  • Multi-modal reasoning combining visual and audio insights

The agent intelligently orchestrates your containerized infrastructure through automatic workflow management, upload detection, and status monitoring.
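Conceptually, a custom tool is just a Python function the agent can call; in Strands you expose it by registering the function with the `@tool` decorator. The sketch below wraps the retrieval API: the endpoint URL is a placeholder (the real one is read from SSM), and the payload shape mirrors the validation snippet shown later in this post.

```python
import json
from urllib import request

# Placeholder endpoint; the deployed URL is stored in SSM at /videopgvector/api_retrieve.
API_ENDPOINT = "https://example.execute-api.us-east-1.amazonaws.com/retrieve"

def build_retrieve_payload(query: str, k: int = 5) -> dict:
    """Request body expected by the retrieval API."""
    return {"query": query, "method": "retrieve", "k": k}

def search_video_content(query: str, k: int = 5) -> dict:
    """Semantic search over processed video content via the deployed API.

    In Strands, decorating this function with @tool exposes it to the agent.
    """
    data = json.dumps(build_retrieve_payload(query, k)).encode()
    req = request.Request(API_ENDPOINT, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req, timeout=10) as resp:
        return json.loads(resp.read())
```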



Building the Video Processing Agent Tool



Prerequisites

  1. Complete the containerized setup from the previous blog post



Clone & Setup Environment

git clone https://github.com/build-on-aws/langchain-embeddings.git
cd langchain-embeddings/notebooks

# Create virtual environment
python -m venv venv

# Activate environment (macOS/Linux)
source venv/bin/activate
# Or on Windows
# venv\Scripts\activate

# Install requirements
pip install -r requirements.txt

The notebook 07_video_embeddings_container_with_strands_agents.ipynb demonstrates how to create a video_embeddings_container tool that interfaces with your containerized pipeline. The tool provides video processing and retrieval capabilities using the AWS infrastructure deployed by the container-video-embeddings CDK stacks.

Features:

  • Upload videos to S3 and trigger automated processing
  • Search video content using semantic similarity
  • List processed videos and their metadata
  • Uses deployed AWS infrastructure (no local processing required)
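The upload path in the first feature is a single S3 put; the S3 event notification then triggers processing automatically. A minimal sketch, assuming a `videos/` key prefix (check your stack's actual trigger configuration):

```python
import os

def video_key(filename: str, prefix: str = "videos/") -> str:
    """Derive the S3 object key for an uploaded video (the prefix is an assumption)."""
    return prefix + os.path.basename(filename)

def upload_video(filename: str, bucket: str) -> str:
    """Upload a local video file; the resulting S3 event kicks off the pipeline."""
    import boto3  # deferred import so video_key stays testable without AWS deps
    key = video_key(filename)
    boto3.client("s3").upload_file(filename, bucket, key)
    return key
```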



🔍 Infrastructure Validation

Before starting, validate that the infrastructure exists by running these lines from the notebook:

import boto3
import requests

AWS_REGION = "us-east-1"  # set to the region where your stacks are deployed

ssm_client = boto3.client('ssm', region_name=AWS_REGION)
stepfunctions_client = boto3.client('stepfunctions', region_name=AWS_REGION)

try:
    api_endpoint = ssm_client.get_parameter(Name="/videopgvector/api_retrieve", WithDecryption=True)["Parameter"]["Value"]
    bucket_name = ssm_client.get_parameter(Name="/videopgvector/bucket_name", WithDecryption=True)["Parameter"]["Value"]
    state_machine_arn = ssm_client.get_parameter(Name="/videopgvector/state_machine_arn", WithDecryption=True)["Parameter"]["Value"]

    print("✅ Infrastructure parameters found:")
    print(f"🔗 API Endpoint: {api_endpoint}")
    print(f"🪣 S3 Bucket: {bucket_name}")
    print(f"⚙️ State Machine: {state_machine_arn.split(':')[-1]}")

    test_payload = {"query": "test", "method": "retrieve", "k": 1}
    response = requests.post(api_endpoint, json=test_payload, timeout=10)
    print(f"🌐 API Status: {response.status_code} - {'✅ Connected' if response.status_code == 200 else '❌ Error'}")

except Exception as e:
    print(f"❌ Infrastructure check failed: {str(e)}")
    print("💡 Deploy the container-video-embeddings stacks first")



🧠 Memory-Enhanced Strands Agent

To add S3 memory capabilities for personalized cloud interactions, fill in your Amazon S3 bucket information in the code below.

For detailed S3 memory tool explanation, see notebook 06.

import os

from strands import Agent
from s3_memory import s3_vector_memory

# S3 Vectors Configuration
os.environ['VECTOR_BUCKET_NAME'] = 'YOUR-S3-BUCKET'            # Your S3 Vector bucket
os.environ['VECTOR_INDEX_NAME'] = 'YOUR-VECTOR-INDEX'          # Your vector index
os.environ['AWS_REGION'] = 'us-east-1'                         # AWS region
os.environ['EMBEDDING_MODEL'] = 'amazon.titan-embed-text-v2:0' # Bedrock embedding model

# model, video_embeddings_aws, display_video_images, and CLOUD_SYSTEM_PROMPT
# are defined in earlier cells of the notebook
cloud_memory_agent = Agent(
    model=model,
    tools=[video_embeddings_aws, display_video_images, s3_vector_memory],
    system_prompt=CLOUD_SYSTEM_PROMPT
)

print("✅ Memory-enhanced agent created!")




🎬 Video Processing

This notebook uses the same video from notebooks 05 and 06: AWS re:Invent 2024 session on “AI self-service support with knowledge retrieval using PostgreSQL” (YouTube link). The video covers:

  • Vector databases and embeddings for AI applications
  • Amazon Aurora PostgreSQL with pgvector for scalable vector storage
  • RAG (Retrieval Augmented Generation) implementations
  • Amazon Bedrock Agents for intelligent customer support
  • Real-world use cases and technical demonstrations

This content works well for testing our video analysis system with technical presentations that include both visual slides and detailed explanations.



🧪 Test Agents with Natural Language Video Analysis

Interact with your containerized pipeline through natural conversation:

print("🧪 Testing Cloud Video Agent")
print("=" * 50)

response1 = cloud_video_agent(
    "Search for content about 'aurora database scalability' using the cloud infrastructure. Return the top 5 results and explain what you found."
)
print(f"Cloud Agent Response: {response1.message}")

List Videos:

response2 = cloud_video_agent(
    "List all processed videos in the cloud storage and provide a summary of available content with processing statistics."
)
print(f"\nCloud Storage Summary: {response2.message}")

Try the memory capabilities:

USER_ID = "cloud_test_user"

response3 = cloud_memory_agent(f"""For user {USER_ID}:
1. Store my interest in cloud architecture and scalable video processing
2. Search the video content for 'production deployment' and 'scalability'
3. Remember key insights about cloud-native video processing
4. Provide a personalized summary based on my cloud architecture interests""")

print(f"\nPersonalized Cloud Analysis: {response3.message}")


response5 = cloud_memory_agent(f"I am {USER_ID}. Give me a summary of the content of each video")
print(f"\nPersonalized Cloud Analysis: {response5.message}")




Conclusion

By integrating Strands Agents with your containerized video processing pipeline, you’ve created a system that combines scalable infrastructure with intelligent interaction capabilities. This approach provides:

  • Natural language access to video processing workflows
  • Intelligent orchestration of containerized services
  • Scalable processing with cost-effective resource management
  • Multi-modal analysis combining visual and audio insights

The combination of containerized infrastructure with Strands Agents creates a foundation for advanced video analysis applications that can grow with your needs. Your containerized pipeline now responds to natural language, makes intelligent decisions, and provides comprehensive video analysis capabilities through conversational interfaces.



Ready to create your own Strands agent?

Here are some resources:


Thanks!





