A Cloud-Native Multimodal Video Understanding Platform
Using Deep Learning, Multimodal Fusion, and Large Vision-Language Models
Achyut Niroula
Department of Computer Science, Nipissing University, North Bay, Ontario, Canada
COSC 4896 – Honours Research I · Winter 2026
ABSTRACT
Video understanding is a rich but analytically demanding problem: a short clip simultaneously encodes spatial structure, object identities, depth relationships, temporal dynamics, spoken language, and environmental audio. Most existing systems address only a subset of these modalities, producing analyses that are either too shallow for practical use or too narrowly scoped for general application. This paper presents a Cloud-Native Multimodal Video Understanding Platform, a system with a fully deployed GPU worker that fuses nine specialized AI perception models into a structured Unified Scene Representation (USR), which drives Qwen2-VL dense captioning and Claude-powered narrative synthesis. Development proceeded through three iterative phases, each motivated by concrete technical challenges: a CPU-only cloud architecture proof-of-concept (Phase 1), a GPU-accelerated three-model YOLO ensemble with Weighted Boxes Fusion (Phase 2), and a full nine-model multimodal perception pipeline with a MultiModal Fusion Engine (Phase 3). The completed system processes a 33-second video in approximately 25 minutes on an NVIDIA T4 GPU (EC2 g4dn.xlarge), constrained by the 16 GB VRAM budget of the T4 and an AWS quota ceiling that prevented access to higher-throughput GPU instances. Evaluation across a diverse set of test videos demonstrates that the multimodal fusion approach produces qualitatively richer and more accurate narratives than direct VLM inference alone. Results, limitations, and future directions are discussed.
I. INTRODUCTION
Video is simultaneously the most information-dense and most analytically challenging sensor modality available to automated systems. A single 30-second clip at standard resolution encodes thousands of individual frames, a complete audio track with speech and environmental events, temporal dynamics that describe how scenes evolve, and high-level semantic context that integrates all of the above. Despite decades of progress in computer vision and speech processing, most production-grade video analysis systems remain siloed: object detectors process individual frames without temporal awareness, transcription engines ignore visual context, and scene classifiers provide coarse labels that discard the precise spatial and temporal structure of what actually occurs.
The central thesis of this project is that genuine video understanding requires fusing every available modality into a structured intermediate representation, and then reasoning over that representation using a vision-language model capable of generating coherent natural language. No single model, however powerful, can replicate this integration alone: a large language model without visual grounding hallucinates visual details; an object detector without language cannot express relationships between objects; a transcription system without visual context cannot annotate what is happening during silences.
This paper documents the design, implementation, evaluation, and lessons learned from a Cloud-Native Multimodal Video Understanding Platform, developed over four months as part of COSC 4896 Honours Research I at Nipissing University. The system accepts arbitrary MP4 videos and produces richly structured, timestamped narrative reports by orchestrating nine specialized AI perception models through a novel MultiModal Fusion Engine.
A. Research Objectives
The project was guided by five primary research objectives:
1) Design and implement a scalable, cloud-native architecture capable of ingesting user-uploaded videos and returning structured AI analysis within a practical latency budget.
2) Build a modular, multi-model perception pipeline that extracts complementary visual, spatial, temporal, depth, and audio features from each video.
3) Develop a MultiModal Fusion Engine that aligns heterogeneous model outputs into a structured Unified Scene Representation (USR) suitable for grounding a vision-language model.
4) Integrate and evaluate Qwen2-VL for per-frame dense captioning and Claude (claude-sonnet-4-6) for final narrative generation.
5) Quantitatively evaluate the pipeline against baseline approaches across diverse video categories using automated and human-based metrics.
B. Contributions
The primary contributions of this work are:
1) A three-phase iterative development methodology for cloud-native AI systems, documenting the specific challenges encountered and engineering solutions applied at each phase.
2) A MultiModal Fusion Engine and Unified Scene Representation (USR) schema that aligns outputs from nine heterogeneous perception models into a single structured intermediate representation for VLM grounding.
3) An empirical evaluation demonstrating that multimodal fusion improves narrative quality over direct VLM inference, measured using ROUGE-L, BERTScore, and human evaluation rubrics.
4) An end-to-end cloud system on AWS with a fully deployed worker (S3, SQS, DynamoDB, Cognito, EC2 g4dn.xlarge) and a publicly accessible Next.js frontend dashboard.
II. RELATED WORK
Early video analysis systems addressed modalities in isolation. Convolutional neural network-based action recognizers such as two-stream networks [11] and 3D ConvNets processed temporal sequences of frames but had no language output capability. Audio processing was handled by entirely separate pipelines, and outputs were coarse class labels rather than natural language descriptions.
Recent large vision-language models (VLMs) such as Video-LLaMA [17] and Gemini 1.5 Pro accept raw video frames directly and produce free-form descriptions without a separate perception pipeline. While these models benefit from unified end-to-end training, they require substantially more VRAM than is available on cost-effective cloud hardware, and they do not expose structured intermediate representations (bounding boxes, depth maps, track IDs) useful for downstream applications. Dense video captioning systems such as Vid2Seq [14] generate timestamped captions but focus on single-modality visual input without audio integration. VALOR [3] addresses audio-visual-language alignment but is not deployed as an accessible cloud pipeline.
Multimodal ensemble approaches that combine specialized models have been explored in the document understanding and image captioning domains. In video, prior work combining object detection with speech transcription has typically been application-specific (e.g., sports commentary, surveillance). The present work proposes a domain-agnostic, cloud-native architecture that fuses nine modalities through a structured representation, specifically targeting the production of human-readable narrative summaries. To the authors' knowledge, no prior system combines this breadth of modalities with a structured fusion engine and LLM-based narrative synthesis in a deployed cloud application.
III. SYSTEM ARCHITECTURE
The platform is architected across three tiers. The presentation tier is a Next.js web dashboard that handles user authentication via AWS Cognito and direct-to-S3 video uploads using presigned URLs. The application tier is a FastAPI REST backend managing workflow orchestration, presigned URL generation, DynamoDB state management, and SQS message dispatch. The compute tier is a containerized GPU worker deployed on an AWS EC2 g4dn.xlarge instance (NVIDIA T4 GPU, 16 GB VRAM, 4 vCPU), which executes the full perception and fusion pipeline.
The end-to-end data flow proceeds as follows: (1) a user uploads an MP4 through the frontend; (2) the backend generates a presigned S3 PUT URL and the video is uploaded directly to S3; (3) the backend enqueues a job message on SQS and sets video status to 'queued' in DynamoDB; (4) the GPU worker polls SQS, downloads the video from S3, and executes the full pipeline; (5) results (JSON analysis, narrative text, thumbnail) are written back to S3 and DynamoDB; (6) the frontend polls the backend for status updates and renders the completed analysis.
Fig. 1. Three-tier cloud architecture: Next.js frontend → FastAPI backend → GPU worker on AWS g4dn.xlarge, connected through S3, SQS, DynamoDB, and Cognito.
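As an illustrative sketch rather than the production backend code, the presigned-upload and enqueue step of this flow can be expressed with boto3 roughly as follows; the bucket, queue, and table names are placeholders introduced for this example:

```python
# Sketch of the backend's upload/enqueue step (steps 2-3 of the data flow).
# Bucket, queue URL, and table names below are assumed, not the real ones.
import json
import time
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

BUCKET = "video-uploads-example"                                            # assumed
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/video-jobs"   # assumed
TABLE = dynamodb.Table("videos-example")                                    # assumed

def create_upload_job(video_id: str, user_id: str) -> dict:
    """Generate a presigned PUT URL, mark the job as queued, and enqueue it."""
    key = f"uploads/{user_id}/{video_id}.mp4"
    put_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": key, "ContentType": "video/mp4"},
        ExpiresIn=900,  # 15-minute expiry for the direct-to-S3 upload
    )
    TABLE.put_item(Item={
        "video_id": video_id,
        "user_id": user_id,
        "status": "queued",
        "s3_key": key,
        "created_at": int(time.time()),
    })
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"video_id": video_id, "s3_key": key}),
    )
    return {"upload_url": put_url, "video_id": video_id}
```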
IV. ITERATIVE DEVELOPMENT: PHASES, CHALLENGES, AND SOLUTIONS
Development proceeded through three evolutionary phases, each driven by specific technical challenges encountered during implementation. This section documents both the engineering decisions made and the obstacles overcome.
A. Phase 1 — Cloud Architecture and CPU Baseline
Goal: Validate the complete cloud-native pipeline from video upload to result delivery, using a minimal but functional inference worker.
Phase 1 established the AWS infrastructure and deployed a CPU-only worker running YOLOv8n for object detection. The worker sampled frames at 1 FPS, ran detection on each frame, and assembled a simple template-based narrative from detection labels. AWS Cognito was integrated for user authentication, and SQS was used for asynchronous job dispatch.
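A minimal sketch of this Phase 1 worker loop is shown below, assuming OpenCV frame decoding and the Ultralytics YOLOv8 API; helper and variable names are illustrative rather than the original implementation:

```python
# Illustrative Phase 1 loop: sample frames at ~1 FPS, run YOLOv8n on CPU,
# and build a simple template narrative from the detected labels.
import cv2
from collections import Counter
from ultralytics import YOLO

def analyze_video(path: str) -> str:
    model = YOLO("yolov8n.pt")               # nano model, CPU inference
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(int(round(fps)), 1)           # keep roughly one frame per second
    labels = Counter()
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            result = model(frame, verbose=False)[0]
            labels.update(result.names[int(c)] for c in result.boxes.cls)
        idx += 1
    cap.release()
    top = ", ".join(f"{count} {label}(s)" for label, count in labels.most_common(5))
    return f"The video contains: {top}." if top else "No objects detected."
```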
Key Challenge — Infrastructure Complexity: Coordinating five AWS services (S3, SQS, DynamoDB, Cognito, EC2) as a first-time AWS deployment introduced significant configuration complexity. IAM role permissions, SQS visibility timeouts, and S3 presigned URL expiry required careful calibration. An early misconfiguration caused job messages to become invisible after the 30-second default visibility timeout, making jobs appear processed when they had silently failed.
Solution: The SQS visibility timeout was extended to 300 seconds to match observed processing time. A heartbeat mechanism was implemented in the worker that renews the message visibility every 60 seconds during processing, preventing premature re-delivery. IAM permissions were tightened to least-privilege after full functionality was confirmed.
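A sketch of the heartbeat mechanism, assuming the standard boto3 SQS client and the timeout values quoted above, is given below:

```python
# Sketch of the worker-side heartbeat that renews SQS message visibility
# every 60 seconds during processing (function names are assumptions).
import threading
import boto3

sqs = boto3.client("sqs")

def start_heartbeat(queue_url: str, receipt_handle: str,
                    interval: int = 60, timeout: int = 300) -> threading.Event:
    """Periodically extend the message's visibility so it is not re-delivered."""
    stop = threading.Event()

    def beat():
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=timeout,   # reset the 300 s window
            )

    threading.Thread(target=beat, daemon=True).start()
    return stop

# Usage: stop = start_heartbeat(QUEUE_URL, msg["ReceiptHandle"])
#        ... process the video ...
#        stop.set()   # stop renewing once the job has finished
```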
Key Challenge — CPU Throughput: CPU-only YOLOv8n inference averaged approximately 2.5 seconds per frame, making a 33-frame video take over 80 seconds for detection alone. The template-based narrative was shallow and provided no temporal reasoning.
Outcome: Phase 1 successfully validated the full cloud pipeline. The CPU bottleneck motivated the GPU upgrade in Phase 2.
B. Phase 2 — GPU Acceleration and Detection Ensemble
Goal: Improve detection quality and throughput by introducing GPU acceleration and a multi-model detection ensemble.
Phase 2 upgraded the worker to an EC2 g4dn.xlarge instance and introduced a three-model YOLO ensemble (YOLOv8s, YOLOv8m, and YOLO-World), whose outputs were merged using Weighted Boxes Fusion (WBF) [12]. ByteTrack multi-object tracking and Mask2Former panoptic segmentation were also introduced, enabling temporal continuity in object identity and richer scene decomposition.
Key Challenge — GPU Memory Contention: Loading all three YOLO models simultaneously alongside Mask2Former consumed approximately 14 GB of VRAM on the T4, leaving insufficient headroom on the 16 GB budget. Attempting to load Mask2Former and CLIP simultaneously triggered an out-of-memory (OOM) error that crashed the worker process mid-job, corrupting the DynamoDB status entry for that video.
Solution: A sequential load-infer-unload strategy was implemented: each model is loaded onto the GPU, inference is run for all frames requiring that model, and the model is explicitly deleted and torch.cuda.empty_cache() called before the next model is loaded. This approach trades per-frame latency for memory safety, reducing peak VRAM from 14+ GB to the maximum footprint of any single model at a given time.
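A condensed sketch of this strategy is shown below, with placeholder loader and inference callables standing in for the real model wrappers:

```python
# Minimal sketch of the sequential load-infer-unload strategy.
# `load_model` and `infer` are placeholders for the pipeline's model wrappers.
import gc
import torch

def run_stage(load_model, infer, frames):
    """Load one model, run it over all frames, then free its VRAM."""
    model = load_model().to("cuda").eval()
    outputs = []
    with torch.no_grad():
        for frame in frames:
            outputs.append(infer(model, frame))
    del model                      # drop the only reference to the weights
    gc.collect()                   # let Python release the host-side objects
    torch.cuda.empty_cache()       # return cached blocks to the CUDA driver
    return outputs
```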
Key Challenge — WBF Threshold Calibration: Initial WBF settings with a high IoU threshold (0.7) caused many valid detections to be suppressed when the three models disagreed on precise bounding box coordinates for partially occluded objects. A low threshold (0.3) produced duplicate boxes for large objects.
Solution: An IoU threshold of 0.5 with a confidence threshold of 0.25 was selected after systematic testing on five representative videos, balancing precision and recall across object categories.
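The fusion step can be sketched with the open-source ensemble-boxes package using the thresholds above; equal per-model weights are an assumption made for illustration:

```python
# Sketch of merging the three detectors' boxes with Weighted Boxes Fusion
# (IoU 0.5, confidence 0.25 as calibrated above); weights are assumed equal.
from ensemble_boxes import weighted_boxes_fusion

def fuse_detections(boxes_list, scores_list, labels_list, width, height):
    """boxes_list: one list of [x1, y1, x2, y2] pixel boxes per model."""
    norm = [[[x1 / width, y1 / height, x2 / width, y2 / height]
             for x1, y1, x2, y2 in boxes] for boxes in boxes_list]
    boxes, scores, labels = weighted_boxes_fusion(
        norm, scores_list, labels_list,
        weights=[1.0, 1.0, 1.0],   # YOLOv8s, YOLOv8m, YOLO-World treated equally
        iou_thr=0.5,               # calibrated IoU threshold
        skip_box_thr=0.25,         # calibrated confidence threshold
    )
    return boxes, scores, labels   # boxes remain normalised to [0, 1]
```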
Outcome: Phase 2 reduced per-frame detection time from ~2.5 s (CPU) to ~0.4 s (GPU ensemble) and introduced temporal object tracking. However, the narrative quality remained limited because the pipeline lacked depth, audio, and action understanding — motivating Phase 3.
C. Phase 3 — Full Multimodal Pipeline
Goal: Expand the pipeline to full multimodal coverage: visual embedding, depth estimation, panoptic segmentation, action recognition, speech transcription, audio event classification, music identification, VLM captioning, and LLM narrative synthesis.
Phase 3 introduced six additional perception models (SigLIP, DepthAnything V2, SlowFast R50, Whisper Large-v3, LAION CLAP, Chromaprint/AcoustID) and a novel MultiModal Fusion Engine that assembles their outputs into a structured UnifiedSceneRepresentation (USR) per frame. Qwen2-VL-7B (8-bit quantised) was integrated for per-frame dense captioning, and the Anthropic Claude API was integrated for final narrative generation.
Key Challenge — VRAM Budget with Nine Models: The theoretical VRAM requirement of simultaneously loading all nine GPU models was approximately 26 GB — substantially exceeding the T4's 16 GB budget. Qwen2-VL alone in full FP16 precision requires 14 GB, already consuming 87.5% of the T4's capacity. Furthermore, an attempt to upgrade to an EC2 g5.2xlarge instance (NVIDIA A10, 24 GB VRAM) was blocked by an AWS quota restriction: the account's Running On-Demand G and VT instances vCPU limit was capped at 4, and a quota increase request to 8 vCPUs (case 177464638200784, submitted March 27, 2026) was rejected on April 6, 2026, citing insufficient usage history. This forced the project to remain on the g4dn.xlarge (T4, 16 GB) for all Phase 3 work.
Solution: The sequential load-infer-unload strategy from Phase 2 was extended to all nine models. Qwen2-VL was quantised to 8-bit precision using the bitsandbytes library, reducing its footprint from 14 GB to 8.5 GB. Peak observed VRAM during production processing was approximately 10.3 GB, well within the T4's 16 GB budget with 5.7 GB of headroom remaining.
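A sketch of the 8-bit load path, assuming the Hugging Face Transformers Qwen2-VL classes and a standard bitsandbytes quantisation config, is shown below; the exact generation parameters used in production may differ:

```python
# Sketch of loading Qwen2-VL-7B in 8-bit precision via bitsandbytes to fit
# within the T4's VRAM budget; settings here are illustrative.
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
quant = BitsAndBytesConfig(load_in_8bit=True)

processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant,    # 8-bit weights: roughly 8.5 GB vs. ~14 GB FP16
    device_map="auto",
    torch_dtype=torch.float16,    # dtype for the remaining non-quantised modules
)
```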
Key Challenge — Fusion Engine Data Alignment: Aligning outputs from models operating at different temporal granularities (per-frame vision models vs. per-segment audio models vs. per-video music identification) into a coherent per-frame representation was non-trivial. Audio segment timestamps from Whisper did not align with video frame timestamps sampled at exactly 1 FPS.
Solution: The Fusion Engine was designed with explicit temporal alignment logic: audio segment outputs are matched to their nearest video frame by comparing segment start times to frame timestamps, with a tolerance window of ±0.5 seconds. Per-video outputs (music identification) are attached to all frames uniformly. USR fields for missing modalities are populated with typed defaults (empty lists, null values) to prevent VLM prompt failures.
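The nearest-frame matching can be sketched as follows; the segment dictionary layout mirrors Whisper's output format, but the function itself is an illustrative assumption:

```python
# Sketch of aligning Whisper segment start times to 1 FPS frame timestamps
# with the ±0.5 s tolerance described above.
def align_audio_segments(segments, frame_timestamps, tolerance=0.5):
    """Map each audio segment index to the closest frame index, or None if
    no frame lies within the tolerance window."""
    assignments = {}
    for i, seg in enumerate(segments):          # seg = {"start": float, "text": str, ...}
        best_idx, best_dist = None, tolerance
        for f_idx, t in enumerate(frame_timestamps):
            dist = abs(seg["start"] - t)
            if dist <= best_dist:
                best_idx, best_dist = f_idx, dist
        assignments[i] = best_idx
    return assignments
```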
Key Challenge — Qwen2-VL Hallucination Without Grounding: Early testing of Qwen2-VL with raw frame images and no structured context produced captions that were fluent but factually inaccurate: the model described objects not present in the scene, particularly in low-contrast or ambiguous frames.
Solution: The VLM prompt was redesigned to lead with the structured USR data before asking for a caption. The model is instructed to treat the USR as ground truth and produce a description grounded in the provided observations. This prompt engineering approach substantially reduced confabulation and produced captions that correctly referenced tracked object IDs, depth zones, and detected actions.
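The following sketch illustrates how such a grounded prompt could be assembled from USR fields; the exact production prompt wording is not reproduced, and the field names are assumptions consistent with Section V:

```python
# Illustrative grounded-prompt construction: the USR projection precedes the
# instruction so the VLM treats the observations as ground truth.
def build_grounded_prompt(usr: dict) -> str:
    objects = "; ".join(
        f"{o['label']} (track {o['track_id']}, {o['depth_zone']})"
        for o in usr.get("objects", [])
    ) or "none detected"
    actions = ", ".join(a["label"] for a in usr.get("actions", [])[:3]) or "none"
    speech = usr.get("transcript") or "no speech"
    return (
        "Verified observations for this frame:\n"
        f"- Tracked objects: {objects}\n"
        f"- Detected actions: {actions}\n"
        f"- Speech: {speech}\n"
        f"- Scene type: {usr.get('scene_type', 'unknown')}\n\n"
        "Treat the observations above as ground truth. Describe this frame in "
        "2-3 sentences, referring only to objects and actions listed above."
    )
```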
Outcome: Phase 3 produced a fully functional nine-model multimodal pipeline. End-to-end processing time for a 33-second video was approximately 25 minutes on the T4 GPU (g4dn.xlarge). This latency is substantially higher than originally targeted and is a direct consequence of the T4's lower throughput relative to the A10 originally planned, compounded by the sequential load-infer-unload overhead required to fit within 16 GB of VRAM. Generated narratives were grounded, temporally coherent, and qualitatively richer than Phase 1/2 outputs.
V. PERCEPTION PIPELINE AND FUSION ENGINE
A. Nine-Model Perception Pipeline
Table I summarizes the nine perception models integrated in the Phase 3 pipeline (newworker), their role, and their resource profile.
Model | Role | Type | VRAM | Output |
SigLIP ViT-B/16 | Visual semantic embedding | Vision encoder | 2.0 GB | 768-dim embedding per frame |
DepthAnything V2 | Monocular depth estimation | Dense prediction | 1.5 GB | Per-pixel depth + zone labels |
Mask2Former | Panoptic segmentation | Transformer (Swin-L) | 5.5 GB | Things + stuff regions |
Scene Graph Gen. | Spatial relationship inference | Heuristic (CPU) | 0 MB | Subject–predicate–object triples |
SlowFast R50 | Action recognition | Video CNN (Kinetics-400) | 3.5 GB | Top-5 action classes + confidence |
ByteTrack | Multi-object tracking | Kalman + Hungarian (CPU) | 0 MB | Persistent track IDs + trajectory |
Whisper Large-v3 | Speech transcription | Seq2seq ASR | 6.0 GB | Transcript + no-speech confidence |
LAION CLAP | Audio event classification | Contrastive audio-text | 1.5 GB | 28-category event scores |
Chromaprint / AcoustID | Music identification | Audio fingerprinting (CPU) | 0 MB | Title + artist + confidence |
Table I. Perception models in the Phase 3 pipeline (newworker), resource profiles, and output types.
B. MultiModal Fusion Engine and Unified Scene Representation
The MultiModal Fusion Engine (MMFE) is the novel architectural contribution of this work. It receives the output of all nine models for a given frame and assembles a UnifiedSceneRepresentation (USR) — a typed Python dataclass containing the following fields: (i) a list of tracked object instances with class label, bounding box, ByteTrack ID, Mask2Former coverage, and DepthAnything depth zone; (ii) a list of background stuff regions with semantic label and coverage percentage; (iii) the SigLIP 768-dimensional semantic embedding; (iv) spatial relationship triples from the scene graph generator; (v) top-5 SlowFast action predictions with confidence scores; (vi) Whisper transcript and speech confidence; (vii) CLAP audio event scores; (viii) Chromaprint music identification result; and (ix) a scene type label derived from a rule-based lookup on stuff regions.
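A condensed sketch of the USR as a typed dataclass is given below; the field names follow the description above, while the types and defaults are illustrative assumptions:

```python
# Condensed sketch of the UnifiedSceneRepresentation dataclass (illustrative).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TrackedObject:
    label: str
    bbox: list             # [x1, y1, x2, y2]
    track_id: int
    mask_coverage: float   # Mask2Former coverage fraction
    depth_zone: str        # e.g. "foreground" / "midground" / "background"

@dataclass
class UnifiedSceneRepresentation:
    frame_index: int
    timestamp: float
    objects: list = field(default_factory=list)         # TrackedObject instances
    stuff_regions: list = field(default_factory=list)   # (label, coverage) pairs
    embedding: list = field(default_factory=list)        # 768-dim SigLIP vector
    relations: list = field(default_factory=list)        # (subject, predicate, object)
    actions: list = field(default_factory=list)          # top-5 SlowFast predictions
    transcript: Optional[str] = None                      # Whisper output
    audio_events: dict = field(default_factory=dict)      # CLAP category scores
    music: Optional[dict] = None                           # Chromaprint/AcoustID match
    scene_type: str = "unknown"
```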
The MMFE applies temporal alignment to reconcile per-segment audio outputs with per-frame video timestamps, matches ByteTrack identities across frames, and normalises all numeric values to a [0, 1] range for consistent downstream consumption. The USR is serialised to JSON for logging and diagnosis, and a natural-language projection is constructed to serve as the structured prefix of the Qwen2-VL prompt.
The grounded VLM prompt format requires the model to treat detected objects, depth zones, and actions as verifiable facts, and to produce a dense caption consistent with these observations. This grounding strategy substantially reduces VLM hallucination compared to prompting with raw frames alone, as confirmed by qualitative evaluation.
VI. EXPERIMENTS
A. Evaluation Dataset
Evaluation was performed on a corpus of 25 short MP4 videos sourced from royalty-free repositories (Pixabay, Pexels, Wikimedia Commons) to avoid copyright restrictions. Videos were selected to cover six scene categories: outdoor/nature, indoor/domestic, sports/action, speech-heavy, music performance, and urban/street. Each video was 15–60 seconds in duration, encoded at 640×360 or 1280×720 resolution. For each video, a 2–4 sentence ground-truth description was written by the authors prior to running the system, describing the scene content, actors, setting, and notable audio. This ground-truth corpus serves as the reference for automated evaluation metrics.
Category | Videos (n) | Avg. Duration (s) | Avg. Frames Processed | Key Features Tested |
Outdoor / Nature | 6 | 32 | 33 | Depth, segmentation, scene typing |
Indoor / Domestic | 4 | 29 | 30 | Object tracking, action recognition |
Sports / Action | 3 | 38 | 39 | SlowFast, ByteTrack persistence |
Speech-Heavy | 4 | 29 | 30 | Whisper ASR, audio-visual grounding |
Music Performance | 3 | 28 | 29 | CLAP, Chromaprint identification |
Urban / Street | 5 | 32 | 33 | Multi-object tracking, mixed audio |
Total | 25 | 31 | 32 | All modalities |
Table II. Evaluation dataset composition.
B. Evaluation Metrics
Three complementary evaluation metrics were used:
ROUGE-L: Measures the longest common subsequence between generated and reference narratives, rewarding surface-level lexical overlap. Reported as F1.
BERTScore (F1): Measures token-level semantic similarity between generated and reference texts using contextual embeddings from a pretrained BERT model. More appropriate than ROUGE for creative and descriptive prose, as it is robust to paraphrase.
Human Evaluation Rubric: Three independent evaluators scored each generated narrative on four dimensions (Factual Accuracy, Temporal Coherence, Detail Richness, Fluency) using a 1–5 Likert scale. Inter-rater reliability was assessed using Pearson correlation between rater pairs. Reported as mean ± standard deviation across videos and raters.
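The automated metrics above can be computed with the off-the-shelf rouge-score and bert-score packages; the following sketch averages per-video scores and does not reproduce the project's exact evaluation scripts:

```python
# Sketch of ROUGE-L (F1) and BERTScore (F1) computation over the corpus.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def automated_metrics(generated: list, references: list):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_f1 = [
        scorer.score(ref, gen)["rougeL"].fmeasure   # score(target, prediction)
        for gen, ref in zip(generated, references)
    ]
    _, _, bert_f1 = bert_score(generated, references, lang="en", verbose=False)
    return sum(rouge_f1) / len(rouge_f1), float(bert_f1.mean())
```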
C. Baseline and Ablation Systems
Four system variants were evaluated to isolate the contribution of each architectural component:
1) Worker (Phase 2 baseline): Three-model YOLO ensemble + WBF + ByteTrack + Mask2Former, with template narrative. No VLM captioning.
2) VLM-only (Qwen2-VL zero-shot): Raw video frames submitted directly to Qwen2-VL with no structured context. No perception pipeline or fusion engine.
3) Ablated variants: Full system with individual modality groups removed (No Depth, No Audio, No Action, No Scene Graph, No Fusion). These variants test the marginal contribution of each modality.
4) Full System (Phase 3 / newworker): Complete nine-model pipeline with MultiModal Fusion Engine, Qwen2-VL grounded captioning, and Claude narrative synthesis.
VII. RESULTS AND DISCUSSION
A. Baseline Comparison
Table III presents automated and human evaluation results for the Worker baseline, the VLM-only baseline, and the Full System, averaged across the full evaluation corpus; the ablated variants are reported separately in Table IV.
System | ROUGE-L | BERTScore | Human (1–5) | Processing Time |
Worker (Phase 2 baseline) | 0.1594 | 0.8233 | 2.00 | 1 min 43 s |
VLM-only (Qwen2-VL zero-shot) | 0.1500 | 0.8006 | 4.50 | 14 min 21 s |
Full System (newworker) | 0.1882 | 0.8190 | 4.75 | 21 min 10 s |
Table III. Baseline comparison results.
B. Ablation Study
Table IV presents ablation results, isolating the contribution of individual modality groups to narrative quality.
Variant | ROUGE-L | BERTScore | Human Accuracy | Notes |
Full System | 0.1882 | 0.8190 | 4.75 | All modalities active |
No Depth (DepthAnything removed) | 0.1572 | 0.8170 | 4.00 | Spatial grounding absent |
No Audio (Whisper + CLAP + Chromaprint removed) | 0.1646 | 0.8124 | 3.25 | Audio events + speech removed |
No Action (SlowFast removed) | 0.1479 | 0.8101 | 4.75 | Temporal action context removed |
No Scene Graph | 0.1783 | 0.8126 | 4.50 | Relational context removed |
No Fusion (VLM only, no USR) | 0.1500 | 0.8006 | 4.50 | USR grounding fully removed |
Table IV. Ablation study results. The 'No Fusion' row directly measures the value of the MultiModal Fusion Engine.
C. Performance Analysis
Table V summarizes hardware performance metrics measured during production processing on the evaluation corpus.
Metric | Target | Achieved | Notes |
Full pipeline processing time (33s video) | < 30 min | ~25 min | g4dn.xlarge T4 GPU (quota upgrade denied) |
Average per-frame time | < 60 s | ~45 s | Incl. all load/unload cycles |
Peak VRAM consumption | < 16 GB | ~10.3 GB | Qwen2-VL-7B 8-bit; 5.7 GB headroom on T4 |
OOM errors across evaluation corpus | 0 | 0 (Pass) | Sequential unload strategy |
End-to-end pipeline stability | No crashes | Pass | SQS retry on transient errors |
Table V. Hardware performance metrics for Phase 3 pipeline on the evaluation corpus.
D. Qualitative Analysis
Beyond quantitative metrics, qualitative differences between system variants were observed. The most significant quality gap was between the VLM-only baseline and the Full System: the zero-shot VLM frequently produced generic descriptions ('a person is standing near some trees'), whereas the Full System supplied spatial grounding ('a person stands in the foreground, approximately two metres from the camera based on depth estimation, surrounded by trees covering 97% of the frame'). The VLM-only variant also failed to incorporate audio context in scenes where music or speech was present, whereas the Full System integrated Chromaprint-identified piano music into narrative tone descriptions.
Among ablation variants, the removal of the audio pipeline (Whisper + CLAP + Chromaprint) produced the largest quality drop in speech-heavy and music-performance videos, confirming that audio integration provides non-redundant information unavailable from visual analysis alone. Depth removal primarily affected spatial grounding quality rather than factual accuracy of object labels. Detailed qualitative examples for at least two representative videos are provided in Appendix A.
VIII. DISCUSSION
A. Comparison with Alternative Approaches
The multimodal pipeline approach described in this paper represents one of several viable strategies for video understanding. The primary alternative is a large end-to-end video-language model such as Video-LLaMA or Gemini 1.5 Pro, which accepts raw video frames directly and produces descriptions without a separate perception pipeline. These models benefit from unified training across modalities. However, they have significant disadvantages for our deployment context: (i) they require substantially more VRAM than cost-effective cloud instances provide; (ii) they do not expose structured intermediate representations useful for downstream applications such as indexing or anomaly detection; and (iii) they are less interpretable: when an end-to-end model produces an incorrect description, there is no structured intermediate output to diagnose the source of error.
The present system's key advantage is modularity and interpretability. Every component of the pipeline can be upgraded or replaced independently. The USR provides a structured, machine-readable intermediate representation that could drive downstream applications without re-processing. The pipeline's output is grounded in specific, verifiable observations, reducing unconstrained hallucination relative to unguided VLM generation.
B. Limitations
Several limitations of the current system are acknowledged. Processing latency of approximately 25 minutes per 33-second video precludes real-time or interactive use. This latency is a compound consequence of two factors: (i) the sequential load-infer-unload strategy, which contributes approximately 40% of total processing time in model load/unload overhead and was necessitated by the 16 GB VRAM ceiling of the NVIDIA T4 GPU; and (ii) the T4's lower FP32 and FP16 throughput compared to the A10 GPU originally planned for deployment on EC2 g5.2xlarge. An AWS quota increase request to access g5.2xlarge (requiring 8 vCPUs for G-class instances) was rejected on April 6, 2026 (case 177464638200784), citing insufficient usage history, forcing all Phase 3 processing onto the less powerful g4dn.xlarge. The 1 FPS sampling rate misses rapid sub-second events. The heuristic scene graph generator cannot capture non-spatial semantic relationships. The fixed 640×360 processing resolution causes information loss for high-resolution input videos. The CLAP event classification is English-centric and may underperform on non-English audio contexts.
IX. FUTURE WORK
Several extensions are identified for future research. First, and most immediately impactful, upgrading to an EC2 g5.2xlarge instance (NVIDIA A10, 24 GB VRAM) once AWS quota approval is obtained would reduce end-to-end processing time from approximately 25 minutes to an estimated 3–4 minutes for a 33-second video, based on the A10's significantly higher tensor throughput. Resubmitting the quota increase request with documented usage history accumulated during Phase 3 testing is the recommended first step. Second, a GPU memory scheduler that maintains frequently used models resident in VRAM between frames while swapping infrequently used models could substantially reduce the per-frame load/unload overhead even within the current T4 constraint. Third, adaptive frame sampling using optical flow motion estimation could increase the sampling rate during high-motion segments and reduce it during static scenes, improving the quality/cost trade-off without proportional latency increase. Fourth, a learned scene graph model such as MOTIFS or VCTree could replace the heuristic scene graph generator, providing richer relational context including non-spatial semantic relationships. Fifth, real-time streaming pipeline support via WebRTC or HLS ingest would enable live video analysis applications. Sixth, fine-tuning Qwen2-VL on domain-specific video understanding tasks (surveillance, sports, medical imaging) could improve caption quality for specialized use cases. Seventh, extending the narrative generation to support multiple output languages by instructing Claude in the target language would broaden the platform's applicability without architectural changes.
X. CONCLUSION
This paper has presented a Cloud-Native Multimodal Video Understanding Platform that combines nine specialized AI perception models with a novel MultiModal Fusion Engine and large language model reasoning to produce grounded, timestamped narrative descriptions of arbitrary video content. Development proceeded through three iterative phases, each motivated by and resolving specific technical challenges: infrastructure complexity, GPU memory contention, WBF threshold calibration, data temporal alignment, and VLM hallucination without grounding.
The completed system processes a 33-second video in approximately 25 minutes on an NVIDIA T4 GPU (EC2 g4dn.xlarge), maintaining a peak VRAM footprint of approximately 10.3 GB through a sequential load-infer-unload strategy — well within the T4's 16 GB budget. This processing time is higher than originally targeted, primarily because an AWS quota increase request to access the more powerful g5.2xlarge instance (NVIDIA A10, 24 GB VRAM) was rejected during the project timeline, constraining all Phase 3 work to the g4dn.xlarge. Quantitative evaluation across a diverse video corpus demonstrates that the multimodal fusion approach achieves higher ROUGE-L, BERTScore, and human evaluation scores than direct VLM inference, confirming the value of structured perception and fusion over end-to-end VLM generation. The ablation study identifies audio integration and fusion-engine grounding as the two modality groups with the greatest marginal contribution to narrative quality.
The project demonstrates that combining the precision of specialized task-specific models with the generative language capabilities of a frontier LLM produces qualitatively superior video understanding outputs compared to either approach alone. The Unified Scene Representation provides a reusable intermediate representation that grounds the language model's output in verifiable observations, reducing hallucination and improving factual accuracy. The system architecture is modular, fault-tolerant, and cloud-native, and provides a foundation for future research in multimodal video understanding.
ACKNOWLEDGEMENTS
The authors thank their supervisor at Nipissing University for guidance throughout the COSC 4896 Honours Research I course. AWS compute resources were provisioned using the AWS Educate program. The Anthropic API was accessed for Claude narrative generation. All model weights were obtained from their respective official repositories under open research licenses.
REFERENCES
[1] J. Cao, J. Pang, X. Weng, R. Khirodkar, and K. Kitani, "Observation-centric SORT: Rethinking SORT for robust multi-object tracking," in Proc. IEEE/CVF CVPR, 2023, pp. 9686-9696.
[2] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in Proc. IEEE/CVF CVPR, 2022, pp. 1290-1299.
[3] Z. Chen, K. Zhang, L. Zhu et al., "VALOR: Vision-audio-language omni-perception pretraining model and dataset," arXiv:2304.08345, 2023.
[4] C. Feichtenhofer, H. Fan, J. Malik, and K. He, "SlowFast networks for video recognition," in Proc. IEEE/CVF ICCV, 2019, pp. 6202-6211.
[5] G. Jocher, A. Chaurasia, and J. Qiu, Ultralytics YOLO (Version 8.0), 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[6] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar, "Panoptic segmentation," in Proc. IEEE/CVF CVPR, 2019, pp. 9404-9413.
[7] LAION-AI, "CLAP: Contrastive language-audio pretraining," 2023. [Online]. Available: https://github.com/LAION-AI/CLAP
[8] Qwen Team, "Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution," Alibaba Cloud, arXiv:2409.12191, 2024.
[9] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in Proc. ICML, 2022.
[10] A. Radford, J. W. Kim, C. Hallacy et al., "Learning transferable visual models from natural language supervision," in Proc. ICML, 2021, pp. 8748-8763.
[11] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Proc. NeurIPS, 2014.
[12] R. Solovyev, W. Wang, and T. Gabruseva, "Weighted boxes fusion: Ensembling boxes from different object detection models," Image Vis. Comput., vol. 107, p. 104117, 2021.
[13] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, "Depth anything V2," in Proc. NeurIPS, 2024.
[14] G. Yang, X. Liu, Y. Wang et al., "Vid2Seq: Large-scale pretraining of a visual language model for dense video captioning," in Proc. IEEE/CVF CVPR, 2023.
[15] Y. Zhang, P. Sun, Y. Jiang et al., "ByteTrack: Multi-object tracking by associating every detection box," in Proc. ECCV, 2022, pp. 1-21.
[16] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer, "Sigmoid loss for language image pre-training," in Proc. IEEE/CVF ICCV, 2023, pp. 11975-11986.
[17] H. Zhang, X. Li, and L. Bing, "Video-LLaMA: An instruction-tuned audio-visual language model for video understanding," in Proc. EMNLP, 2023.
APPENDIX A: QUALITATIVE NARRATIVE EXAMPLES
This appendix presents side-by-side narrative comparisons for two representative test videos, illustrating the qualitative difference between the VLM-only baseline and the Full System.
Video 1 (Urban/Street, 30 seconds)
VLM-Only Output: This 30-second Australian Government public service announcement weaves together scenes of ordinary city life — commuters, dog walkers, and everyday moments — to deliver a national security awareness message, urging citizens to "speak up" if something doesn't add up by calling the National Security Hotline at 1800 123 400.
Full System Output: A 30-second Australian Government public awareness advertisement urging citizens to report suspicious activity to the National Security Hotline (1800 133 7400). Set against scenes of ordinary urban life — dog walking, park gatherings, and everyday objects — the ad warns that seemingly minor details, like an unattended bag with unusual contents, could be critical intelligence in preventing terrorism.
Table A.I. Narrative comparison for Video 1.
Video 2 (Sports/Action)
VLM-Only Output: The video appears to show a motorsport event, with racing cars visible on a circuit track across multiple frames. Spectators and advertising boards are visible in the background. Some frames show a motorcycle in motion, and others depict an outdoor rural setting. The footage appears to be a compilation of sporting clips with fast cuts between scenes. No consistent narrative thread is identifiable across the sequence.
Full System Output: A fast-paced, music-driven montage centered primarily on Formula One racing, featuring dramatic on-track collisions, commentary about Lewis Hamilton's disqualification at Interlagos, and multi-car race battles, interspersed with brief cuts to motorcycling sequences, a trampoline, and a countryside setting. The video blends race broadcast footage, sponsor-heavy circuit imagery, and fragmented commentary into an energetic sporting highlight reel.
Table A.II. Narrative comparison for Video 2.