We evaluate LongVideoAgent on LongTVQA and LongTVQA+, two episode-level long-video question-answering benchmarks.
Main Results
Performance on LongTVQA and LongTVQA+. The left block lists model attributes (Agentic, Input, RL fine-tune); the right block reports validation accuracy (%). GPT-4o and Gemini-2.5 Pro are multimodal baselines that directly ingest the full long video. Methods labeled Agentic use the model as the MasterAgent; methods labeled AgenticRL additionally apply RL fine-tuning. Parenthesized green numbers denote absolute gains over the immediately preceding (non-agentic or non-RL) setting. We observe that: (i) our multi-agent framework, LongVideoAgent, consistently outperforms its non-agentic counterparts; (ii) agentic RL yields additional gains, especially for smaller open-source models; (iii) frame inputs provide visual evidence beyond subtitles and generally outperform subtitle-only inputs; (iv) closed-source models remain strong, but the gap narrows substantially once open-source models adopt agentic designs and agentic RL.
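To make the agentic setting concrete, the sketch below shows one way such a MasterAgent control loop could be organized: the backbone LLM iteratively picks a tool (e.g., a subtitle or frame retriever) or stops and answers. This is a minimal illustration under our own assumptions; every name in it (Action, master_agent_answer, plan, the "subtitles"/"frames" tool keys) is hypothetical, not LongVideoAgent's actual interface.

```python
# A minimal, hypothetical sketch of a MasterAgent control loop for
# long-video QA. Every name below (Action, master_agent_answer, plan,
# the "subtitles"/"frames" tool keys) is an illustrative assumption,
# not LongVideoAgent's actual interface.
from typing import Callable, NamedTuple

class Action(NamedTuple):
    tool: str      # which helper agent to call, or "answer" to stop
    argument: str  # the tool query, or the final answer text

def master_agent_answer(
    question: str,
    plan: Callable[[str, list], Action],     # the backbone LLM's planning call
    tools: dict[str, Callable[[str], str]],  # e.g. subtitle / frame retrievers
    max_steps: int = 8,                      # reasoning step limit K
) -> str:
    evidence: list = []
    for _ in range(max_steps):
        action = plan(question, evidence)
        if action.tool == "answer":          # planner is confident enough to stop
            return action.argument
        observation = tools[action.tool](action.argument)
        evidence.append((action, observation))
    # Step budget exhausted; a real system would force the planner to
    # answer from the evidence gathered so far.
    return plan(question, evidence).argument

# Toy usage with stub tools and a trivial two-step planner.
if __name__ == "__main__":
    tools = {
        "subtitles": lambda q: "retrieved subtitle snippet",
        "frames": lambda q: "caption of retrieved frames",
    }
    def plan(question: str, evidence: list) -> Action:
        return Action("answer", "B") if evidence else Action("frames", question)
    print(master_agent_answer("Who left the party first?", plan, tools))
```

In a real system the planner call would be a prompted LLM and the tools would wrap the grounding and vision agents; the stubs above only demonstrate the control flow.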
Ablation Analysis
We conduct comprehensive ablation studies to validate our design choices. First, both the grounding and vision agents are essential: the full multi-agent system achieves the highest accuracy, and removing either agent degrades performance. Second, increasing the reasoning step limit \(K\) improves accuracy until it saturates, confirming the value of iterative planning. Finally, stronger vision backbones and larger temporal windows provide richer context, further boosting the agent's reasoning.
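As a reading aid, the sweep behind these ablations can be pictured as a small grid over the design knobs. The configuration fields and the candidate values of \(K\) in the sketch below are hypothetical placeholders, not the paper's exact settings.

```python
# Hypothetical ablation grid: toggle the grounding and vision agents and
# sweep the reasoning step limit K. Field names and the K values are
# illustrative assumptions, not the paper's exact settings.
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict

@dataclass(frozen=True)
class AblationConfig:
    use_grounding: bool  # include the grounding agent
    use_vision: bool     # include the vision agent
    max_steps: int       # reasoning step limit K

def run_ablations(
    evaluate: Callable[[AblationConfig], float],
) -> Dict[AblationConfig, float]:
    """`evaluate` maps a configuration to validation accuracy (%)."""
    results = {}
    for g, v, k in product((False, True), (False, True), (2, 4, 8, 16)):
        cfg = AblationConfig(use_grounding=g, use_vision=v, max_steps=k)
        results[cfg] = evaluate(cfg)
    return results
```

Under this framing, the full system corresponds to enabling both agents with \(K\) at or beyond its saturation point, and each ablation row disables exactly one knob.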