LongVideoAgent Docs

LongVideoAgent is a multi-agent framework for reasoning over long videos. A master LLM iteratively coordinates a grounding agent and a vision agent to localize relevant clips, inspect visual evidence, and answer long-video QA questions with interpretable multi-step traces.

Project Page | Paper | Dataset: LongTVQA | Dataset: LongTVQA+

Overview

Recent long-video QA systems often rely on lossy summarization or limited tool use, both of which weaken temporal grounding and miss fine-grained cues. LongVideoAgent instead uses a bounded multi-agent reasoning loop with three roles:

The MasterAgent plans the next step and decides when enough evidence has been collected.
The GroundingAgent localizes question-relevant video clips from subtitles.
The VisionAgent inspects sampled frames from the localized clips and returns targeted visual observations.
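
The division of labor above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the repository's implementation: the `Step` and `Trace` types, the `run_loop` function, the callable signatures, and the default `max_rounds` value are all hypothetical names chosen for this example, and the toy agents stand in for real LLM and vision calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


# Hypothetical types for illustration; the real codebase may differ.
@dataclass
class Step:
    kind: str     # "grounding", "vision", or "answer"
    payload: str  # query text, or the final option when kind == "answer"


@dataclass
class Trace:
    # Each entry is (kind, payload, observation returned by the sub-agent).
    steps: List[Tuple[str, str, str]] = field(default_factory=list)
    answer: Optional[str] = None


def run_loop(
    master: Callable[[str, Trace], Step],
    grounder: Callable[[str], str],
    vision: Callable[[str], str],
    question: str,
    max_rounds: int = 5,  # assumed budget; stands in for the bound on rounds
) -> Trace:
    """Let the master pick one step per round until it answers or the budget ends."""
    trace = Trace()
    for _ in range(max_rounds):
        step = master(question, trace)
        if step.kind == "answer":
            trace.answer = step.payload
            break
        if step.kind == "grounding":
            observation = grounder(step.payload)
        elif step.kind == "vision":
            observation = vision(step.payload)
        else:
            observation = "invalid action"
        trace.steps.append((step.kind, step.payload, observation))
    return trace


# Toy stand-ins for the three agents (no models involved).
def toy_master(question: str, trace: Trace) -> Step:
    if len(trace.steps) == 0:
        return Step("grounding", question)
    if len(trace.steps) == 1:
        return Step("vision", "What is on the table?")
    return Step("answer", "B")


def toy_grounder(query: str) -> str:
    return "clip_03"


def toy_vision(query: str) -> str:
    return "A red mug is on the table."


demo_trace = run_loop(toy_master, toy_grounder, toy_vision, "Where is the mug?")
```

Keeping the full trace alongside the answer is what makes the multi-step reasoning inspectable afterwards: each entry records which agent was called, with what query, and what it returned.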

This repository documents the codebase for:

Running quickstart GRPO training for the master agent.
Building offline grounding caches and converting datasets into training-ready formats.
Running unified local/API evaluation on LongTVQA and LongTVQA+.

Documentation Guide

Start with Installation to set up the environment.
Use Quickstart for the shortest end-to-end training path.
See Evaluation for local and API-based evaluation scripts.
Use the training pages for GRPO config details, offline grounding cache generation, and LoRA adapter merging.

Method Summary

LongVideoAgent operates in a bounded loop with up to K reasoning rounds. In each round, the master agent emits one structured action:

<request_grounding> to search for relevant clips.
<visual_query> to inspect selected clips with the vision agent.
<answer> to terminate and return the final option.
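
As a rough illustration of how such tagged actions might be extracted from the master agent's raw text output, the sketch below uses a regular expression. It assumes each action is wrapped in a matching closing tag (for example `<answer>B</answer>`) and that a well-formed output contains exactly one action; neither assumption is confirmed by this page, and `parse_action` is a hypothetical helper rather than a function from the codebase.

```python
import re
from typing import Optional, Tuple

# Matches <tag>body</tag> for the three action tags; \1 forces the closing
# tag to match the opening one. The closing-tag format is an assumption.
_ACTION_RE = re.compile(
    r"<(request_grounding|visual_query|answer)>(.*?)</\1>",
    re.DOTALL,
)


def parse_action(text: str) -> Optional[Tuple[str, str]]:
    """Return (tag, body) if the text holds exactly one action, else None."""
    matches = _ACTION_RE.findall(text)
    if len(matches) != 1:
        return None  # no action, or an ambiguous output with several actions
    tag, body = matches[0]
    return tag, body.strip()
```

For instance, `parse_action("I need a closer look. <visual_query>What is he holding?</visual_query>")` yields the `visual_query` tag with the question as its body, while free text without any tag yields `None`.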

The master agent is optimized with reinforcement learning so that trajectories remain structurally valid, concise, and correct.
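
To make the three training goals concrete, here is a toy trajectory score that rewards valid structure, termination within the round budget, and a correct final answer. This is only a sketch of the general idea: the function `toy_reward`, its `format_weight` and `max_rounds` parameters, and the way the terms are combined are assumptions made for illustration and should not be read as the reward used in the actual GRPO training.

```python
from typing import List, Optional

VALID_ACTIONS = {"request_grounding", "visual_query", "answer"}


def toy_reward(
    actions: List[Optional[str]],  # one parsed tag per round; None if unparseable
    predicted: Optional[str],      # final option emitted by the master agent
    gold: str,                     # ground-truth option
    max_rounds: int = 5,           # assumed round budget
    format_weight: float = 0.1,    # assumed weight of the structural term
) -> float:
    """Hypothetical score: small structural bonus plus 1.0 for a correct answer."""
    if not actions:
        return 0.0
    # Structural validity: fraction of rounds that produced a known action.
    valid_fraction = sum(a in VALID_ACTIONS for a in actions) / len(actions)
    # Conciseness: the trajectory must end with an answer within the budget.
    terminated = actions[-1] == "answer" and len(actions) <= max_rounds
    # Correctness only counts for trajectories that terminated properly.
    correct = 1.0 if terminated and predicted == gold else 0.0
    return format_weight * valid_fraction + correct
```

Under these assumptions, a trajectory that never emits `<answer>` can still earn the small structural term but never the correctness term, which pushes the policy toward finishing within the budget.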