LongVideoAgent Docs¶
LongVideoAgent is a multi-agent framework for reasoning over long videos. A master LLM iteratively coordinates a grounding agent and a vision agent to localize relevant clips, inspect visual evidence, and answer long-video QA questions with interpretable multi-step traces.
Project Page | Paper | Dataset: LongTVQA | Dataset: LongTVQA+
Overview¶
Recent long-video QA systems often rely on lossy summarization or limited tool use, which weakens temporal grounding and misses fine-grained cues. LongVideoAgent instead uses a bounded multi-agent reasoning loop:
- The MasterAgent plans the next step and decides when enough evidence has been collected.
- The GroundingAgent localizes question-relevant video clips from subtitles.
- The VisionAgent inspects sampled frames from the localized clips and returns targeted visual observations.
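The division of labor above can be illustrated with a minimal Python sketch. The class and function names (`Clip`, `localize`, `inspect`), the stop-word filter, and the keyword-overlap matching are all illustrative stand-ins, not the repository's actual API; the real agents are LLM-backed.

```python
import re
from dataclasses import dataclass

@dataclass
class Clip:
    start: float  # clip start, in seconds
    end: float    # clip end, in seconds
    text: str     # subtitle text covered by the clip

STOP = {"what", "is", "the", "a", "an", "of", "in"}

def words(s: str) -> set[str]:
    """Lowercased content words of a string (toy tokenizer)."""
    return set(re.findall(r"[a-z]+", s.lower())) - STOP

def localize(question: str, subtitles: list[Clip]) -> list[Clip]:
    """GroundingAgent role: keep clips whose subtitles share content words
    with the question (stand-in for the real subtitle-based grounder)."""
    q = words(question)
    return [c for c in subtitles if q & words(c.text)]

def inspect(clip: Clip, query: str) -> str:
    """VisionAgent role: stand-in for frame sampling + visual QA."""
    return f"[{clip.start:.0f}-{clip.end:.0f}s] observation re: {query}"

subs = [Clip(0, 10, "they enter the kitchen"),
        Clip(10, 20, "a red car drives away")]
hits = localize("What color is the car?", subs)
print([c.text for c in hits])  # → ['a red car drives away']
```

In the actual framework, the master agent sits on top of these two roles and decides which one to call at each step.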
This repository documents the codebase for:
- Running quickstart GRPO training for the master agent.
- Building offline grounding caches and converting datasets into training-ready formats.
- Running unified local/API evaluation on LongTVQA and LongTVQA+.
Documentation Guide¶
- Start with Installation to set up the environment.
- Use Quickstart for the shortest end-to-end training path.
- See Evaluation for local and API-based evaluation scripts.
- Use the training pages for GRPO config details, offline grounding cache generation, and LoRA adapter merging.
Method Summary¶
LongVideoAgent operates in a bounded loop with up to K reasoning rounds. At each step, the master agent emits a structured action:
- `<request_grounding>` to search for relevant clips.
- `<visual_query>` to inspect selected clips with the vision agent.
- `<answer>` to terminate and return the final option.
The master agent is optimized with reinforcement learning so that trajectories remain structurally valid, concise, and correct.
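The bounded loop can be sketched as follows. The action tag names match the docs above, but `parse_action`, `run_episode`, and the toy master/grounding/vision callables are hypothetical illustrations, not the repository's implementation.

```python
import re

K = 5  # maximum reasoning rounds (illustrative budget)

def parse_action(step: str):
    """Extract the first structured action tag from the master's output."""
    m = re.match(r"<(request_grounding|visual_query|answer)>(.*)", step, re.S)
    return (m.group(1), m.group(2).strip()) if m else (None, step)

def run_episode(master, ground, see, question):
    """Run up to K rounds: the master emits one action per round."""
    context = [question]
    for _ in range(K):
        tag, payload = parse_action(master(context))
        if tag == "answer":
            return payload                   # terminate with the final option
        elif tag == "request_grounding":
            context.append(ground(payload))  # clips localized from subtitles
        elif tag == "visual_query":
            context.append(see(payload))     # frame-level visual observation
    return None  # budget exhausted without an answer

# Toy master that grounds once, then answers.
steps = iter(["<request_grounding>car scene", "<answer>B"])
ans = run_episode(lambda ctx: next(steps),
                  lambda q: "clip 00:10-00:20",
                  lambda q: "a red car",
                  "What color is the car?")
print(ans)  # → B
```

The RL objective then rewards trajectories like this one for emitting well-formed actions, staying within the round budget, and ending on the correct option.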