Evaluation¶
Local Agent Evaluation (TVQA+)¶
This section covers local evaluation for the full LongVideoAgent pipeline on TVQA+, using a local main model (vLLM) and API-based grounding/vision agents.
Script¶
src/evaluation/lvagent/tvqa_plus/evaluate_local.py
Usage¶
python src/evaluation/lvagent/tvqa_plus/evaluate_local.py \
--checkpoint_step 100 \
--max_turn 5 \
--gpu_memory_utilization 0.4 \
--model-dir-template "../videoagent-step-{checkpoint_step}" \
--questions-path "../Tvqa_data/tvqa_plus_val_sampled_300.json" \
--subs-path "../Tvqa_data/all_episodes_subtitles_by_clips.json" \
--base-frame-dir "../Tvqa_data/bbt_frames" \
--output-filename-template "./eval_videoagent-{checkpoint_step}.json" \
--grounding-model "grok-4-fast-reasoning" \
--vision-model "grok-4-fast-reasoning" \
--grounding-base-url "https://api2.aigcbest.top/v1" \
--vision-base-url "https://dashscope.aliyuncs.com/compatible-mode/v1"
Key Arguments¶
--model-dir-template: Main model path template. Must contain{checkpoint_step}.--questions-path: TVQA+ questions JSON path.--subs-path: subtitles JSON path.--base-frame-dir: frame root directory.--output-filename-template: output JSON template. Must contain{checkpoint_step}.--grounding-model: API model name used by the grounding agent.--vision-model: API model name used by the vision agent.--grounding-base-url: grounding API endpoint.--vision-base-url: vision API endpoint.--grounding-api-key: optional; if omitted, reads envqdd_api.--vision-api-key: optional; if omitted, reads envaliyun_api.
Notes¶
- This refactor keeps evaluation logic and prompts unchanged, and only improves configuration/CLI ergonomics.
- Input file and directory validation is performed at startup.
Local Agent Evaluation (TVQA / LongTVQA)¶
This section covers local evaluation for TVQA with the LongTVQA dataset format from Hugging Face.
Script¶
src/evaluation/lvagent/tvqa/evaluate_local.py
Usage¶
python src/evaluation/lvagent/tvqa/evaluate_local.py \
--checkpoint_step 100 \
--max_turn 5 \
--gpu_memory_utilization 0.4 \
--model-dir-template "../tvqa_qwen_3b-step-{checkpoint_step}" \
--questions-path "../Tvqa_data/LongTVQA_val.jsonl" \
--subs-path "../Tvqa_data/LongTVQA_subtitles_clip_level.jsonl" \
--base-frame-dir "../Tvqa_data/frames" \
--output-filename-template "./eval_tvqa-{checkpoint_step}.json" \
--grounding-model "grok-4-fast-reasoning" \
--vision-model "grok-4-fast-reasoning" \
--grounding-base-url "https://api2.aigcbest.top/v1" \
--vision-base-url "https://dashscope.aliyuncs.com/compatible-mode/v1"
Notes¶
- The evaluator keeps the original reasoning/prompt logic unchanged and only adds data-format compatibility.
- LongTVQA field mapping is handled automatically:
occur_clip -> vid_name,answer -> answer_idx. - Both JSON array and line-delimited JSONL inputs are supported for question files.
Dataset Download & Preparation (LongTVQA / LongTVQA+)¶
Use the one-shot script below to download both Hugging Face datasets and prepare evaluator-ready files:
bash scripts/download_and_prepare_longtvqa.sh
Default outputs are generated under ../Tvqa_data:
LongTVQA_val_normalized.jsonlLongTVQA_plus_val_normalized.jsonclip_bbox_mapping.json
Notes:
- Downloader uses
hfif available, otherwise falls back tohuggingface-cli. - You can override locations with env vars:
DATA_ROOT(raw downloaded dataset dir, default../Tvqa_data/hf_datasets)OUTPUT_DIR(prepared output dir, default../Tvqa_data)
Unified API Evaluation (TVQA / TVQA+)¶
This section covers a unified API evaluator that supports both datasets with one script.
Script¶
src/evaluation/lvagent/evaluate_api_unified.py
Usage¶
Run on TVQA:
python src/evaluation/lvagent/evaluate_api_unified.py \
--dataset tvqa \
--checkpoint_step api \
--max_turn 5 \
--threads 30
Run on TVQA+:
python src/evaluation/lvagent/evaluate_api_unified.py \
--dataset tvqa_plus \
--checkpoint_step api \
--max_turn 5 \
--threads 30
Key Arguments¶
--dataset:tvqaortvqa_plus(required).--questions-path: optional override for questions file path; dataset-specific default is used if omitted.--subs-path: optional override for subtitles file path; dataset-specific default is used if omitted.--base-frame-dir: optional override for frame root directory; dataset-specific default is used if omitted.--bbox-json-path: bbox JSON path (used fortvqa_plus; ignored bytvqa).--output-filename: summary output JSON filename.--detailed-output-filename: detailed conversation log JSON filename.--grounding-model,--vision-model,--main-model: model names for grounding/vision/main API calls.--grounding-base-url,--vision-base-url,--main-base-url: API endpoint base URLs.--grounding-api-key,--vision-api-key,--main-api-key: API keys (defaults:qdd_api,aliyun_api,qdd_apienv vars).
Notes¶
- The unified script keeps one evaluation pipeline and switches only dataset-specific input handling (question format, subtitle keying, frame/bbox access).
- Core multi-turn evaluation logic remains shared across TVQA and TVQA+.
Unified Local Evaluation (TVQA / TVQA+)¶
This section covers a unified local evaluator that supports both datasets with one script, using local vLLM as the master agent.
Script¶
src/evaluation/lvagent/evaluate_local_unified.py
Usage¶
Run on TVQA:
python src/evaluation/lvagent/evaluate_local_unified.py \
--dataset tvqa \
--llm-path "/path/to/your/local_llm" \
--max_turn 5 \
--gpu_memory_utilization 0.4
Full example (recommended for reproducible runs):
python src/evaluation/lvagent/evaluate_local_unified.py \
--dataset tvqa_plus \
--llm-path "/models/videoagent-step-100" \
--max_turn 5 \
--gpu_memory_utilization 0.4 \
--questions-path "../Tvqa_data/tvqa_plus_val.json" \
--subs-path "../Tvqa_data/all_episodes_subtitles_by_clips.json" \
--base-frame-dir "../Tvqa_data/bbt_frames" \
--bbox-json-path "../Tvqa_data/clip_bbox_mapping.json" \
--output-filename "./results/tvqa_plus_step100_summary.json" \
--detailed-output-filename "./results/tvqa_plus_step100_detail.json" \
--grounding-model "grok-4-fast-reasoning" \
--vision-model "gpt-4o" \
--grounding-base-url "https://api2.aigcbest.top/v1" \
--vision-base-url "https://dashscope.aliyuncs.com/compatible-mode/v1" \
--grounding-api-key "$qdd_api" \
--vision-api-key "$aliyun_api" \
--grounding-cache-json-path "../Tvqa_data/grounding_pairs_6000_samples.json"
Run on TVQA+:
python src/evaluation/lvagent/evaluate_local_unified.py \
--dataset tvqa_plus \
--llm-path "/path/to/your/local_llm" \
--max_turn 5 \
--gpu_memory_utilization 0.4
Key Arguments¶
--dataset:tvqaortvqa_plus(required).--llm-path: local model directory path used by vLLM (required).--questions-path: optional override for questions file path; dataset-specific default is used if omitted.--subs-path: optional override for subtitles file path; dataset-specific default is used if omitted.--base-frame-dir: optional override for frame root directory; dataset-specific default is used if omitted.--bbox-json-path: bbox JSON path (used fortvqa_plus; ignored bytvqa).--output-filename: summary output JSON filename.--detailed-output-filename: detailed conversation log JSON filename.--grounding-model,--vision-model: model names for grounding/vision API calls.--grounding-base-url,--vision-base-url: API endpoint base URLs.--grounding-api-key,--vision-api-key: API keys (defaults:qdd_api,aliyun_apienv vars).--grounding-cache-json-path: optional grounding cache JSON; unmatched questions fall back to grounding API.
Baseline Evaluation¶
This section outlines the baseline evaluation script for the LongVideoAgent repository, located in src/evaluation/baseline. The script evaluates a model's performance on the TVQA dataset (or variants like TVQA+) using an API-based model for question-answering tasks.
Overview¶
The evaluate_api_tvqa.py script processes video frames, subtitles, and questions to generate answers via an API-based model (e.g., Gemini). It supports input validation, multi-threaded processing, and result output. Dataset variants are handled by providing appropriate input files.
Directory Structure¶
src/evaluation/baseline/evaluate_api_tvqa.py: Runs evaluation on TVQA or its variants using an API-based model.utils.py: Contains utility functions, includingrun_simple_qa, for evaluation logic.
Script¶
evaluate_api_tvqa.py¶
Evaluates a model on the TVQA dataset (or variants like TVQA+) by calling an API. The model name corresponds to the API provider's model (e.g., gemini-2.5-pro-exp-03-25).
Usage:
python src/evaluation/baseline/evaluate_api_tvqa.py \
--questions-path ./tvqa_question.json \
--subs-path ./substitle.json \
--output-filename ./eval_gemini_tvqa.json \
--base-frame-dir ../bbt_frames \
--model gemini-2.5-pro-exp-03-25 \
--threads 10
Arguments:
--questions-path: Path to the TVQA questions JSON file (required).--subs-path: Path to the subtitles JSON file (required).--output-filename: Path for saving results (required).--base-frame-dir: Directory with video frame images (required).--model: API provider's model name (required).--threads: Number of threads for parallel processing (required; adjust based on API rate limits).
Behavior:
- Validates input paths (questions, subtitles, frames).
- Calls
run_simple_qafromutils.pyto perform evaluation via API. - Uses multi-threaded API calls, then reorders results back to the original question index before computing accuracy.
- Saves results to the specified output file.
- Supports TVQA variants by adjusting the
--questions-pathinput.
utils.py¶
Includes run_simple_qa, which handles API calls for question answering. It preserves question-answer alignment under multi-threading by sorting completed futures with their original indices before metric calculation. Configure the API base URL (e.g., base_url) in utils.py to match the API provider's endpoint.
Prerequisites¶
- Ensure input files (questions JSON, subtitles JSON, frame directory) exist.
- Configure the API base URL in
utils.pyfor the chosen model. - The specified model must be supported by the API provider.
- Install Python dependencies (e.g.,
argparse,os). - Set
--threadsbased on the API provider's rate limits to avoid overloading.
Notes¶
- Update the API base URL in
utils.pyto match your provider's endpoint. - Use a thread count suitable for the API's maximum load to optimize performance.
- Ensure the output directory is writable for result storage.
For TVQA+ or other variants, provide the appropriate questions JSON file via --questions-path.