Offline Grounding Cache¶

For training data preprocessing, run a standalone offline grounding pass to generate an initial occur-clip cache (question -> clip), then inject this cache into your training prompt pipeline.

We strongly recommend this step because high-quality initial subtitles are important for downstream reasoning. If no cache is provided, the pipeline may fall back to randomly selecting one clip's subtitles, which makes later Master Agent reasoning much harder.

Script¶

src/dataset/build_grounding_cache.py

Usage¶

python src/dataset/build_grounding_cache.py \
  --dataset tvqa_plus \
  --questions-path /path/to/train.json \
  --subs-path /path/to/all_episodes_subtitles_by_clips.json \
  --grounding-model "grok-4-fast-reasoning" \
  --grounding-base-url "https://api2.aigcbest.top/v1" \
  --output-dir /path/to/cache_dir \
  --threads 8

Optional Arguments¶

--grounding-api-key: if omitted, reads env qdd_api.
--output-filename: custom cache filename; default is grounding_cache_{dataset}_{model}.json.
--max-samples: run on first N samples for smoke test.
--overwrite: force regenerate all entries.

Output Format¶

The cache is a JSON object keyed by sample index. Each entry includes at least:

question: question text
clip: predicted occur clip label

This format is compatible with the cache lookup used in:

src/evaluation/lvagent/evaluate_api_unified.py via --grounding-cache-json-path