Offline Grounding Cache
For training data preprocessing, run a standalone offline grounding pass to generate an initial occur-clip cache (question -> clip), then inject this cache into your training prompt pipeline.
We strongly recommend this step because high-quality initial subtitles are important for downstream reasoning. If no cache is provided, the pipeline may fall back to randomly selecting one clip's subtitles, which makes later Master Agent reasoning much harder.
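The lookup-then-fallback behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the helper name `pick_initial_clip`, the clip labels, and the shape of `subs_by_clip` (clip label -> subtitle text) are hypothetical, and cache keys are assumed to be sample indices stored as strings.

```python
import random
from typing import Optional


def pick_initial_clip(
    sample_idx: int,
    cache: Optional[dict],
    subs_by_clip: dict,
    rng: random.Random,
) -> str:
    """Return the clip whose subtitles seed the prompt (hypothetical helper)."""
    # Assumption: keys are sample indices as strings, because JSON object
    # keys are always strings.
    entry = (cache or {}).get(str(sample_idx))
    if entry is not None and entry.get("clip") in subs_by_clip:
        return entry["clip"]
    # No usable cache entry: fall back to a randomly chosen clip.
    return rng.choice(sorted(subs_by_clip))


# Hypothetical data for illustration only.
subs = {"clip_00": "...", "clip_01": "...", "clip_02": "..."}
cache = {"0": {"question": "Who opens the door?", "clip": "clip_02"}}
rng = random.Random(0)

print(pick_initial_clip(0, cache, subs, rng))  # uses the cached clip
print(pick_initial_clip(1, cache, subs, rng))  # no entry, random fallback
```

With a cache entry the starting clip is deterministic; without one, it depends on the random draw, which is why the initial subtitles are often unrelated to the question.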
Script
src/dataset/build_grounding_cache.py
Usage
python src/dataset/build_grounding_cache.py \
--dataset tvqa_plus \
--questions-path /path/to/train.json \
--subs-path /path/to/all_episodes_subtitles_by_clips.json \
--grounding-model "grok-4-fast-reasoning" \
--grounding-base-url "https://api2.aigcbest.top/v1" \
--output-dir /path/to/cache_dir \
--threads 8
Optional Arguments
--grounding-api-key: if omitted, reads the environment variable qdd_api.
--output-filename: custom cache filename; default is grounding_cache_{dataset}_{model}.json.
--max-samples: run on the first N samples as a smoke test.
--overwrite: force regeneration of all entries.
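The defaults for the API key and the output filename could be resolved along these lines. This is a sketch only: the function names are hypothetical, the environment variable is assumed to be named `qdd_api`, and the real script may normalize the model name before building the filename.

```python
import os
from typing import Optional


def resolve_api_key(cli_value: Optional[str], env_var: str = "qdd_api") -> str:
    """Prefer the CLI value, else read the environment (hypothetical helper)."""
    if cli_value:
        return cli_value
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"pass --grounding-api-key or set {env_var}")
    return key


def default_cache_filename(dataset: str, model: str) -> str:
    """Fill the documented pattern grounding_cache_{dataset}_{model}.json."""
    # Assumption: the model name is inserted verbatim, without sanitization.
    return f"grounding_cache_{dataset}_{model}.json"


print(default_cache_filename("tvqa_plus", "grok-4-fast-reasoning"))
```

Under these assumptions, the command shown in Usage would write its cache to `grounding_cache_tvqa_plus_grok-4-fast-reasoning.json` inside the output directory.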
Output Format
The cache is a JSON object keyed by sample index. Each entry includes at least:
question: question text
clip: predicted occur clip label
This format is compatible with the cache lookup used in:
src/evaluation/lvagent/evaluate_api_unified.py via --grounding-cache-json-path
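Before handing a cache file to a downstream script, it can be useful to check that it matches the format described above. The loader below is a minimal sketch, not part of the repository: `load_grounding_cache` is a hypothetical name, and it only checks the two documented fields (`question` and `clip`), leaving any extra fields untouched.

```python
import json
import os
import tempfile


def load_grounding_cache(path: str) -> dict:
    """Load a cache file and check the documented minimum fields."""
    with open(path, encoding="utf-8") as f:
        cache = json.load(f)
    if not isinstance(cache, dict):
        raise ValueError("cache must be a JSON object keyed by sample index")
    for idx, entry in cache.items():
        if not isinstance(entry, dict):
            raise ValueError(f"entry {idx!r} must be a JSON object")
        missing = {"question", "clip"} - set(entry)
        if missing:
            raise ValueError(f"entry {idx!r} is missing: {sorted(missing)}")
    return cache


# Round-trip demo with a placeholder entry (values are illustrative only).
example = {"0": {"question": "Who opens the door?", "clip": "clip_02"}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "grounding_cache_example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(example, f)
    loaded = load_grounding_cache(path)

print(loaded["0"]["clip"])
```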