Transcript Import

Canonical Model

Transcript files live in _data/transcripts/*.yml.
Video assets reference transcripts through the transcript_id field in _data/video_assets.yml.
_data/transcripts.yml is legacy and not used for active content.
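
The relationship above can be checked with a small script. The following is a minimal sketch, not the logic of ./bin/transcripts audit. It assumes the asset records from _data/video_assets.yml have already been parsed into a list of mappings, and that each transcript file is named after its ID (<transcript_id>.yml). Both the record shape and the function name missing_transcripts are hypothetical.

```python
from pathlib import Path


def missing_transcripts(assets, transcripts_dir):
    """Return transcript_ids referenced by assets that have no matching .yml file.

    Assumption: a transcript with ID "abc" is stored as <transcripts_dir>/abc.yml.
    """
    # Collect the IDs that actually exist on disk (file name without extension).
    available = {path.stem for path in Path(transcripts_dir).glob("*.yml")}

    missing = []
    for asset in assets:
        transcript_id = asset.get("transcript_id")
        # Assets without a transcript_id are ignored, not reported.
        if transcript_id and transcript_id not in available:
            missing.append(transcript_id)
    return missing
```

Called with the parsed asset list and "_data/transcripts", it returns the referenced IDs that have no file, in the order the assets list them.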

Audit current repository transcript integrity:
- ./bin/transcripts audit
Build ID-suffixed staging files (recommended for ambiguous filenames):
- ./bin/transcripts prepare --source-dir /Volumes/Dock_1TB/vimeo/outbox --output-dir tmp/transcript-id-staging --min-confidence 0.8 --clean-output
Run import in dry-run mode:
- ./bin/transcripts dry-run --source-dir tmp/transcript-id-staging --min-confidence 0.9
Review output reports:
- tmp/transcript-import-report.json
- tmp/transcript-import-report.md
Apply high-confidence mappings:
- ./bin/transcripts ingest --source-dir tmp/transcript-id-staging --min-confidence 0.9
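
When the steps above are scripted, it helps to build the dry-run and ingest commands from one place so that both always use the same source directory and threshold. The following is a rough sketch using only the subcommands and flags shown above. The helper names transcripts_cmd and run_step are hypothetical, and it assumes the CLI exits with a non-zero status on failure, which this document does not confirm.

```python
import subprocess

TRANSCRIPTS_BIN = "./bin/transcripts"


def transcripts_cmd(step, source_dir, min_confidence=0.9):
    """Build the argument list for a dry-run or ingest invocation."""
    if step not in ("dry-run", "ingest"):
        raise ValueError(f"unsupported step: {step}")
    return [
        TRANSCRIPTS_BIN, step,
        "--source-dir", str(source_dir),
        "--min-confidence", str(min_confidence),
    ]


def run_step(step, source_dir, min_confidence=0.9):
    """Run one step; raises CalledProcessError if the command exits non-zero."""
    subprocess.run(transcripts_cmd(step, source_dir, min_confidence), check=True)
```

A typical use would be run_step("dry-run", "tmp/transcript-id-staging"), then a manual review of the reports, and only then run_step("ingest", "tmp/transcript-id-staging"). The review stays a manual step on purpose.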

If filenames already include explicit IDs and do not need staging, skip the prepare step and run the import directly against the source directory:

Run import in dry-run mode:
- ./bin/transcripts dry-run --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
Review output reports:
- tmp/transcript-import-report.json
- tmp/transcript-import-report.md
Apply high-confidence mappings:
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --min-confidence 0.9
Re-run pipeline validation:
- ./bin/transcripts validate

Ingest, audit, validate, and commit in one command:
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit
Ingest, audit, validate, commit, and push in one command:
- ./bin/transcripts ingest --source-dir /Volumes/Dock_1TB/vimeo/outbox --auto-commit --auto-push

Supported source file formats: .txt, .md, .srt, .vtt.
Existing transcript files are not overwritten unless --force is supplied.
Low-confidence mappings are never auto-applied; review those in the report first.
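
Before a run, it can be useful to see which files in a source directory fall outside the supported formats listed above. This is a minimal pre-flight sketch, not part of ./bin/transcripts. It assumes a non-recursive scan and case-insensitive extension matching, and the CLI itself may behave differently on both points. The function name split_by_support is hypothetical.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_SUFFIXES = {".txt", ".md", ".srt", ".vtt"}


def split_by_support(source_dir):
    """Return (supported, unsupported) file names found directly in source_dir."""
    supported, unsupported = [], []
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue  # subdirectories are skipped (assumption: non-recursive scan)
        # Assumption: extensions are compared case-insensitively.
        if path.suffix.lower() in SUPPORTED_SUFFIXES:
            supported.append(path.name)
        else:
            unsupported.append(path.name)
    return supported, unsupported
```

Files reported as unsupported can be converted or removed before running dry-run, which keeps the import report focused on real mapping questions.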