A project to collect transcripts from Coding Train videos.
See training-data.
- Collect YouTube video ids:
script/collect-video-ids.js
uses the Youtube API to collect all the video IDs from the Coding Train channel and save them todata/video-ids.json
- Download audio for each video:
script/download-audio-from-youtube.py
uses yt-dlp to download M4A audio files for each YouTube video - Upload audio files to cloud storage: This is a manual process for now. Put audio files on ByteScale with public URLs in the format
https://upcdn.io/FW25b4F/raw/coding-train/{youtube-video-id}.m4a
- Run a local webhook server: Run
node webhook-handler.js
to stand up a local Node.js server to receive webhooks - Run a local tunnel: Run
ngrok http 3000
to create a public URL for the webhook server - Transcribe audio using Whisper on Replicate: Run
NGROK_HOST="your-ngrok-host" node script/transcribe-audio.mjs
- User Whisper on Replicate to transcribe audio files to text. The webhook handler takes care of receiving the finished transcripts and saving them to disk in the transcripts directory. - Create LLM training data:
script/compile-llm-training-data.js
gathers up all the transcriptions and stuffs them into a single JSONL (JSON lines) file for training a language model like Llama 2.