OpenCode School

Exercise 4

Transcribe speech

Transcribe audio to text locally using Whisper.

Every major tech company runs speech-to-text models in the cloud, but you can do the same thing locally on your own machine — no API keys, no upload limits, no privacy concerns. In this exercise, you’ll use Whisper, OpenAI’s open-source speech recognition model, to transcribe audio to text.

You’ll use whisper-cpp, a fast C++ port of the original Python model. It runs entirely on your CPU (or GPU if available), and the Homebrew install is a single command.

Install the tools

You’ll need two tools:

brew install whisper-cpp ffmpeg

whisper-cpp runs the transcription model. FFmpeg converts audio files to the 16kHz WAV format that Whisper requires. If you did the Edit Videos exercise, you already have FFmpeg installed.

On Linux, build whisper-cpp from source or use your system package manager. See the whisper.cpp repo for instructions.
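The conversion itself is a one-liner. A sketch, assuming a source file named input.mp3 (a placeholder — any format FFmpeg can read works):

```shell
# Convert to 16 kHz, mono, 16-bit PCM WAV -- the format whisper-cpp expects.
# "input.mp3" and "output.wav" are placeholder filenames.
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le output.wav
```

The `-ar 16000` flag sets the sample rate, `-ac 1` downmixes to mono, and `pcm_s16le` is plain uncompressed 16-bit audio.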

Download a model

Whisper comes in multiple sizes — from tiny (75MB) to large (3GB). You’ll download a quantized version of the large-v3 model that offers a good balance of accuracy and file size (~1GB):

curl -o ggml-large-v3-q5_0.bin -L \
  'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-large-v3-q5_0.bin?download=true'

Get some audio

You need audio with speech to transcribe. Pick whichever option sounds easiest:

  • Record yourself — use QuickTime Player on macOS (File > New Audio Recording), or any recording app
  • Reuse audio from the Edit Videos exercise — if you already have an extracted audio track, use that
  • Download something — a podcast clip, a speech, a lecture, anything with clear spoken words

What you’ll do

  • Convert your audio to the 16kHz WAV format Whisper expects
  • Run transcription and get a timestamped text output
  • Export the transcript as SRT or VTT subtitles — the standard formats used for video captions
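The steps above can be sketched as follows, assuming the model file downloaded earlier and a converted file named output.wav. Depending on your whisper-cpp version, the Homebrew-installed binary may be named whisper-cli or whisper-cpp:

```shell
# Transcribe and print timestamped text to the terminal
whisper-cli -m ggml-large-v3-q5_0.bin -f output.wav

# Also write subtitle files (output.wav.srt and output.wav.vtt) next to the audio
whisper-cli -m ggml-large-v3-q5_0.bin -f output.wav -osrt -ovtt
```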

Whisper can also translate speech from other languages into English. If you have audio in another language, you can try that too.
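whisper-cpp exposes this through a translate flag. A sketch, assuming a recording named french.wav (a hypothetical filename):

```shell
# Auto-detect the spoken language and translate the transcript to English
whisper-cli -m ggml-large-v3-q5_0.bin -f french.wav -l auto --translate
```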