
Choosing the Right Whisper Transcription ...

I recently built a batch transcription pipeline for course videos and conference calls. The goal was straightforward. Point a script at a folder, let it crunch through dozens of hours of audio, and spit out transcripts, SRT files, and VTT files. The harder decision was picking the actual transcription engine.

On paper, the options for running Whisper on an M2 Max look simple. In practice, the hardware acceleration story is messy and the model availability dictates your choice more than raw speed.

I evaluated three engines. Faster-whisper uses CTranslate2 under the hood. Mlx-whisper uses Apple’s MLX framework. Openai-whisper is the original OpenAI implementation. Each has different hardware paths and different model support, which makes direct comparisons difficult.

The CTranslate2 engine in faster-whisper has a confusing hardware story on Apple Silicon. CTranslate2 supports both CPU and GPU execution, but its GPU path is CUDA. On an NVIDIA card, it uses CUDA. On a Mac, there is no Metal backend, so it runs on the CPU, where it can use the Apple Accelerate framework. This matters because the CPU is a shared resource on a Mac, and heavy compute tasks can bog down the rest of the system.

Mlx-whisper actually taps into the Metal GPU. Apple built MLX specifically for machine learning workloads on their silicon, and it shows. The GPU handles the transcription workload, leaving the CPU free for other tasks.

The third option was the original openai-whisper engine. I ran this on a Windows PC with an NVIDIA GPU and CUDA. That machine was a 2014 i7-4790K with a GTX 1080 Ti. The power supply dated from 2016 and was degrading. It was time to retire it anyway, but the real justification was that the M2 Max was already noticeably faster across the board.

Model selection complicates the engine choice. The distil-large-v3 model is the fastest option that maintains near-best accuracy. It is only available in faster-whisper. Mlx-whisper does not support it. The original openai-whisper does not support it either. If you want distil-large-v3, you are committing to faster-whisper and its CPU-bound execution on Apple Silicon.

The turbo model, large-v3-turbo, is available in all three engines. This makes it the only model that allows a like-for-like comparison across engines. The practical question, though, was how each engine performs in its best configuration: distil-large-v3 running on the CPU via faster-whisper against the turbo model running on the Metal GPU via mlx-whisper. In that matchup, the Metal version pulled ahead. The GPU acceleration made a clear difference.
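To make the two configurations concrete, here is a minimal sketch of how each engine might be called behind one function. It is an illustration, not the script from this project. The `transcribe` wrapper and the segment dictionary shape are hypothetical, `compute_type="int8"` is an assumed setting, and `mlx-community/whisper-large-v3-turbo` is an assumed Hugging Face repository name for the MLX build of the turbo model.

```python
from typing import Dict, List


def transcribe(path: str, engine: str) -> List[Dict]:
    """Return a list of {"start", "end", "text"} dicts for one audio file."""
    if engine == "faster-whisper":
        # Imported lazily so only the selected engine needs to be installed.
        from faster_whisper import WhisperModel

        # CTranslate2 has no Metal backend, so on a Mac this runs on the CPU.
        # compute_type="int8" is an assumption, not a setting from the post.
        model = WhisperModel("distil-large-v3", device="cpu", compute_type="int8")
        segments, _info = model.transcribe(path)
        return [{"start": s.start, "end": s.end, "text": s.text} for s in segments]

    if engine == "mlx-whisper":
        import mlx_whisper

        # Assumed repo name for the MLX conversion of large-v3-turbo.
        result = mlx_whisper.transcribe(
            path, path_or_hf_repo="mlx-community/whisper-large-v3-turbo"
        )
        return [
            {"start": s["start"], "end": s["end"], "text": s["text"]}
            for s in result["segments"]
        ]

    raise ValueError(f"unknown engine: {engine!r}")
```

This version reloads the faster-whisper model on every call to keep the example short. A batch script would more likely load the model once and reuse it across files.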

I looked at two other options before settling on this final comparison. MetalRT, also called RunAnywhere, seemed promising but it requires an M3 chip or newer for its GPU engine. On an M2 Max, it falls back to llama.cpp, which defeats the purpose. WhisperKit uses Core ML for inference, which sounds ideal for Apple hardware, but it lacks both the turbo and distil models. It only supports the standard Whisper sizes. That was a dealbreaker for my use case.

The batch processing script needed specific features to handle a large volume of files reliably. Moving completed files to a done folder keeps the working directory clean. Deleting source files after successful transcription saves disk space when dealing with hundreds of video files. The most critical feature is the ability to resume and skip existing files. Transcription takes time, and a crashed process should not mean starting over.
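The resume and cleanup behavior can be sketched with the standard library alone. This is one possible shape, not the actual script. The function names, the extension list, and the rule that an existing `<stem>.txt` marks a file as finished are all assumptions made for illustration, and the done folder is assumed to live outside the source folder.

```python
import shutil
from pathlib import Path

MEDIA_EXTS = (".mp4", ".mp3", ".wav", ".m4a")  # assumed set of input types


def pending_files(src_dir: Path, out_dir: Path, exts=MEDIA_EXTS):
    """Yield media files that do not yet have a transcript in out_dir."""
    for path in sorted(src_dir.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in exts:
            continue
        if (out_dir / f"{path.stem}.txt").exists():
            continue  # transcribed on an earlier run, so skip it
        yield path


def finish(path: Path, done_dir: Path, delete_source: bool = False) -> None:
    """Retire a source file once its outputs were written successfully."""
    if delete_source:
        path.unlink()
    else:
        done_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(done_dir / path.name))
```

Calling `finish` only after all outputs are written means a crash mid-file leaves the source in place, and the next run picks it up again.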

Output organization matters when processing course content. Flat output with prefix paths prevents filename collisions. If two different course modules have a file named “intro.mp4”, the output transcripts need distinct names. Adding the relative path as a prefix solves this cleanly.
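A small sketch of the prefixing idea, assuming a double underscore as the separator. The `flat_name` helper and the separator choice are hypothetical; any separator that cannot appear in a directory name would work.

```python
from pathlib import Path


def flat_name(source: Path, root: Path, suffix: str, sep: str = "__") -> str:
    """Build a flat output filename from a source path relative to root."""
    relative = source.relative_to(root).with_suffix("")  # drop ".mp4" etc.
    return sep.join(relative.parts) + suffix


root = Path("courses")
print(flat_name(root / "module1" / "intro.mp4", root, ".srt"))
print(flat_name(root / "module2" / "intro.mp4", root, ".srt"))
```

The two `intro.mp4` files map to `module1__intro.srt` and `module2__intro.srt`, so they can share one output directory.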

I needed multiple output formats from a single pass. Transcripts are for reading. SRT files are for importing into video editors. VTT files are for web players. Generating all three at once avoids running the same audio through the engine multiple times.
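Once the engine has produced timed segments, all three formats are just different renderings of the same list. The sketch below assumes each segment is a dictionary with `start` and `end` in seconds plus `text`. The function names are hypothetical. The main difference between the two subtitle formats is that SRT numbers its cues and uses a comma before the milliseconds, while VTT starts with a `WEBVTT` header and uses a period.

```python
from typing import Dict, List


def timestamp(seconds: float, decimal_mark: str) -> str:
    """Format seconds as HH:MM:SS<mark>mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{ms:03d}"


def render_outputs(segments: List[Dict]) -> Dict[str, str]:
    """Render one list of segments as plain text, SRT, and VTT."""
    lines, srt_cues, vtt_cues = [], [], []
    for index, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        lines.append(text)
        srt_span = f"{timestamp(seg['start'], ',')} --> {timestamp(seg['end'], ',')}"
        vtt_span = f"{timestamp(seg['start'], '.')} --> {timestamp(seg['end'], '.')}"
        srt_cues.append(f"{index}\n{srt_span}\n{text}\n")
        vtt_cues.append(f"{vtt_span}\n{text}\n")
    return {
        ".txt": "\n".join(lines) + "\n",
        ".srt": "\n".join(srt_cues),
        ".vtt": "WEBVTT\n\n" + "\n".join(vtt_cues),
    }
```

The returned dictionary is keyed by file extension, so the caller can write each value next to the others without touching the audio again.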

One upstream format caused unexpected problems. Some of the course transcripts were originally created as RTF documents and carried invisible markup into the text pipeline. When those transcripts were processed and stored in a vector database for semantic search, the RTF artifacts corrupted the embeddings. I added a cleanup step using the striprtf library to strip the markup before ingestion. Not a transcription problem, but it surfaced in the same pipeline.
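A cleanup step along these lines might look like the sketch below. The `clean_transcript` name and the check for a leading `{\rtf` header are assumptions for illustration. The `rtf_to_text` function is the entry point the striprtf package provides.

```python
def clean_transcript(raw: str) -> str:
    """Return plain text, stripping RTF markup when the input looks like RTF."""
    if raw.lstrip().startswith("{\\rtf"):
        # Lazy import: striprtf is only needed when RTF input actually shows up.
        from striprtf.striprtf import rtf_to_text

        return rtf_to_text(raw).strip()
    # Already plain text, so only trim surrounding whitespace.
    return raw.strip()
```

Running every transcript through a function like this before embedding keeps plain-text files unchanged and converts only the ones that carry RTF markup.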

Comparing the engines fairly required identical interfaces. I wrote a separate script for mlx-whisper that matched the command line interface of the faster-whisper script exactly. Same flags, same output structure, same behavior. The only variable was the engine underneath.
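One way to keep two scripts on the same interface is to build the argument parser in a shared function and import it from both. The flag names below are hypothetical, chosen to mirror the features described earlier. They are not the actual flags of either script.

```python
import argparse


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Return the CLI shared by every engine-specific script."""
    parser = argparse.ArgumentParser(
        prog=prog, description="Batch-transcribe a folder of media files."
    )
    parser.add_argument("input_dir", help="folder to scan for media files")
    parser.add_argument("--output-dir", default="transcripts")
    parser.add_argument("--done-dir", default=None,
                        help="move finished sources here")
    parser.add_argument("--delete-source", action="store_true",
                        help="delete sources after a successful run")
    parser.add_argument("--formats", nargs="+", choices=["txt", "srt", "vtt"],
                        default=["txt", "srt", "vtt"])
    return parser


# Both scripts parse the same command line into the same namespace.
argv = ["videos", "--formats", "srt", "vtt", "--delete-source"]
args_a = build_parser("transcribe_faster").parse_args(argv)
args_b = build_parser("transcribe_mlx").parse_args(argv)
```

With the parser shared, the engine call is the only part of each script that differs.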

The final decision came down to a tradeoff between model efficiency and hardware acceleration. Distil-large-v3 on the CPU is fast because the model itself is small and optimized. Turbo on the Metal GPU is fast because the hardware acceleration is substantial. The GPU acceleration won in my testing.
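The post does not say how the runs were timed. One common way to compare engines on the same file is a speed factor: seconds of audio divided by seconds of wall-clock time. The `speed_factor` helper below is a hypothetical sketch of that measurement, and the caller is assumed to know the audio duration already.

```python
import time
from typing import Callable


def speed_factor(run: Callable[[], object], audio_seconds: float) -> float:
    """Return audio duration divided by wall-clock time (higher is faster)."""
    started = time.perf_counter()
    run()  # for example: lambda: engine_transcribe("lecture.mp4")
    elapsed = time.perf_counter() - started
    return audio_seconds / elapsed
```

A factor of 10 would mean an hour of audio finishes in about six minutes. Measuring the same file with both engines and comparing the two factors gives a number that does not depend on file length.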

For my batch workload on the M2 Max, mlx-whisper with the turbo model is the right choice. The Metal GPU keeps the system responsive during long batch jobs. The turbo model provides excellent accuracy. The pipeline runs end to end without intervention, churning through course videos and conference recordings and dropping organized transcripts, SRT, and VTT files into the output directory.

The retired Windows PC can stay retired.
