Generating Video Transcripts with Whisper
One of my goals when building this site (and its earlier iterations) was to maintain a back catalogue of presentations I’ve given over the years. Presentations are one of my main ways of sharing work externally, and having them all in one place gives me a few benefits:
- A single timeline of what I’ve presented
- A canonical reference point for each talk (useful for cross-referencing, follow-up, etc.)
- A place to easily search when “I’m sure I’ve spoken about this somewhere”
The last of these requires transcripts, and it was the original motivation for the process I’ll describe here.
All the source code related to this process can be found in the scorg-utils GitHub repository.
As an example, I’ll use my lightning talk from ORConf last year. As it’s just two minutes long, it makes a good test case for the process without the overhead of something longer.
Whisper
whisper is an open-source audio transcription model from OpenAI, first released in 2022. It quickly earned a reputation for being free, fast, and accurate, and has been my go-to transcription tool ever since.1
Whisper comes as a family of models, each trading off speed and memory for accuracy. There are many different bindings available for the models to fit into various workflows. Here I’m using whisper.cpp as it runs faster on my laptop compared to the PyPI version.
I won’t cover installing whisper.cpp here, but both its README and Homebrew formula provide good starting points. In my case I’m using the turbo model, as that runs faster than real time and produces good results.
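For reference, whisper.cpp ships a helper script for downloading the pre-converted models. Assuming a source checkout (the path is current at the time of writing), fetching the turbo model looks like this:
./models/download-ggml-model.sh large-v3-turbo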
With whisper installed, the plan is to generate a WebVTT2 file of the talk, which will be shown on the talk’s webpage and kept in sync alongside the playing video.
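A WebVTT file is plain text: a WEBVTT header followed by timed cues, each a timestamp range and its text. A minimal example (the timestamps here are illustrative) looks like this:
WEBVTT

00:00:00.000 --> 00:00:01.500
Right, hi.

00:00:01.500 --> 00:00:04.000
For those that don't know me, I'm Simon.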
Simple generation
With a model downloaded, whisper.cpp can be used to create a simple transcript right out of the box. This is done with the following commands:
ffmpeg -i ${INPUT_VIDEO} -ar 16000 -ac 1 -c:a pcm_s16le ${OUTPUT_WAV}
whisper-cli -m ${HOME}/.models/whisper/ggml-large-v3-turbo.bin -sow -l en \
    -ovtt ${OUTPUT_WAV}
whisper.cpp only accepts WAV files as input, so I first extract the audio using ffmpeg, converting it to the 16 kHz mono 16-bit PCM format whisper expects. For the transcription itself, I use the following options:
-m <path>
: Path to the model to use
-sow
: Split transcript by whole words. This makes the transcript easier to read
-l en
: Input language, in this case English
-ovtt
: Output a WebVTT file
This produces an output file named like the input, but with .vtt appended to the end, and looks like this:
Right, hi. For those that don't know me, I'm Simon. When I'm not the person behind the
camera doing filming at Orconf, I work at Embercosm. You may have heard of us because
we do compilers, mostly OpenRISC and RISC-V compilers, and I spend a lot of time building
a lot of compilers. Our website, once a week, takes the top of tree of GCC and LLVM and
builds RISC-V toolchains, OpenRISC toolchains, and when I say a lot of toolchains, I mean
there was a release of GCC and LLVM at the same time, and let's just say the two machines under
my desk were very unhappy with me. The reason I'm mostly talking is because I just stick these
toolchains out on the internet, and apparently people use them. For instance, when the RP 2350
was announced a couple of months ago, I was like, hey, I wonder how the boot run works?
And there's just a random reference to one of my compilers in there.
I should work on SEO, because if you actually Google that string, Google says there are zero results
except for this page, so clearly some work there. But it got me thinking, if people are using these
toolchains, are they any good? Does people have any feedback? So this is sort of an open call for
anybody who's using RISC-V or OpenRISC, GCC, or LLVM. Tell me, what works? What sucks? Do you wish you
had binaries that ran natively on A064 or RISC-V? We mostly use Nulib. Does PcLib C of interest
to people these days? Does ZYX extension? Because that would be a standard one, not XYZ.
Does the compiler break? No, really, the compiler is definitely wrong. Tell me why we've done
something wrong. Well, the community at large has done something wrong. Or anything other interesting
you're doing with the tools. Basically, any feedback at all, because as I say, I throw these on the internet,
and yeah, apparently people are using them. So any feedback at all, throw them at this email,
toolchains@embacosm.com. I can't guarantee that I will respond to all of your emails, but I can
guarantee I'll read them. So be nice. Thank you.
APPLAUSE
This is a good start, but there are some issues which need cleaning up:
- Some words are mistranscribed. For example, “Newlib” comes out as “Nulib”, and “Picolibc” as “PcLib C”.
- The line structure is a bit off. Some lines are too long, and they break at seemingly random places. The output would look better broken up along sentence boundaries.
I’ll now cover an approach to solving these and producing a more appealing output.
Generating transcripts which follow structure
To solve the structure issue, whisper.cpp provides a couple of options which control where line breaks are added. The first is the -sow option used above, and the second is -ml, which sets a maximum line length. This helps with the long lines, but doesn’t fix the random breaks. It can, however, be used to provide the starting point for a neater output.
The idea here is first to have whisper generate a transcript that splits on every word. This can then be processed to merge single-word lines together to produce a more natural output. Using both the -sow and -ml 1 options creates the following transcript:
Right,
hi.
For
those
who
don't
know
me,
I'm
Simon.
When
I'm
not
the
person
behind
the
camera
doing
filming
at
Orconf,
I
work
at
Embercosm.
You
may
have
heard
of
us
because
we
do
compilers,
mostly
OpenRISC
and
RISC-V
compilers,
and
I
spend
a
lot
of
time
building
a
lot
of
compilers.
Our
website,
once
a
week,
takes
the
top
of
tree
of
GCC
and
LLVM
and
builds
RISC-V
toolchains,
OpenRISC
toolchains,
and
when
I
say
a
lot
of
toolchains,
I
mean
there
was
a
release
of
GCC
and
LLVM
at
the
same
time,
and
let's
just
say
the
two
machines
under
my
desk
were
very
unhappy
with
me.
The
reason
I'm
mostly
talking
is
because
I
just
stick
these
toolchains
out
on
the
internet,
and
apparently
people
use
them.
For
instance,
when
the
RP2350
was
announced
a
couple
of
months
ago,
I
was
like,
"Hey,
I
wonder
how
the
boot
run
works?"
And
there's
just
a
random
reference
to
one
of
my
compilers
in
there.
I
should
work
on
SEO,
because
if
you
actually
that
string,
says
there
are
zero
results
except
for
this
page,
so
clearly
some
work
there.
But
it
got
me
thinking,
if
people
are
using
these
toolchains,
are
they
any
good?
Does
people
have
any
feedback?
So
this
is
sort
of
an
open
call
for
anybody
who's
using
RISC-V
or
OpenRISC
GCC
or
LLVM.
Tell
me
what
works,
what
sucks?
Do
you
wish
you
had
binaries
that
ran
natively
on
064
or
RISC-V?
We
mostly
use
NewLib.
Does
Peacolib
C
of
interest
to
people
these
days?
Does
ZYX
extension,
because
that
would
be
a
standard
one,
not
XYZ?
Does
the
compiler
break?
No,
really.
The
compiler's
definitely
wrong.
Tell
me
why
we've
done
something
wrong.
well,
the
community
at
large
has
done
something
wrong.
Or
anything
other
interesting
you're
doing
with
the
tools.
Basically,
any
feedback
at
all.
Because
as
I
say,
I
throw
these
on
the
internet,
and
yeah,
apparently
people
are
using
them.
So
any
feedback
at
all,
throw
them
at
this
email,
toolchains@embacosm.com.
I
can't
guarantee
that
I
will
respond
to
all
of
your
emails,
but
I
can
guarantee
I'll
read
them.
So
be
nice.
Thank
you.
Post-processing consists of taking each line and testing whether it should be merged with the preceding one. This is done with the following ruleset:
- Does the previous line end with “.”, “?”, or “!”? Start a new line.
- Does the previous line have 12 or more words, and end with “,”? Start a new line.
- Otherwise, the line is part of the previous one.
The merging is done through this Python script. When combining lines, it is important to update the start and end times for the new line to cover the combined timespan.
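As an illustration, here is a simplified sketch of that merge pass in Python. This is not the actual script from scorg-utils: the VTT handling is minimal and assumes exactly one word per cue, but it implements the ruleset above and extends each line’s timespan as words are merged in:
import sys

def parse_vtt(path):
    """Read a simple VTT file with one word per cue into [start, end, text] entries."""
    with open(path) as f:
        lines = [l.strip() for l in f]
    cues = []
    for i, line in enumerate(lines):
        if "-->" in line:
            start, end = (t.strip() for t in line.split("-->"))
            cues.append([start, end, lines[i + 1]])
    return cues

def should_break(prev_text):
    """The ruleset: break after '.', '?' or '!', or after a 12+ word clause ending ','."""
    if prev_text.endswith((".", "?", "!")):
        return True
    return prev_text.endswith(",") and len(prev_text.split()) >= 12

def merge(cues):
    """Merge single-word cues into lines, updating each line's timespan as it grows."""
    merged = []
    for start, end, word in cues:
        if merged and not should_break(merged[-1][2]):
            merged[-1][1] = end             # the line now ends when this word does
            merged[-1][2] += " " + word
        else:
            merged.append([start, end, word])
    return merged

if __name__ == "__main__":
    print("WEBVTT\n")
    for start, end, text in merge(parse_vtt(sys.argv[1])):
        print(f"{start} --> {end}\n{text}\n")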
While the output isn’t perfect, it only needs minimal tweaks, and so is a good start for the next stage:
Right, hi.
For those who don't know me, I'm Simon.
When I'm not the person behind the camera doing filming at Orconf,
I work at Embercosm.
You may have heard of us because we do compilers, mostly OpenRISC and RISC-V compilers,
and I spend a lot of time building a lot of compilers.
Our website, once a week, takes the top of tree of GCC and LLVM and builds RISC-V toolchains,
OpenRISC toolchains, and when I say a lot of toolchains, I mean there was a release of GCC and LLVM at the same time,
and let's just say the two machines under my desk were very unhappy with me.
The reason I'm mostly talking is because I just stick these toolchains out on the internet,
and apparently people use them.
For instance, when the RP2350 was announced a couple of months ago,
I was like, "Hey, I wonder how the boot run works?" And there's just a random reference to one of my compilers in there.
I should work on SEO, because if you actually Google that string,
Google says there are zero results except for this page, so clearly some work there.
But it got me thinking, if people are using these toolchains, are they any good?
Does people have any feedback?
So this is sort of an open call for anybody who's using RISC-V or OpenRISC GCC or LLVM.
Tell me what works, what sucks?
Do you wish you had binaries that ran natively on 064 or RISC-V?
We mostly use NewLib.
Does Peacolib C of interest to people these days?
Does ZYX extension, because that would be a standard one, not XYZ?
Does the compiler break?
No, really.
The compiler's definitely wrong.
Tell me why we've done something wrong.
well, the community at large has done something wrong.
Or anything other interesting you're doing with the tools.
Basically, any feedback at all.
Because as I say, I throw these on the internet, and yeah,
apparently people are using them.
So any feedback at all, throw them at this email, toolchains@embacosm.com.
I can't guarantee that I will respond to all of your emails,
but I can guarantee I'll read them.
So be nice.
Thank you.
Editing the transcript
With the general layout now looking good, the final step is the most important: reviewing and editing. Edits here fall into one of the following categories:
- Fixing small layout changes the previous step didn’t get right.
- Correcting words so the transcript matches what was actually said.
The second of these takes the most time: watching the video and correcting words as it plays. For those who don’t like the sound of their own voice (guilty!), this can be painful, but it’s worth it for the best result.
A nice side benefit is that this doubles as an opportunity to review the presentation itself. I firmly believe it’s worth doing this to see what you would tweak next time, especially when you’re new to giving presentations. Doing both at the same time might make it feel less burdensome.
In terms of tooling, I don’t have any recommendations here; I just do this step in a text editor, making sure to keep the structure of the file valid. One thing I’ve found useful: if you’re moving words from one line to another, having the “single word per line” file to hand makes it easier to keep the timestamps in sync.
The end result is a polished transcript, ready to be used alongside the video:
Right, hi. For those that don't know me, I'm Simon.
When I'm not the person behind the camera doing filming at ORConf, I work at Embecosm.
You may have heard of us because we do compilers, mostly OpenRISC and RISC-V compilers,
and I spend a lot of time building a lot of compilers.
Our website, once a week, takes the top of tree of GCC and LLVM and builds RISC-V toolchains,
OpenRISC toolchains, and when I say a lot of toolchains, I mean there was a release of GCC and LLVM at the same time,
and let's just say the two machines under my desk were very unhappy with me.
The reason I'm mostly talking is because I just stick these toolchains out on the Internet,
and apparently people use them.
For instance, when the RP2350 was announced a couple of months ago,
I was like, hey, I wonder how the Boot ROM works?
And there's just a random reference to one of my compilers in there.
I should work on SEO, because if you actually Google that string,
Google says there are zero results except for this page, so clearly some work there.
But it got me thinking, if people are using these toolchains, are they any good?
Do people have any feedback?
So this is sort of an open call for anybody who's using RISC-V or OpenRISC, GCC or LLVM.
Tell me, what works? What sucks?
Do you wish you had binaries that ran natively on AArch64 or RISC-V?
We mostly use Newlib. Does PicoLibc of interest to people these days?
Does ZYX extension? Because that would be a standard one, not XYZ.
Does the compiler break? No, really, the compiler is definitely wrong.
Tell me why we've done something wrong.
Well, the community at large has done something wrong.
Or anything other interesting you're doing with the tools.
Basically, any feedback at all, because as I say, I throw these on the Internet,
and yeah, apparently people are using them.
So any feedback at all, throw them at this email, toolchains@embecosm.com.
I can't guarantee that I will respond to all of your emails,
but I can guarantee I'll read them.
So be nice.
Thank you.
[APPLAUSE]
Integrating with the website
This website is built using Jekyll, so some details are tied to its build system. The underlying concepts though are generic, and the same result can be achieved with any framework.
The end goal, as I mentioned in the introduction, is to show a transcript which remains in sync with a playing video, and selecting a line in the transcript jumps to that part of the video. In this case, the video is hosted on YouTube, and thankfully YouTube offers a solution to this exact problem.
The YouTube Player API lets us control an embedded YouTube player using JavaScript. I use this to run a timer every 250ms to check the current play time, which then highlights the matching line in the transcript. The inverse works similarly; each line gets its onclick handler set to seek the video to the appropriate time.
This script can be found in the scorg-utils repository. I have tried to comment it sufficiently to make it clear what is going on. The core “find the correct line” logic is shown below.3
// Search all transcript lines to find the one that should currently be
// active, give it the 'active' class, and if the active line has changed,
// trigger a scroll of the container. (transcriptLines, player, and
// scrollToLine are defined elsewhere in the full script.)
let lastActive = Infinity;
function syncTranscript() {
  const currentTime = player.getCurrentTime();
  transcriptLines.forEach((line, i) => {
    const start = parseFloat(line.dataset.start);
    const next = transcriptLines[i + 1] ?
      parseFloat(transcriptLines[i + 1].dataset.start) : Infinity;
    const isActive = currentTime >= start && currentTime < next;
    line.classList.toggle("active", isActive);
    if (isActive && i != lastActive) {
      scrollToLine(line);
      lastActive = i;
    }
  });
}
The transcript is added through a custom Jekyll tag which reads a named VTT file, and outputs it in the following format:
<div id="transcript-box">
<div class="transcript-line" data-start="0.0">
<p>Example line</p>
</div>
<!-- ... -->
</div>
This format is used for three reasons:
- By generating directly from the VTT file, there is one less manual step, and the VTT file becomes a single source of truth.
- By emitting real HTML rather than, for example, storing the data in a JavaScript variable, it is easier to statically search.
- HTML data attributes can be used to store the start timestamp, making it easier to track which line should be considered “active”.
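To illustrate what that tag does, here is an equivalent sketch in Python. The real implementation is a Jekyll tag (a Ruby plugin) in scorg-utils; the helper names below are mine, and data-start is emitted in seconds so the JavaScript’s parseFloat works directly:
def vtt_time_to_seconds(ts):
    """Convert a VTT timestamp (HH:MM:SS.mmm or MM:SS.mmm) to seconds."""
    parts = [float(p) for p in ts.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)
    return parts[0] * 3600 + parts[1] * 60 + parts[2]

def vtt_to_transcript_html(vtt_path):
    """Render each VTT cue as a transcript-line div inside the transcript box."""
    with open(vtt_path) as f:
        lines = [l.strip() for l in f]
    html = ['<div id="transcript-box">']
    for i, line in enumerate(lines):
        if "-->" in line:
            start = vtt_time_to_seconds(line.split("-->")[0].strip())
            html.append(f'  <div class="transcript-line" data-start="{start}">')
            html.append(f"    <p>{lines[i + 1]}</p>")
            html.append("  </div>")
    html.append("</div>")
    return "\n".join(html)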
This process is then wrapped up in a single include, so on a page where I want to add a video and transcript I simply write:
{% include scorg-transcript.md ytvideo="eXWzsR6UX3M" transcript="orconf24" %}
And the full video player and transcript system is loaded.
Optional bonus step: Apply this as captions for the original video
As a final step, these transcripts can also be uploaded as captions for the original videos, making them more accessible to a wider audience.
I’m not sure what other video providers support in this regard, but YouTube provides native support for importing WebVTT files for captioning. Details on this can be found in this YouTube Help article.
Future steps
Now that I have this process in hand, there are two things in this area I want to tackle next.
The first is applying this to other presentations I’ve given. Currently I have 19 talks in the backlog for review; over time more of those will be added here with full transcripts.
The second is adding slide decks next to each video. Ideally these would behave the same way as the transcript: changing one would update the other in tandem. I’m not sure yet exactly how I’ll do that (though I have some rough ideas), but I look forward to tackling it soon.
I hope the above has been useful if you try to achieve something similar. If you have any thoughts or ideas on this process, I’d love to hear from you.
1. While the AI landscape continues to evolve rapidly, and there are likely newer and similarly impressive models, whisper remains a useful and reliable tool. ↩
2. Using WebVTT gives some flexibility in the future if I wish to actually use the transcript as captions on a playing video, for example if I’m hosting it myself. The documentation on MDN provides a good guide on how this could be done. ↩
3. In the interest of full transparency, I’d like to note that this code was originally generated with the help of ChatGPT. It has since been edited by me to better meet my requirements and add Jekyll integration. Since it’s not yet clear if/how to “unscramble the eggs” of AI and human co-written code, I can’t comment on how much of it falls under my own copyright. ↩