Generating Video Transcripts with Whisper
One of my goals when building this site (and its earlier iterations) was to maintain a back catalogue of presentations I’ve given over the years. Presentations are one of my main ways of sharing work externally, and having them all in one place gives me a few benefits:
- A single timeline of what I’ve presented
- A canonical reference point for each talk (useful for cross-referencing, follow-up, etc.)
- A place to easily search when “I’m sure I’ve spoken about this somewhere”
The last of these requires transcripts, and it was the original motivation for the process I’ll describe here.
All the source code related to this process can be found in the scorg-utils GitHub repository.
As an example, I’ll use my lightning talk from ORConf last year. As it’s just two minutes long, it makes a good test case for the process without the overhead of something longer.
Whisper
whisper is an open-source audio transcription model from OpenAI, first released in 2022. It quickly earned a reputation for being free, fast, and accurate, and has been my go-to transcription tool ever since.1
Whisper comes as a family of models, each trading off speed and memory for accuracy. There are many different bindings available for the models to fit into various workflows. Here I’m using whisper.cpp as it runs faster on my laptop compared to the PyPI version.
I won’t cover installing whisper.cpp here, but both its README and Homebrew formula provide good starting points. In my case I’m using the turbo model, as that runs faster than real time and produces good results.
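For reference, whisper.cpp ships a helper script for downloading the pre-converted models. Assuming a source checkout (the path is current at the time of writing), fetching the turbo model looks like this:
./models/download-ggml-model.sh large-v3-turbo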
With whisper installed, the plan is to generate a WebVTT2 file of the talk, which will be shown on the talk’s webpage and kept in sync alongside the playing video.
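A WebVTT file is plain text: a WEBVTT header followed by timed cues, each a timestamp range and its text. A minimal example (the timestamps here are illustrative) looks like this:
WEBVTT

00:00:00.000 --> 00:00:01.500
Right, hi.

00:00:01.500 --> 00:00:04.000
For those that don't know me, I'm Simon.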
Simple generation
With a model downloaded, whisper.cpp can be used to create a simple transcript right out of the box. This is done with the following commands:
ffmpeg -i ${INPUT_VIDEO} -ar 16000 -ac 1 -c:a pcm_s16le ${OUTPUT_WAV}
whisper-cli -m ${HOME}/.models/whisper/ggml-large-v3-turbo.bin -sow -l en \
    -ovtt ${OUTPUT_WAV}
whisper.cpp only accepts WAV files as input, so I first extract the audio using ffmpeg, converting it to the 16 kHz mono 16-bit PCM format whisper expects. For the transcription itself, I use the following options:
-m <path>
: Path to the model to use
-sow
: Split transcript by whole words. This makes the transcript easier to read
-l en
: Input language, in this case English
-ovtt
: Output a WebVTT file
This produces an output file named like the input, but with .vtt appended to the end, and looks like this:
Right, hi. For those that don't know me, I'm Simon. When I'm not the person behind the
camera doing filming at Orconf, I work at Embercosm. You may have heard of us because
we do compilers, mostly OpenRISC and RISC-V compilers, and I spend a lot of time building
a lot of compilers. Our website, once a week, takes the top of tree of GCC and LLVM and
builds RISC-V toolchains, OpenRISC toolchains, and when I say a lot of toolchains, I mean
there was a release of GCC and LLVM at the same time, and let's just say the two machines under
my desk were very unhappy with me. The reason I'm mostly talking is because I just stick these
toolchains out on the internet, and apparently people use them. For instance, when the RP 2350
was announced a couple of months ago, I was like, hey, I wonder how the boot run works?
And there's just a random reference to one of my compilers in there.
I should work on SEO, because if you actually Google that string, Google says there are zero results
except for this page, so clearly some work there. But it got me thinking, if people are using these
toolchains, are they any good? Does people have any feedback? So this is sort of an open call for
anybody who's using RISC-V or OpenRISC, GCC, or LLVM. Tell me, what works? What sucks? Do you wish you
had binaries that ran natively on A064 or RISC-V? We mostly use Nulib. Does PcLib C of interest
to people these days? Does ZYX extension? Because that would be a standard one, not XYZ.
Does the compiler break? No, really, the compiler is definitely wrong. Tell me why we've done
something wrong. Well, the community at large has done something wrong. Or anything other interesting
you're doing with the tools. Basically, any feedback at all, because as I say, I throw these on the internet,
and yeah, apparently people are using them. So any feedback at all, throw them at this email,
toolchains@embacosm.com. I can't guarantee that I will respond to all of your emails, but I can
guarantee I'll read them. So be nice. Thank you.
APPLAUSE
This is a good start, but there are some issues which need cleaning up:
- Some words are mistranscribed. For example, “Newlib” comes out as “Nulib”, and “Picolibc” as “PcLib C”.
- The line structure is a bit off. Some lines are too long, and they break at seemingly random places. The output would look better broken up along sentence boundaries.
I’ll now cover an approach to solving these and producing a more appealing output.
Generating transcripts which follow structure
To solve the structure issue, whisper.cpp provides a couple of options which control where line breaks are added. The first is the -sow option used above, and the second is -ml, which sets a maximum line length. This helps with the long lines, but doesn’t fix the random breaks. It can, however, be used to provide the starting point for a neater output.
The idea here is first to have whisper generate a transcript that splits on every word. This can then be processed to merge single-word lines together to produce a more natural output. Using both the -sow and -ml 1 options creates the following transcript:
Right,
hi.
For
those
who
don't
know
me,
I'm
Simon.
When
I'm
not
the
person
behind
the
camera
doing
filming
at
Orconf,
I
work
at
Embercosm.
You
may
have
heard
of
us
because
we
do
compilers,
mostly
OpenRISC
and
RISC-V
compilers,
and
I
spend
a
lot
of
time
building
a
lot
of
compilers.
Our
website,
once
a
week,
takes
the
top
of
tree
of
GCC
and
LLVM
and
builds
RISC-V
toolchains,
OpenRISC
toolchains,
and
when
I
say
a
lot
of
toolchains,
I
mean
there
was
a
release
of
GCC
and
LLVM
at
the
same
time,
and
let's
just
say
the
two
machines
under
my
desk
were
very
unhappy
with
me.
The
reason
I'm
mostly
talking
is
because
I
just
stick
these
toolchains
out
on
the
internet,
and
apparently
people
use
them.
For
instance,
when
the
RP2350
was
announced
a
couple
of
months
ago,
I
was
like,
"Hey,
I
wonder
how
the
boot
run
works?"
And
there's
just
a
random
reference
to
one
of
my
compilers
in
there.
I
should
work
on
SEO,
because
if
you
actually
that
string,
says
there
are
zero
results
except
for
this
page,
so
clearly
some
work
there.
But
it
got
me
thinking,
if
people
are
using
these
toolchains,
are
they
any
good?
Does
people
have
any
feedback?
So
this
is
sort
of
an
open
call
for
anybody
who's
using
RISC-V
or
OpenRISC
GCC
or
LLVM.
Tell
me
what
works,
what
sucks?
Do
you
wish
you
had
binaries
that
ran
natively
on
064
or
RISC-V?
We
mostly
use
NewLib.
Does
Peacolib
C
of
interest
to
people
these
days?
Does
ZYX
extension,
because
that
would
be
a
standard
one,
not
XYZ?
Does
the
compiler
break?
No,
really.
The
compiler's
definitely
wrong.
Tell
me
why
we've
done
something
wrong.
well,
the
community
at
large
has
done
something
wrong.
Or
anything
other
interesting
you're
doing
with
the
tools.
Basically,
any
feedback
at
all.
Because
as
I
say,
I
throw
these
on
the
internet,
and
yeah,
apparently
people
are
using
them.
So
any
feedback
at
all,
throw
them
at
this
email,
toolchains@embacosm.com.
I
can't
guarantee
that
I
will
respond
to
all
of
your
emails,
but
I
can
guarantee
I'll
read
them.
So
be
nice.
Thank
you.
Post-processing consists of taking each line and testing whether it should be merged with the preceding one. This is done with the following ruleset:
- Does the previous line end with “.”, “?”, or “!”? Start a new line.
- Does the previous line have 12 or more words, and end with “,”? Start a new line.
- Otherwise, the line is part of the previous one.
The merging is done through this Python script. When combining lines, it is important to update the start and end times for the new line to cover the combined timespan.
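As an illustration, here is a simplified sketch of that merge pass in Python. This is not the actual script from scorg-utils: the VTT handling is minimal and assumes exactly one word per cue, but it implements the ruleset above and extends each line’s timespan as words are merged in:
import sys

def parse_vtt(path):
    """Read a simple VTT file with one word per cue into [start, end, text] entries."""
    with open(path) as f:
        lines = [l.strip() for l in f]
    cues = []
    for i, line in enumerate(lines):
        if "-->" in line:
            start, end = (t.strip() for t in line.split("-->"))
            cues.append([start, end, lines[i + 1]])
    return cues

def should_break(prev_text):
    """The ruleset: break after '.', '?' or '!', or after a 12+ word clause ending ','."""
    if prev_text.endswith((".", "?", "!")):
        return True
    return prev_text.endswith(",") and len(prev_text.split()) >= 12

def merge(cues):
    """Merge single-word cues into lines, updating each line's timespan as it grows."""
    merged = []
    for start, end, word in cues:
        if merged and not should_break(merged[-1][2]):
            merged[-1][1] = end             # the line now ends when this word does
            merged[-1][2] += " " + word
        else:
            merged.append([start, end, word])
    return merged

if __name__ == "__main__":
    print("WEBVTT\n")
    for start, end, text in merge(parse_vtt(sys.argv[1])):
        print(f"{start} --> {end}\n{text}\n")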
While the output isn’t perfect, it only needs minimal tweaks, and so is a good start for the next stage:
Right, hi.
For those who don't know me, I'm Simon.
When I'm not the person behind the camera doing filming at Orconf,
I work at Embercosm.
You may have heard of us because we do compilers, mostly OpenRISC and RISC-V compilers,
and I spend a lot of time building a lot of compilers.
Our website, once a week, takes the top of tree of GCC and LLVM and builds RISC-V toolchains,
OpenRISC toolchains, and when I say a lot of toolchains, I mean there was a release of GCC and LLVM at the same time,
and let's just say the two machines under my desk were very unhappy with me.
The reason I'm mostly talking is because I just stick these toolchains out on the internet,
and apparently people use them.
For instance, when the RP2350 was announced a couple of months ago,
I was like, "Hey, I wonder how the boot run works?" And there's just a random reference to one of my compilers in there.
I should work on SEO, because if you actually Google that string,
Google says there are zero results except for this page, so clearly some work there.
But it got me thinking, if people are using these toolchains, are they any good?
Does people have any feedback?
So this is sort of an open call for anybody who's using RISC-V or OpenRISC GCC or LLVM.
Tell me what works, what sucks?
Do you wish you had binaries that ran natively on 064 or RISC-V?
We mostly use NewLib.
Does Peacolib C of interest to people these days?
Does ZYX extension, because that would be a standard one, not XYZ?
Does the compiler break?
No, really.
The compiler's definitely wrong.
Tell me why we've done something wrong.
well, the community at large has done something wrong.
Or anything other interesting you're doing with the tools.
Basically, any feedback at all.
Because as I say, I throw these on the internet, and yeah,
apparently people are using them.
So any feedback at all, throw them at this email, toolchains@embacosm.com.
I can't guarantee that I will respond to all of your emails,
but I can guarantee I'll read them.
So be nice.
Thank you.
Editing the transcript
With the general layout now looking good, the final step is the most important: reviewing and editing. Edits here fall into one of the following categories:
- Fixing small layout changes the previous step didn’t get right.
- Correcting words so the transcript matches what was actually said.
The second of these takes the most time: watching the video and correcting words as it plays. For those who don’t like the sound of their own voice (guilty!), this can be painful, but it’s worth it for the best result.
A nice side benefit is that this doubles as an opportunity to review the presentation itself. I firmly believe it’s worth doing this to see what you would tweak next time, especially when you’re new to giving presentations. Doing both at the same time might make it feel less burdensome.
In terms of tooling, I don’t have any recommendations here; I just do this step in a text editor, making sure to keep the structure of the file valid. One thing I’ve found useful: if you’re moving words from one line to another, having the “single word per line” file to hand makes it easier to keep the timestamps in sync.
The end result is a polished transcript, ready to be used alongside the video:
Right, hi. For those that don't know me, I'm Simon.
When I'm not the person behind the camera doing filming at ORConf, I work at Embecosm.
You may have heard of us because we do compilers, mostly OpenRISC and RISC-V compilers,
and I spend a lot of time building a lot of compilers.
Our website, once a week, takes the top of tree of GCC and LLVM and builds RISC-V toolchains,
OpenRISC toolchains, and when I say a lot of toolchains, I mean there was a release of GCC and LLVM at the same time,
and let's just say the two machines under my desk were very unhappy with me.
The reason I'm mostly talking is because I just stick these toolchains out on the Internet,
and apparently people use them.
For instance, when the RP2350 was announced a couple of months ago,
I was like, hey, I wonder how the Boot ROM works?
And there's just a random reference to one of my compilers in there.
I should work on SEO, because if you actually Google that string,
Google says there are zero results except for this page, so clearly some work there.
But it got me thinking, if people are using these toolchains, are they any good?
Do people have any feedback?
So this is sort of an open call for anybody who's using RISC-V or OpenRISC, GCC or LLVM.
Tell me, what works? What sucks?
Do you wish you had binaries that ran natively on AArch64 or RISC-V?
We mostly use Newlib. Does PicoLibc of interest to people these days?
Does ZYX extension? Because that would be a standard one, not XYZ.
Does the compiler break? No, really, the compiler is definitely wrong.
Tell me why we've done something wrong.
Well, the community at large has done something wrong.
Or anything other interesting you're doing with the tools.
Basically, any feedback at all, because as I say, I throw these on the Internet,
and yeah, apparently people are using them.
So any feedback at all, throw them at this email, toolchains@embecosm.com.
I can't guarantee that I will respond to all of your emails,
but I can guarantee I'll read them.
So be nice.
Thank you.
[APPLAUSE]
Integrating with the website
This website is built using Jekyll, so some details are tied to its build system. The underlying concepts though are generic, and the same result can be achieved with any framework.
The end goal, as I mentioned in the introduction, is to show a transcript which remains in sync with a playing video, and selecting a line in the transcript jumps to that part of the video. In this case, the video is hosted on YouTube, and thankfully YouTube offers a solution to this exact problem.
The YouTube Player API lets us control an embedded YouTube player using JavaScript. I use this to run a timer every 250ms to check the current play time, which then highlights the matching line in the transcript. The inverse works similarly; each line gets its onclick handler set to seek the video to the appropriate time.
This script can be found in the scorg-utils repository. I have tried to comment it sufficiently to make it clear what is going on. The core “find the correct line” logic is shown below.3
// Search all transcript lines to find the one that should currently be
// active, give it the 'active' class, and if the active line has changed,
// trigger a scroll of the container. (transcriptLines, player, and
// scrollToLine are defined elsewhere in the full script.)
let lastActive = Infinity;
function syncTranscript() {
  const currentTime = player.getCurrentTime();
  transcriptLines.forEach((line, i) => {
    const start = parseFloat(line.dataset.start);
    const next = transcriptLines[i + 1] ?
      parseFloat(transcriptLines[i + 1].dataset.start) : Infinity;
    const isActive = currentTime >= start && currentTime < next;
    line.classList.toggle("active", isActive);
    if (isActive && i != lastActive) {
      scrollToLine(line);
      lastActive = i;
    }
  });
}
The transcript is added through a custom Jekyll tag which reads a named VTT file, and outputs it in the following format:
<div id="transcript-box">
<div class="transcript-line" data-start="0.0">
<p>Example line</p>
</div>
<!-- ... -->
</div>
This format is used for three reasons:
- By generating directly from the VTT file, there is one less manual step, and the VTT file becomes a single source of truth.
- By emitting real HTML rather than, for example, storing the data in a JavaScript variable, it is easier to statically search.
- HTML data attributes can be used to store the start timestamp, making it easier to track which line should be considered “active”.
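To illustrate what that tag does, here is an equivalent sketch in Python. The real implementation is a Jekyll tag (a Ruby plugin) in scorg-utils; the helper names below are mine, and data-start is emitted in seconds so the JavaScript’s parseFloat works directly:
def vtt_time_to_seconds(ts):
    """Convert a VTT timestamp (HH:MM:SS.mmm or MM:SS.mmm) to seconds."""
    parts = [float(p) for p in ts.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0.0)
    return parts[0] * 3600 + parts[1] * 60 + parts[2]

def vtt_to_transcript_html(vtt_path):
    """Render each VTT cue as a transcript-line div inside the transcript box."""
    with open(vtt_path) as f:
        lines = [l.strip() for l in f]
    html = ['<div id="transcript-box">']
    for i, line in enumerate(lines):
        if "-->" in line:
            start = vtt_time_to_seconds(line.split("-->")[0].strip())
            html.append(f'  <div class="transcript-line" data-start="{start}">')
            html.append(f"    <p>{lines[i + 1]}</p>")
            html.append("  </div>")
    html.append("</div>")
    return "\n".join(html)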
This process is then wrapped up in a single include, so on a page where I want to add a video and transcript I simply write:
{% include scorg-transcript.md ytvideo="eXWzsR6UX3M" transcript="orconf24" %}
And the full video player and transcript system is loaded.
Optional bonus step: Apply this as captions for the original video
As a final step, these transcripts can also be uploaded as captions for the original videos, making them more accessible to a wider audience.
I’m not sure what other video providers support in this regard, but YouTube provides native support for importing WebVTT files for captioning. Details on this can be found in this YouTube Help article.
Future steps
Now that I have this process in hand, there are two things in this area I want to tackle next.
The first is applying this to other presentations I’ve given. Currently I have 19 talks in the backlog for review; over time more of those will be added here with full transcripts.
The second is adding slide decks next to each video. Ideally these would behave the same way as the transcript: changing one would update the other in tandem. I’m not sure yet exactly how I’ll do that (though I have some rough ideas), but I look forward to tackling it soon.
I hope the above has been useful if you try to achieve something similar. If you have any thoughts or ideas on this process, I’d love to hear from you.
1. While the AI landscape continues to evolve rapidly, and there are likely newer and similarly impressive models, whisper remains a useful and reliable tool. ↩
2. Using WebVTT gives some flexibility in the future if I wish to actually use the transcript as captions on a playing video, for example if I’m hosting it myself. The documentation on MDN provides a good guide on how this could be done. ↩
3. In the interest of full transparency, I’d like to note that this code was originally generated with the help of ChatGPT. It has since been edited by me to better meet my requirements and add Jekyll integration. Since it’s not yet clear if/how to “unscramble the eggs” of AI and human co-written code, I can’t comment on how much of it falls under my own copyright. ↩