Convert A Slideshow/Presentation into HTML 5 Video

Sunday, July 23, 2023

A seemingly common task would be to convert a talk or presentation given as a slide show into a playable video using a standards compliant format like WebM, which plays in almost all HTML 5 compliant browsers, that is just about everything but Safari on iOS (because Apple are walled garden asses) IE (because derp duh) and Opera Mini (because … well maybe too much work).

These days, WebM supports the pretty fantastic encoder AV1, which is the new thing that is genuinely open source and everyone should use for everything for this reason. It’s also absurdly slow to encode to, so maybe not everything, but it yields better results than VP9, which is also pretty decent. Compared to the H.series encoders (263/4/5/6), my results put VP9 between 4 and 5, depending on content and AV1 between 5 and 6, which is pretty solid and there aren’t too many tradeoffs with h/x.265 for most applications and you get FOSS freedom. Hopefully there will be much better hardware encoder support soon, though I’m sure we’re not gonna get that in iOS anytime soon. H.264 remains a pretty solid format for most use as it has broad and well optimized hardware compressors and decompressors. H.265 seems pretty well entrenched as well, but H.266 has been slow to pick up and very unscientifically I see AV1 as likely to undermine the H.266 adoption, though we’re still a long way from the necessary hardware for mobile/IoT devices to be confident of such an outcome yet.

I spent some time messing around with the process and testing various parameters. The fruit of that labor is as follows.

Export your slides into some useful format

The first step is exporting slides into a useful format like PNG. Assuming you have or can convert a presentation to PDF, then you can extract all the pages to individual files with pdfseparate, a command line tool that comes with poppler (which you’ll have because you run Inkscape, right? You should).

pdfseparate document.pdf %d.pdf

Then convert them to .png using inkscape like:

find . -type f -name "*.pdf" -exec inkscape "{}" --export-type=png --pdf-poppler -w 1920 -h 1440 -o "{}.png"  \;

Set the width and height to meet your specific needs. Maybe 1920×1080? 1536×2160?

Extract or convert your audio file to wmv

This tutorial uses whisper to convert the audio file to text, which also makes it easy(ier) to get slide timings to about a second from the .vtt file whisper generates. The first step is converting the audio file to .wmv so whisper can eat it. I used ffmpeg (you need a recent version for AV1 support later so depending on distro, you might need to build it) to do this like

ffmpeg -hide_banner -i videofile.mkv -ac 1 audio.wav

Note that I am forcing mono audio with -ac 1 (and beware out of phase audio can cancel). I would expect most talks to not rely on stereo effects, but YMMV. Once you have this file and you’ve installed whisper (surprisingly easy), it will convert your long audio ramblings into equally (in)coherent text (the accuracy is surprisingly good) with a simple:

whisper audio.wav

At this point you should have a folder full of .png slides, an audio file of your talk in .wav, and a variety of text files in various text formats including a .vtt file that has time stamps in MM:SS.sss format and text and it is time for the first tedious bit. You need the from start HH:MM:SS.sss timing for each slide, plus the slide-to-slide timing in seconds (SSS), and it is helpful to have the from start seconds (SSSSSS) timing to verify key frame placement using ffprobe. Since Whisper’s .vtt file is only integer seconds, this isn’t going to be fractional second accurate timing, but good enough for slide transitions in a talk.

You could put this in a spreadsheet to automate the conversions from the .vtt’s format to the needed formats. I did it manually like this:

file 'QuantumComputingSlides1.png'
duration 00:28 28 28
file 'QuantumComputingSlides2.png'
duration 00:49 49 21
file 'QuantumComputingSlides3.png'
duration 00:55 55 6
file 'QuantumComputingSlides4.png'
duration 02:04 124 69
file 'QuantumComputingSlides5.png'
duration 03:49 229 105

You have to convert all those time stamps into absolute seconds. I haven’t automated this yet, but it should be doable with a little regexp and bash or python if you have a lot of slides. If I do this with a talk that has enough slides to justify the effort, I’ll post a converter script.

Convert the Stills to Video Clips

This part assumes no fancy transitions, just jump cuts and that each slide is meant to be on-screen for some amount of time, the audio droning on over it and, perhaps, subtitles below it. This isn’t high art, mind you, just a utilitarian conversion for web viewing. The first step is to determine the slide transitions points, which you’ll need in absolute time format and in seconds/slide format, data which should be pretty easy to access from the speech-to-text .vtt file:

WEBVTT

00:00.000 --> 00:05.000
Okay, today's talk is going to be about quantum cryptography and quantum computing.
(snip)
00:46.000 --> 00:49.000
So let's get right into it.

Slide one’s duration is 49 seconds and slide two should be shown at 00:00:49.000; collect this data for all the slides. Next, make text file for each slide that includes the slide file name (the .png file created earlier) and the duration (seconds on screen, not the absolute time). The files should look like this:

file 'QuantumComputingSlides3.png'
duration 6
file 'QuantumComputingSlides3.png'

Slide 3 will be shown for 6 seconds. The file name appears twice around the duration because of some quirk in ffmpeg’s concat function. You should now have in your folder each slide as a .png file and for each slide a .txt file that describes the duration in file system sortable order (slide01.txt, slide02.txt etc).

I put the video encoding command in a bash script to make it a bit easier:

#!/bin/bash
# AV1 compression single pass
for i in *.txt
    do name=`echo "$i" | cut -d'.' -f1`
    ffmpeg -hide_banner -f concat -i "$i" -y -vf fps=10 -c:v libsvtav1 -pix_fmt yuv420p10le -preset 3 -svtav1-params tune=0:color-range=1:keyint=60000:scm=1 -b:v 0 -crf 40 -an "${name}.webm"
done

what the settings mean:

-vf fps=10           set the output video frame rate to 10fps.   Why? Because timing works out. 5 might work too.
-c:v libsvtav1       use the intel encoder, it is like 10x faster than libaom.
-pix-fmt yuv420p10le use 10 bit encoding which makes gradients and dark areas better at a small cost, might as well.
-preset 3            this determines encoding effort.  3 was manageable. 2 took a loong time. YMMV
-stvav1-params       this passes parameters through ffmpeg to the CODEC
  tune=0             tuned for content rather than PSNR tests
  color-range=1      full (computer) color rather than studio color
  keyint=60000       don't put in extra keyframes at all, just starting I then all P frames
  scm=1              peeps say 1 is good for digital graphics and maybe animation, 0 is default for live action
-b:v 0               don't limit bandwidth (quality control only)
-crf 40              this is a very low target quality because there's no motion to worry about
-an                  no audio (for now)

Note that since iOS devices can’t do AV1 yet, it might be preferable to use the less efficient VP9 either as the sole version or as a fallback. This can be done with:

#!/bin/bash 
# VP9 compression single pass 
for i in *.txt 
   do name=`echo "$i" | cut -d'.' -f1` 
   ffmpeg -hide_banner -f concat -i "$i" -y -vf fps=10 -c:v libvpx-vp9 -pix_fmt yuv420p10le -deadline best -cpu-used 0 -b:v 0 -crf 40 -g 60000 -an "${name}.webm" 
done

The options mean

-deadline best       means to use the highest quality, slowest encoding
-cpu-used 0          default is 0, but never hurts to be sure, best quality
-g 60000             libvpx-vp9 uses the ffmpeg "g" to set keyframe intervals
-lossless 1          seems like a good thing for slides, but yields quite large files

save and chmod+x and then execute

UPDATE

I had trouble with ffmpeg slide timing with the concat command, something I am apparently not alone in. I rewrote the conversion shell script to be a lot more robust, this reads a tab-delimited text file called files_to_encode.txt in a format of

file_one.png\t3.12
file_two.jpg\t25.344
...

and then executes ffmpeg to convert the images to single keyframe (and no intermediate frame) video files with one frame duration accuracy of the duration values – that is fractional values are allowed. The default of 20fps should be within 0.05 seconds of the target length.

#!/bin/bash

# Base directory - change this to your absolute path
BASE_DIR="/home/gessel/Work/Slocumisms/IPHROS/antigua/Short_Presentation/pages"

# Add logging
LOG_FILE="$BASE_DIR/encoding_log.txt"
PROCESSED_FILES="$BASE_DIR/processed_files.txt"

# Initialize log file with timestamp
echo "=== Encoding session started at $(date) ===" > "$LOG_FILE"

# Change to base directory
cd "$BASE_DIR" || {
    echo "Error: Cannot change to base directory $BASE_DIR" | tee -a "$LOG_FILE"
    exit 1
}

# Function to convert seconds to HH:MM:SS.msec format
convert_to_hms() {
    local total_seconds=$1
    local seconds=${total_seconds%.*}
    local msec=${total_seconds#*.}
    msec=$(printf "%-3s" $msec)
    msec=${msec// /0}
    local hours=$((seconds / 3600))
    local minutes=$(((seconds % 3600) / 60))
    local secs=$((seconds % 60))
    printf "%02d:%02d:%02d.%s" $hours $minutes $secs $msec
}

# First, ensure the input file is in Unix format
dos2unix -n "$BASE_DIR/files_to_encode.txt" "$BASE_DIR/files_to_encode.unix.txt"

# Read all lines into an array
mapfile -t lines < "$BASE_DIR/files_to_encode.unix.txt"

# Process each line
for ((i=0; i<${#lines[@]}; i++)); do
    line_number=$((i + 1))
    line="${lines[$i]}"
    
    # Debug logging for raw line
    echo "DEBUG: Line $line_number raw: '$line'" >> "$LOG_FILE"
    echo "DEBUG: Hex dump of line:" >> "$LOG_FILE"
    echo -n "$line" | xxd >> "$LOG_FILE"
    
    # Skip empty lines
    if [ -z "$line" ]; then
        echo "Line $line_number: Empty line, skipping" | tee -a "$LOG_FILE"
        continue
    fi
    
    # Split the line using parameter expansion
    filename="${line%%$'\t'*}"
    duration="${line#*$'\t'}"
    
    # Debug logging after split
    echo "DEBUG: After split:" >> "$LOG_FILE"
    echo "  Filename: '$filename'" >> "$LOG_FILE"
    echo "  Duration: '$duration'" >> "$LOG_FILE"
    
    # Validate input file exists
    if [ ! -f "$filename" ]; then
        echo "Line $line_number: Input file '$filename' not found" | tee -a "$LOG_FILE"
        continue
    fi
    
    # Extract base filename without extension
    base_filename="${filename%.*}"
    
    # Check if output file already exists
    if [ -f "${base_filename}.webm" ]; then
        echo "Line $line_number: Skipping ${filename} - output file already exists" | tee -a "$LOG_FILE"
        continue
    fi
    
    # Validate duration format
    if ! [[ "$duration" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
        echo "Line $line_number: Invalid duration format '$duration'" | tee -a "$LOG_FILE"
        continue
    fi
    
    # Convert duration to HH:MM:SS.msec format
    duration_hms=$(convert_to_hms "$duration")
    
    echo "Line $line_number: Processing $filename with duration $duration_hms" | tee -a "$LOG_FILE"
    
    # Execute ffmpeg command with reduced verbosity
    if ffmpeg -hide_banner -loglevel error -loop 1 -framerate 1/20 -i "$filename" \
        -vf fps=20 -c:v libvpx-vp9 -pix_fmt yuv420p10le -deadline best -cpu-used 0 \
        -b:v 0 -crf 40 -g 60000 -ss 00:00:00.000 -t "$duration_hms" -an \
        "${base_filename}.webm" 2>> "$LOG_FILE"; then
        
        echo "Line $line_number: Successfully encoded $filename" | tee -a "$LOG_FILE"
        echo "$filename" >> "$PROCESSED_FILES"
    else
        echo "Line $line_number: Error encoding $filename" | tee -a "$LOG_FILE"
    fi
    
    sleep 1
done

# Clean up temporary file
rm "$BASE_DIR/files_to_encode.unix.txt"

echo "=== Encoding session completed at $(date) ===" >> "$LOG_FILE"

I also learned something new as my latest project was recorded from text, rather than converted from a live talk, that recording each slide’s audio text as an individual track in Audacity and then make sure the extras menu is enabled and use Extra->Scriptables II -> Get Info… and set type to Tracks and Format to Brief to get the slide timing info in start seconds/stop seconds. Use calc to subtract the per track start time from the end time to get duration.

And a practical hint, normalizing audio tracks to -3dB is a pretty standard expectation, but I find that the perceived loudness is somewhat random even if the peak of each track is -3dB. Using Effect -> Volume and Compression -> Loudness Normalization and setting perceived loudness to somewhere between -16 and -22 LUFS gives much more consistent perceived loudness (who’da’thunk?). I recommend starting at about -16 LUFS and then checking for any clipping and then decreasing by steps of -1 or -2 LUFS until there’s no clipping indicated in the waveform (indicated in red in the normal interface). This seems to result in more consistent audio tracks than normalizing to peak amplitude.

Once this finishes (and it will be a while with AV1) there will be a video file of the right number of seconds for each .png file. For my slides, the video clips are about 60-80% the size of the original png slides, because AV1 is much more efficient than .png even for still compression (like webp, based on vp9, which AV1 is the successor to).

The next step is to concatenate all the slide videos into a single video stream. First we create video list file from the folder of webm files like:

for f in *.webm; do echo "file '$f'" >> vidlist.txt; done

then we use ffmpeg again to merge the list into a single file like:

ffmpeg -hide_banner -f concat -safe 0 -i vidlist.txt -c copy slideshow.webm

what the command means:

-c copy              Video streams are direct copied, no re-compression.

To verify the stream parameters you can use

ffprobe -hide_banner -select_streams v -show_entries frame=pict_type,pts_time -of csv=p=0 -i slideshow.webm | grep -v P

This should show a key frame at the cumulative seconds count (not H:M:S.MS format) for each slide change and no others (assuming there’s < 100 minutes per slide). Note this makes seeking really slow (REALLY slow) like 5 seconds to jump but to each slide is close to instant. You could use a standard value like “150” for the keyint meaning a keyframe every 15 seconds to speed up searching but at the cost of a lot of filesize.

The original png files were 9.3 MiB, the AV1 video conversion is 2.9MiB and the VP9 conversion is 3.1MiB. For a slide show, I’d argue that AV1 isn’t likely to be worth the extended encode time and compatibility issues, but YMMV and it is worth doing tests as results are very content dependent.

At this point you should have a video file without any audio and still have your .wav file plus your timing file. Next we’re going to add (back) the audio.

Adding Audio Back

This is a fairly straight forward, but we have to compress using an allowed codec for webm. I also keep the single channel and use a moderate data rate for speech.

ffmpeg -hide_banner -i slideshow.webm -i audio.wav -map 0 -map 1 -c:v copy -c:a libopus -b:a 48k -ac 1 presentation.webm

What the parameters mean:

-c:a libopus         Use libopus, an allowed audio codec in webm
-b:a 48k             Compress at 48kbps, this is quite good for speech 
-ac 1                One audio track.  If you're doing stereo, then default is fine

Now you have an audio video file, synced audio and slides and should be quite compact whether in VP9 or AV1, but it can be nice to add some metainformation including subtitles and chapter headings.

Adding subtitles back.

I suggest giving the whisper produced .vtt file at least a cursory edit. It is quite good, but can have trouble with homophones, which is understandable, especially with technical jargon. Once you’re happy with the text, you can merge the subs back into the webm container, tag the audio stream and subs with languages using:

ffmpeg -hide_banner -i presentation.webm -i audio.vtt -map 0:v -map 0:a -map 1:s  \
-metadata:s:a language=eng -metadata:s:s:0 language=eng  -c copy -y preso-sub.webm

what the parameters mean:

-map 0:v                     use the video from index 0 (first input)
-map 0:a                     use the audio from index 0 (first input)
-map 1:s                     use subtitles from index 1 (second input)
-metadata:s:a language=eng   the audio is english (or pick your lang)
-metadata:s:s:0 language=eng the subs are english (or pick your lang)

Now your video file has subtitles and these should be selectable in VLC player

Add metadata and chapters with MKVToolNix

Chapter data and additional metadata is our first foray away from ffmpeg to another open source tool called MKVToolNix. Chapter data is easiest (and most reusable) by creating a chapter file using the slide timing data collected previous, but this time in absolute time in HH:MM:SS.sss so it looks like this (with the blank chapter at the out time of the whole video):

CHAPTER01=00:00:00.000
CHAPTER01NAME="Quantum Computing and Cryptography"
CHAPTER02=00:00:28.000
CHAPTER02NAME="1.0 Topics to be Covered"
...
CHAPTER22=00:50:07.000
CHAPTER22NAME="6.0 Practical Implementations"
CHAPTER23=00:51:44.000
CHAPTER23NAME="7.0 Conclusion: Gessel's Law P=2^2^(Y/2) rev: P=2^2^(Y/3.4)"
CHAPTER24=00:54:50.000
CHAPTER24NAME=""

You open this chapters.txt file in the Chapter Editor tab of MKVToolNix and then right click and select “additional modifications” then select the language. Finally, save from the “Chapter editor” menu (top of screen) and select “Save to Matroska or WebM file” and confirm that you’re going to overwrite the no-chapters version (with the addition of the chapter data).

The last tidbit is to add some moderately useful metainformation, at least title and possible date (if relevant). Title, at least, is what’s used as the VLC title and possibly in other places. The MKVToolNix header editor tab will do what’s needed. You want to edit the “Segment information” – I’m not sure where the track information titles show up, so I don’t bother with them, but no harm in editing those either. Then just save with CTRL-S to update your WebM video with the additional metadata.

That’s it, you should now have a well-formatted, searchable, indexible video that will play directly from your own web server without relying on plugins or gifting your data to services like youtube or tiktok or whatever data harvesting service is luring the unwitting to data slaughter with Judas goats of convenience.

Gessel On…