Speech is generally treated as a special class of audio where compression quality is rated more on intelligibility than on fidelity. The two are related, but the former may be optimized at the expense of the latter to achieve very low data rates. A few codecs have emerged as particularly adept at this specific class: Speex, Opus, and the latest, Google’s Lyra, a deep-learning-enhanced codec.
Lyra is focused on Android, requires a bunch of Java cruft to build, and needs debugging. It didn’t seem worth the effort, but I appreciate the deep-learning-based compression; it looks like the most efficient approach of the three.
I couldn’t find a quick whatcha-need-to-know kind of summary of the codecs, so maybe this is useful:
Opus
On Ubuntu (and most Linux distros) you can install the Opus codec and supporting tools with a simple
# sudo apt install opus-tools
If you have ffmpeg installed, it provides a framework for dealing with IO and driving libopus from the command line like:
# ffmpeg -i infile.mp3 -codec:a libopus -b:a 8k -cutoff 8000 outfile.opus
Aside from infile.(format) and outfile.opus, there are two command line options worth messing with to get good results: -b:a (bit rate) and -cutoff (frequency), which must be 4000 (narrowband), 6000 (mediumband), 8000 (wideband), 12000 (super wideband), or 20000 (fullband). The two parameters work together: for speech, limiting the bandwidth leaves more bits for the frequencies that matter.
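Since -cutoff only accepts those five values, a small guard can catch typos before a batch of encodes. This is a minimal sketch: opus_band is a hypothetical helper name of my own, not something shipped with ffmpeg or opus-tools.

```shell
# Map a libopus -cutoff value (Hz) to its band name; fail on anything else.
# opus_band is a hypothetical helper, not part of ffmpeg or opus-tools.
opus_band() {
  case "$1" in
    4000)  echo "narrowband" ;;
    6000)  echo "mediumband" ;;
    8000)  echo "wideband" ;;
    12000) echo "super wideband" ;;
    20000) echo "fullband" ;;
    *)     echo "unsupported cutoff: $1" >&2; return 1 ;;
  esac
}

opus_band 8000    # prints: wideband
```

Anything outside the list returns a non-zero status, so it can gate an encode with `opus_band "$cutoff" >/dev/null && ffmpeg ...`.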
There are various research papers on the significance of frequency components in speech intelligibility, with useful upper limits ranging from about 4 kHz to about 8 kHz (and “sometimes higher”). I’d argue useful cutoffs are 6000 and 8000 for most applications. The fewer frequency components fed into the encoder, the more bits per second remain to encode the residual. There will be an optimum that gives the best subjective intelligibility for a given average bit rate, and it has to be determined empirically for the recording quality, the speaker’s voice, and the transmission requirements.
In my tests, with my sample and the voice I had to work with, limiting the bandwidth to 8 kHz made little perceptible difference to the quality of speech. 6 kbps VBR (-b:a 6k) compromised intelligibility, 8k did not, and 24k was not perceptibly different from the source.
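To repeat this kind of listening test on your own sample, it helps to generate the whole grid of encodes in one go. The sketch below only prints the commands (a dry run) for the bit rates and cutoffs mentioned above; opus_sweep and the output file naming are my own assumptions, and file names containing spaces would need extra quoting.

```shell
# Dry run: print one ffmpeg command per bit rate / cutoff pair.
# opus_sweep and the out_<rate>_<cutoff>.opus naming are illustrative only.
opus_sweep() {
  infile=$1
  for br in 6k 8k 24k; do
    for cutoff in 6000 8000; do
      echo "ffmpeg -i $infile -codec:a libopus -b:a $br -cutoff $cutoff out_${br}_${cutoff}.opus"
    done
  done
}

opus_sweep infile.mp3
# To actually encode, pipe the output to a shell: opus_sweep infile.mp3 | sh
```

Printing first makes it easy to eyeball the six commands before committing to the encodes.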
One last option to consider is -application, which yields subtle differences in encoding results. The choices are voip, which optimizes for speech; audio (the default), which optimizes for fidelity; and lowdelay, which minimizes latency for interactive applications.
# ffmpeg -i infile.mp3 -codec:a libopus -b:a 8k -application voip -cutoff 8000 outfile.opus
VLC player can play .opus files.
Speex
AFAIK, Speex isn’t callable from a stock ffmpeg build (encoding needs ffmpeg compiled with libspeex enabled), but the speex package has a tool, speexenc, that does the job.
# sudo apt install speex
Speexenc only eats raw and .wav files, the latter somewhat more easily managed. To convert an arbitrary input to wav, ffmpeg is your friend:
# ffmpeg -i infile.mp3 -f wav -bitexact -acodec pcm_s16le -ar 8000 -ac 1 wavfile.wav
Note the -ar 8000 option. This sets the sample rate to 8000 Hz. Speexenc will yield unexpected output data rates unless the sample rate is 8000, 16000, or 32000, and these should correlate to the speexenc bandwidth option used in the compression step (speexenc doesn’t resample to match): -n “narrowband,” -w “wideband,” and -u “ultra-wideband.”
# speexenc -n --quality 3 --vbr --comp 10 wavfile.wav outfile.spx
This sets the bandwidth to “narrow” (matching the 8 kHz input sample rate), the quality to 3 (see table for data rates), enables VBR (not enabled by default with speex, but it is with Opus), and the “complexity” to 10 (speex defaults to 3 for faster encode, Opus defaults to 10), thus giving a fairly head-to-head comparison with the default Opus settings.
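Because speexenc won’t resample, the -ar value and the bandwidth flag have to be kept in step by hand. A minimal sketch of that pairing, assuming a hypothetical helper named speex_band_flag (my own name, not part of the speex package); it prints the two commands rather than running them:

```shell
# Pick the speexenc bandwidth flag that matches a WAV sample rate (Hz).
# speex_band_flag is a hypothetical helper, not part of the speex package.
speex_band_flag() {
  case "$1" in
    8000)  printf '%s\n' "-n" ;;   # narrowband
    16000) printf '%s\n' "-w" ;;   # wideband
    32000) printf '%s\n' "-u" ;;   # ultra-wideband
    *)     echo "unsupported sample rate: $1" >&2; return 1 ;;
  esac
}

rate=16000
flag=$(speex_band_flag "$rate")
# Print the matching pair of commands (dry run).
echo "ffmpeg -i infile.mp3 -f wav -bitexact -acodec pcm_s16le -ar $rate -ac 1 wavfile.wav"
echo "speexenc $flag --quality 3 --vbr --comp 10 wavfile.wav outfile.spx"
```

printf is used instead of echo because `echo -n` would be swallowed as an option and print nothing.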
VLC can also play speex .spx files. yay VLC.
Results
The result is an 8 kbps stream which is, to my ear, more intelligible than Opus at 8 kbps: not 😮 better, but 😐 better. This is atypical; I expected Opus to be obviously better, and it wasn’t for this sample. I didn’t carefully evaluate the -application voip option, which would likely tip the results. Clearly YMMV, so experiment.
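Both encodes here ran in VBR mode, so the nominal 8 kbps is a target and the files may not land exactly on it. When comparing by ear it can be worth checking the average rate each file actually came out at. A rough sketch, assuming you already know the clip duration in seconds; avg_kbps is a hypothetical helper and container overhead is included in the figure:

```shell
# Average bit rate in kbps from a file size in bytes and a duration in seconds.
# avg_kbps is a hypothetical helper; the result includes container overhead.
avg_kbps() {
  awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", bytes * 8 / secs / 1000 }'
}

avg_kbps 60000 60    # a 60000-byte file lasting 60 s
# With a real file: avg_kbps "$(wc -c < outfile.spx)" 60
```

If the Opus and Speex files differ noticeably in average rate, the comparison is not quite like for like.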
Speex quality vs. bandwidth: bit rates (kbps)
--quality | -n | -w | -u |
---|---|---|---|
3 | 8 | 9.8 | 11.6 |
4 | 8 | 12.8 | 14.6 |
5 | 11 | 16.8 | 18.6 |
6 | 11 | 20.6 | 22.4 |
7 | 15 | 23.8 | 25.6 |
8 | 15 | 27.8 | 29.6 |
9 | 18.2 | 34.2 | 36 |
10 | 24.6 | 42.2 | 44 |