Converting audio to text from the command line using commercial and open source tools

Girl speaking in Mic, by Dall-E

For a while now, I have wanted to see what options exist today for converting speech to text at the command line. Plenty of commercial options exist, but I also wanted to check out open source ones.

In this experiment, I decided to compare two tools: spchcat, which is open source, and Google Cloud’s ML Speech API, which is commercial.

spchcat

spchcat describes itself as:

spchcat is a command-line tool that reads in audio from .WAV files, a microphone, or system audio inputs and converts any speech found into text.

It runs locally on your machine, with no web API calls or network activity, and is open source.

It is built on top of Coqui’s speech to text library, TensorFlow, KenLM, and data from Mozilla’s Common Voice project.

It seems to be written in C, and only has pre-built Debian packages for amd64 and arm (for Raspberry Pi).

Interestingly, it can convert audio from a microphone to text as a stream, which opens the door to creative real-time transcription applications.

Google Cloud’s ML Speech API

Google Cloud’s ML Speech API lets you use the gcloud CLI tool to transcribe audio from the command line. It has streaming abilities as well.

It is handy for anyone who wants to play around with speech-to-text without paying, since its free limits are generous enough for dabbling.

New customers get $300 in free credits to spend on Speech-to-Text. All customers get 60 minutes for transcribing and analyzing audio free per month, not charged against your credits.

Audio samples that I tried out

I used two different audio samples to compare the tools.

Sample one from Kaggle

The first one is a short (18s) audio sample from a Kaggle data set. They seem to be Harvard Sentences, used in speech quality measurements. [Src]

Actual transcript:

The stale smell of old beer lingers. It takes heat to bring out the odor. A cold dip restores health and zest. A salt pickle tastes fine with ham. Tacos Al Pastor are my favorite. A zestful food is the hot-cross bun.

Sample two from a speech by Gandhi

I also wanted to test how well these two tools do with non-native English accents. For this, I picked the audio of Mahatma Gandhi’s speech at Kingsley Hall in London, in October 1931. [wikimedia src]

Actual transcript of the first 60 seconds of the audio (reason explained later in the post):

There is an indefinable mysterious power that pervades everything. I feel it though I do not see it. It is this unseen power which makes itself felt and yet defies all proof, because it is so unlike all that I perceive through my senses. It transcends the senses. But it is possible to reason out the existence of God to a limited extent. Even in ordinary affairs we know that people do not know who rules, or why, and how He rules and yet they know that there is a power that certainly rules.

In my tour last year in Mysore I met many poor villagers and I found upon inquiry that they did not know who ruled Mysore. They [t58] simply said some [t60] God ruled it. …. (full text in the link below)

Mahatma Gandhi’s famous speech at Kingsley Hall in 1931

Prepping the audio samples

Neither app liked the MP3 audio that I had, whereas WAV files are widely supported, so I converted the audio using ffmpeg.

$ ffmpeg -i harvard.mp3 harvard.wav

But gcloud also refused to process stereo audio.

$ gcloud ml speech recognize harvard.wav  --language-code=en-US
ERROR: (gcloud.ml.speech.recognize) INVALID_ARGUMENT: Must use single channel (mono) audio, but WAV header indicates 2 channels.

So I had to convert the audio to mono by adding the -ac 1 parameter to ffmpeg:

$ ffmpeg -i harvard.mp3 -ac 1 harvard-mono.wav
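
A quick way to catch this before calling the API is to ask ffprobe, which ships with ffmpeg, how many channels a file has. The helper below is only a sketch: the name is_mono is made up for illustration, and it assumes ffprobe is on the PATH.

```shell
# Hypothetical helper: succeeds only if the first audio stream has one channel.
# Assumes ffprobe (installed alongside ffmpeg) is available.
is_mono() {
  channels=$(ffprobe -v error -select_streams a:0 \
    -show_entries stream=channels -of csv=p=0 "$1")
  [ "$channels" = "1" ]
}
# Example: is_mono harvard.wav || echo "stereo: convert with -ac 1 first"
```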

For the Gandhi speech, gcloud also asked for extra configuration for audio longer than 60 seconds.

$ gcloud ml speech recognize gandhi.wav    --language-code=en-US
ERROR: (gcloud.ml.speech.recognize) INVALID_ARGUMENT: Sync input too long. For audio longer than 1 min use LongRunningRecognize with a 'uri' parameter.
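
The error points at a second route: upload the file to a Cloud Storage bucket and use the long-running variant of the command. That route is not used in the rest of this post, so treat the following as an untested sketch. The function name transcribe_long and the bucket are placeholders, and it assumes the gcloud CLI is installed, authenticated, and allowed to write to a bucket you own.

```shell
# Hypothetical helper; the bucket name passed in is a placeholder.
# Assumes the gcloud CLI is installed and authenticated.
transcribe_long() {
  wav=$1
  bucket=$2
  # Audio longer than a minute has to be read from Cloud Storage,
  # so copy the file there first and then pass its gs:// URI.
  gcloud storage cp "$wav" "gs://$bucket/" &&
  gcloud ml speech recognize-long-running "gs://$bucket/$(basename "$wav")" \
    --language-code=en-US
}
# Example: transcribe_long gandhi.wav my-example-bucket
```

For a quick experiment, trimming the file needs less setup.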

So I truncated the audio to 60 seconds using ffmpeg.

$ ffmpeg -i gandhi.mp3 -t 60 -ac 1 gandhi_t60.wav

Weirdly, gcloud still complained about the length of the audio, so I finally truncated it to 58 seconds, and that worked.
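
Put together, the preparation steps above amount to one ffmpeg call per file. Here is a minimal sketch that uses only the flags already shown. The function name prep_wav is made up, and the 58-second default simply reflects what worked here rather than a documented limit.

```shell
# Hypothetical helper: convert any input ffmpeg can read to a mono WAV,
# keeping only the first N seconds (58 by default, the value that worked above).
prep_wav() {
  src=$1
  dst=$2
  secs=${3:-58}
  ffmpeg -i "$src" -t "$secs" -ac 1 "$dst"
}
# Example: prep_wav gandhi.mp3 gandhi_t58.wav
```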

Trying out the first sample

This was a short speech by a native English speaker, speaking kind of slowly and carefully, presumably for clarity.

spchcat handled most of it correctly, but completely truncated the last spoken sentence. I am not sure if this is a misconfiguration at my end or a problem with the software.

$ spchcat ./harvard.wav
TensorFlow: v2.3.0-14-g4bdd3955115
 Coqui STT: v1.1.0-0-gf3605e23
rate: rate clipped 13 samples; decrease volume?

the stale smell of old bear lingers it takes heat to bring out the odor a cold dip restores health and zest a salt pickled taste fine with ham tackles all pastor are my favorite existlessness

gcloud was on point, apart from minor slips such as “taste” for “tastes” and “Pastore” for “Pastor”.

$ gcloud ml speech recognize harvard-mono.wav   --language-code=en-US
{
  "requestId": "4451273191690268245",
  "results": [
    {
      "alternatives": [
        {
          "confidence": 0.9559944,
          "transcript": "the stale smell of old beer lingers it takes heat to bring out the odor a cold dip restores health and zest a salt pickle taste fine with ham tacos al Pastore are my favorite a zestful food is the hot cross bun"
        }
      ],
      "languageCode": "en-us",
      "resultEndTime": "18s"
    }
  ],
  "totalBilledTime": "18s"
}

Note that in the rest of this post, I will be using fold to wrap the long lines and jq to extract just the interesting text from the JSON.
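
As a sketch of that pipeline, wrapped in a function so it can be reused: the jq filter is the one used in the commands further down, while the function name transcript is just an illustrative choice. It assumes jq is installed.

```shell
# Hypothetical helper: read the JSON printed by `gcloud ml speech recognize`
# on stdin, keep only the transcript strings, and wrap long lines at spaces.
transcript() {
  jq '.results[].alternatives[].transcript' | fold -s
}
# Example:
#   gcloud ml speech recognize harvard-mono.wav --language-code=en-US | transcript
```

Without further options, jq prints each transcript as a quoted JSON string; its -r option prints the raw text instead.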

Trying out the second sample

spchcat started well, and again completely botched the end. I doubt that this is a problem with its speech recognition as such. If I had to guess, it probably has to do with stream buffering or sync issues.

$ spchcat ./gandhi_t60.wav | fold -s
TensorFlow: v2.3.0-14-g4bdd3955115
 Coqui STT: v1.1.0-0-gf3605e23
there is an indefinable mysterious power that pervades every thing i feel it
though i do not see it it is this unseen power which makes itself felt and yet
defies all proof because it is so unlike all that i perceive through my senses
it transcends the senses but it is possible to reason out the existence of god
to his relatedness know that people do not know who rose or hand how he does
and yet they know that there is a power let certainement very poor pieni found
upon inquiry that he did not know who ruled my sore the simple

gcloud also started reasonably well, but fudged it towards the end, dropping whole phrases (“Even in ordinary affairs”) and then mixing words up.

$ gcloud ml speech recognize gandhi_t58.wav --language-code=en-US \
  | fold -s
  "there is an indefinable mysterious power that pervades everything I feel it so
  I do not see it it is this unseen power which makes itself heard and yet
  devised all proof because it is so unlike all that I perceive through my senses
  it transcends the changes but it is possible to reason out the existence of God
  to a limited extent we know that people do not know who rules or why and how he
  rules and yet they know that there is a power that certainly who's in my tour
  last year in My Sword I made many poor villagers and I found upon inquiry that
  they did not know who ruled my soul"
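
To put a rough number on comparisons like these, a common metric is word error rate (WER): the word-level edit distance between a reference transcript and the tool's output, divided by the number of words in the reference. The sketch below is an illustrative awk helper, not something either tool provides. The function name wer and the file names are placeholders, and the normalization (lowercasing, treating punctuation as spaces) is a simplifying assumption.

```shell
# Hypothetical helper: word error rate of a hypothesis transcript against a
# reference transcript, both given as text files. Needs only awk.
wer() {
  awk '
    # Lowercase and turn everything except letters and digits into spaces, so
    # case and punctuation differences are not counted as errors.
    function norm(s) { s = tolower(s); gsub(/[^a-z0-9]+/, " ", s); return s }
    function min3(a, b, c,    m) {
      m = a; if (b < m) m = b; if (c < m) m = c; return m
    }
    BEGIN {
      # First argument: reference text. Second argument: tool output.
      while ((getline line < ARGV[1]) > 0) ref = ref " " norm(line)
      close(ARGV[1])
      while ((getline line < ARGV[2]) > 0) hyp = hyp " " norm(line)
      n = split(ref, r, " "); m = split(hyp, h, " ")
      if (n == 0) exit 1    # no reference words to compare against
      # Word-level edit distance (substitutions, insertions, deletions).
      for (j = 0; j <= m; j++) d[0, j] = j
      for (i = 1; i <= n; i++) {
        d[i, 0] = i
        for (j = 1; j <= m; j++) {
          cost = ((r[i] "") == (h[j] "")) ? 0 : 1
          d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cost)
        }
      }
      printf "%d errors / %d words = %.3f\n", d[n, m], n, d[n, m] / n
    }
  ' "$1" "$2"
}
# Example: wer reference.txt spchcat.txt
```

Lower is better, and 0 means every word matched. Since the two tools were fed clips of slightly different lengths here (60 and 58 seconds), the reference text should be trimmed to match each clip before comparing.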

I tried some additional parameters to see if they fix the problems with gcloud:

Punctuation in transcription

All the transcriptions so far are bereft of punctuation: sentences run into each other.

I tried the automatic punctuation feature in gcloud, together with the latest_long model, and it sort of worked on the second sample. On the shorter sample, despite the clear native speech, it didn’t do so well.

$ gcloud ml speech recognize harvard-mono.wav  --language-code=en-US \
  --model latest_long --enable-automatic-punctuation \
  | jq '.results[].alternatives[].transcript'| fold -s
"the stale smell of old beer lingers"
" it takes heat to bring out the odor."
" A cold dip restores health and zest a salt pickle tastes fine with ham tacos
al pastor are my favorite a zestful food is the hot cross bun."


$ gcloud ml speech recognize gandhi_t58.wav --language-code=en-US \
  --model latest_long --enable-automatic-punctuation \
  | jq '.results[].alternatives[].transcript'| fold -s
"There is an indefinable mysterious power that pervades everything. I feel it
though. I do not see it it is this unseen power which makes itself felt and yet
defies all proof because it is so unlike all that. I perceive through my
senses. It transcends the senses but it is possible to reason out the existence
of God to a limited extent even in ordinary Affairs. We know that people do not
know who rules or why and how he rules and yet they know that there is a power
that certainly rules in my tour last year in my soul. I met many poor villagers
and I found upon inquiry that they did not know who ruled my soul."

Conclusion

In my two tests, gcloud was clearly more accurate than spchcat. But two audio samples are far too small a data set to draw conclusions about these tools. Accuracy would likely improve with settings matched to the speaker’s accent, and there are many other tunables in both tools that I haven’t touched yet.

But given the cost factor of gcloud, playing around a lot for free with transcription is probably only feasible with spchcat, so we should definitely not rule it out.

This is just my initial foray into this domain. I might revisit these samples after I understand some of the underlying theory better.
