Speech Recognition and Text-to-Speech using ML. Part 2: going offline

In the first part we set up a very simple but effective voice assistant that listens to your commands and reacts to them. It starts up as soon as the internet connection and audio devices appear in the system. And yes, that is the limitation of our very first implementation - it needs the internet, because it relies on Google's TTS and STT engines.

This time we'll try to replace them with something that can run 100% offline. I have a couple of candidates, apart from the ones we talked about before.

Mozilla DeepSpeech

The most recent version of DeepSpeech right now is v0.9.3, and it requires Python 3.7, whereas we're running 3.9 on our Debian. I found that out when I attempted to install deepspeech-0.9.3-cp37-cp37m-linux_aarch64.whl and got this confusing message:

pi@orangepi4-lts:~/projects $ pip3 install https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-cp37-cp37m-linux_aarch64.whl
ERROR: deepspeech-0.9.3-cp37-cp37m-linux_aarch64.whl is not a supported wheel on this platform.

As I figured out, I should have checked the --verbose flag right away: then it becomes more apparent that pip was trying to install things under the 3.7 directories, while the current version of Python was 3.9. Python packaging is such a mess.

As the Debian repo doesn't have 3.7, I decided to build it myself. It's pretty straightforward: download -> configure -> make -> make install. I just cross-checked the very first hit I found on Google, and it went just fine.
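
For reference, the whole thing boils down to something like this - a sketch only, assuming the latest 3.7 release (3.7.17 here) and the default /usr/local prefix; I'd go with make altinstall rather than make install, so the system python3 doesn't get overwritten:

# build dependencies for CPython (package names assumed for Debian)
sudo apt-get install build-essential libssl-dev zlib1g-dev libffi-dev libbz2-dev libreadline-dev libsqlite3-dev

# download -> configure -> make -> make altinstall
wget https://www.python.org/ftp/python/3.7.17/Python-3.7.17.tgz
tar xzf Python-3.7.17.tgz
cd Python-3.7.17
./configure --enable-optimizations
make -j$(nproc)
sudo make altinstall    # installs python3.7 alongside the system python, not over it

# with a matching interpreter, the cp37 wheel installs fine
python3.7 -m pip install https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-cp37-cp37m-linux_aarch64.whl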

To be continued

https://deepspeech.readthedocs.io/en/v0.9.3/

coqui-ai TTS and STT

https://github.com/coqui-ai

https://tts.readthedocs.io/en/latest/

Coqui-Ai TTS - Text to speech

I was playing with TTS (text-to-speech) on my Intel Atom Z8350 tablet - an Asus Transformer Book T102HA - as I'm on vacation right now, away from my little SBC at home. It was almost a smooth ride. I say "almost" because on Windows 10 it took me a while to figure out two main dependencies:

1) you need Python 3.9, not 3.10

2) you need to install the Windows 10 SDK and build tools via the "Visual Studio Installer". That thing drops around 5 extra GB of various DLLs, header files and tools onto your Windows 10. Kinda crazy.
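
To keep the right interpreter in play, it's easiest to pin everything to a 3.9 virtual environment. A minimal sketch, assuming Python 3.9 is reachable through the Windows py launcher (the environment name is just an example):

py -3.9 -m venv tts-env
tts-env\Scripts\activate
python -m pip install --upgrade pip setuptools wheel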

Eventually you can do:

pip3 install tts

And check this out:

tts --text "This is another test using tacotron2"          --out_path speech_tacotron2.wav           --model_name "tts_models/en/ek1/tacotron2"

tts --text "This is another test using tacotron2-DDC"      --out_path speech_tacotron2-DDC.wav       --model_name "tts_models/en/ljspeech/tacotron2-DDC"

tts --text "This is another test using tacotron2-DDC_ph"   --out_path speech_tacotron2-DDC_ph.wav    --model_name "tts_models/en/ljspeech/tacotron2-DDC_ph"

tts --text "This is another test using glow-tts"           --out_path speech_glow.wav                --model_name "tts_models/en/ljspeech/glow-tts"

tts --text "This is another test using speedy-speech"      --out_path speech_speedy.wav              --model_name "tts_models/en/ljspeech/speedy-speech"

tts --text "This is another test using tacotron2-DCA"      --out_path speech_tacotron2.wav           --model_name "tts_models/en/ljspeech/tacotron2-DCA"

tts --text "This is another test using vits"               --out_path speech_vits.wav                --model_name "tts_models/en/ljspeech/vits"

tts --text "This is another test using fast_pitch"         --out_path speech_fast_.wav               --model_name "tts_models/en/ljspeech/fast_pitch"

tts --text "This is another test using vits"               --out_path speech_vits.wav                --model_name "tts_models/en/vctk/vits"

tts --text "This is another test using fast_pitch"         --out_path speech_fast_.wav               --model_name "tts_models/en/vctk/fast_pitch"

tts --text "This is another test using tacotron-DDC"       --out_path speech_tacotron.wav            --model_name "tts_models/en/sam/tacotron-DDC"

tts --text "This is another test using capacitron-t2-c50"  --out_path speech_capacitron-t2-c50.wav   --model_name "tts_models/en/blizzard2013/capacitron-t2-c50"

tts --text "This is another test using capacitron-t2-c150" --out_path speech_capacitron-t2-c150.wav  --model_name "tts_models/en/blizzard2013/capacitron-t2-c150"

This will download a number of different voice and vocoder models and synthesize the phrase given as the --text parameter with each of them. If something fails for you, check the output of the tts --list_models command first and adjust the model names above; the multi-speaker vctk models may also need a --speaker_idx argument.
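
By default each model is paired with its default vocoder, but you can also combine them explicitly via --vocoder_name. A minimal sketch, assuming the model and vocoder names below are still present in the registry (verify with --list_models):

tts --list_models

tts --text "Glow-TTS with an explicit vocoder" --out_path speech_glow_mbmelgan.wav --model_name "tts_models/en/ljspeech/glow-tts" --vocoder_name "vocoder_models/en/ljspeech/multiband-melgan"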

You can then go and listen to the results using your favorite media player, or play them from the command line with something like SoX's play or aplay.

Since I was getting very unstable performance figures - i.e. how long it takes to generate the same sentence with different combinations of models/vocoders - I drafted a very simple script that tries them all out multiple times, so I'd have more data to compare: https://github.com/kha84/tts-comparison
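
The idea boils down to a loop like this (a rough sketch, not the actual script from the repo; the model list, sentence and run count are arbitrary, and GNU time is assumed):

# run each model a few times on the same sentence and log wall-clock times
TEXT="The quick brown fox jumps over the lazy dog"
MODELS="tts_models/en/ljspeech/glow-tts tts_models/en/ljspeech/tacotron2-DDC tts_models/en/ljspeech/speedy-speech"

for model in $MODELS; do
  for run in 1 2 3; do
    /usr/bin/time -f "$model run $run: %e s" -a -o bench_results.txt \
      tts --text "$TEXT" --model_name "$model" --out_path /tmp/bench.wav >/dev/null 2>&1
  done
done

cat bench_results.txt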

I figured out that, for my hardware, the best balance between performance and pronunciation accuracy was tts_models/en/ljspeech/glow-tts.

TensorFlowTTS

https://github.com/TensorSpeech/TensorFlowTTS


AprilASR STT

https://github.com/abb128/april-asr

Used in Live Captions, a wonderful GNOME application that generates English captions from whatever audio is playing

OpenAI Whisper STT

https://github.com/openai/whisper

Whisper.cpp STT

A re-implementation of OpenAI's Whisper in C/C++, which is much lighter on resources

https://github.com/ggerganov/whisper.cpp
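
Getting it to transcribe something is roughly this (a sketch following the project README at the time of writing; the base.en model and the bundled sample are just examples):

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
bash ./models/download-ggml-model.sh base.en   # fetch a small English-only model
make                                           # build the example binaries
./main -m models/ggml-base.en.bin -f samples/jfk.wav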

Mimic3 TTS

https://github.com/MycroftAI/mimic3

MaryTTS

http://mary.dfki.de/
