Speech Recognition (STT) and Voice Synthesis (TTS)

Overview

To enable your NLP to process audio produced by your users, and to respond in a natural voice, the platform leverages third-party vendor Speech-to-Text (STT) and Text-to-Speech (TTS) functionality, respectively.

An overview of the data flow can be found in the NLP Orchestration Implementation section.

Audio utterances from the end user are transmitted via WebRTC to the chosen third-party STT API for conversion to text. The resulting text is then passed to the NLP as the end-user query and processed.
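The round trip above can be sketched as a simple pipeline. The functions below are hypothetical stand-ins for the vendor STT, NLP, and TTS calls, not Soul Machines or vendor APIs:

```python
# Illustrative sketch of the STT -> NLP -> TTS round trip.
# transcribe, query_nlp, and synthesize are hypothetical placeholders
# for the third-party vendor calls, not real APIs.

def transcribe(audio: bytes) -> str:
    """Placeholder for the vendor STT call (e.g. Google, AWS)."""
    return "what are your opening hours"

def query_nlp(text: str) -> str:
    """Placeholder for the NLP backend that answers the user query."""
    return f"You asked: {text}"

def synthesize(text: str) -> bytes:
    """Placeholder for the vendor TTS call; returns audio bytes."""
    return text.encode("utf-8")

def handle_utterance(audio: bytes) -> bytes:
    # 1. Audio received over WebRTC is sent to the STT service.
    user_text = transcribe(audio)
    # 2. The transcript is forwarded to the NLP as the end-user query.
    reply_text = query_nlp(user_text)
    # 3. The NLP response is voiced by the TTS service.
    return synthesize(reply_text)
```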

Soul Machines supports Google, IBM, AWS, and Microsoft STT services.
Soul Machines also supports phrase hints (Google and AWS) for STT services. Phrase hints increase the likelihood that the STT service will correctly transcribe domain-specific words and phrases; for details, refer to the relevant third-party vendor documentation. Your Soul Machines technical contact can help with this.
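As an illustration, Google Cloud Speech-to-Text accepts phrase hints via the `speechContexts` field of the recognition request body. The hint phrases below are placeholder examples only:

```python
# Sketch of a Google Cloud Speech-to-Text (v1 REST) request body with
# phrase hints. The hint phrases are placeholder examples; see Google's
# speechContexts documentation for the full field reference.

def build_recognition_request(audio_b64: str, hints: list[str]) -> dict:
    return {
        "config": {
            "encoding": "LINEAR16",
            "sampleRateHertz": 16000,
            "languageCode": "en-US",
            # Phrase hints bias the recognizer toward domain-specific terms.
            "speechContexts": [{"phrases": hints}],
        },
        # Base64-encoded audio content (elided here).
        "audio": {"content": audio_b64},
    }

request = build_recognition_request("...", ["Soul Machines", "Digital Person"])
```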

Text responses generated by the NLP are first pre-processed by your Soul Machines Persona Server, which parses any Emotional Markup Language (EML); the remaining static text is passed to the TTS service. The final output is the voiced content, uttered by your Digital Person. Soul Machines supports Google, Microsoft, and AWS Polly TTS services, with a number of voice configuration options, including language, accent, speed of dialogue, and other voice components.
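A minimal sketch of this pre-processing split, assuming a hypothetical `[emote:...]` markup form (the real EML syntax may differ), followed by a Google Cloud Text-to-Speech style request body showing typical voice configuration fields:

```python
import re

# Hypothetical EML tag form, used for illustration only; the actual
# Emotional Markup Language syntax may differ.
EML_TAG = re.compile(r"\[emote:[^\]]+\]")

def split_eml(nlp_response: str) -> tuple[list[str], str]:
    """Separate emotion markup (handled by the Persona Server) from the
    static text that is sent on to the TTS service."""
    tags = EML_TAG.findall(nlp_response)
    text = EML_TAG.sub("", nlp_response).strip()
    return tags, re.sub(r"\s+", " ", text)

def build_tts_request(text: str, rate: float = 1.0) -> dict:
    # Google Cloud Text-to-Speech style body: language/voice selection
    # and speaking rate are typical configurable voice components.
    return {
        "input": {"text": text},
        "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
        "audioConfig": {"audioEncoding": "MP3", "speakingRate": rate},
    }

tags, text = split_eml("[emote:happy] Hello there, how can I help?")
tts_request = build_tts_request(text)
```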

The entire speech stack (STT/NLP/TTS) is hosted by the Soul Machines customer's chosen providers, and API keys for these services must be provided to Soul Machines, as described in the NLP Integration section.