More than words – Designing and building voice interfaces

Fresh from the WebExpo 2023 stage, Léonie Watson explores the importance of the voice behind a brand, the use of synthetic speech and its many applications in the modern world.

In her introduction, Léonie highlights how a recognisable sonic logo can be just as impactful as a brand’s visual logo, using examples from the likes of Intel and McDonald’s, whose sonic logos are easily recognisable worldwide.

Alongside the use of sonic logos, brands often use an easily recognisable voice interface within their advertising, something that Léonie delves into in further detail in this video.

The technology

The recording covers each of the topics outlined below at greater length. To begin with, Léonie takes you through some of the technology behind voice interfaces to give you an understanding of how they are used, including:

The architecture of voice interfaces and the types most commonly available. Those are:
- In the cloud – such as Echo, Google Home, etc, or custom applications;
- On device – such as screen readers; and
- Browser-based interfaces.

Synthetic speech, how it is created and the history behind it.
- Text to speech (TTS), which is used on many of the devices in our everyday lives.
- Digital text-to-speech conversion – something that has evolved over time. Early on it was used at a very basic level, but we can now alter the gender, tone and frequency to suit our needs.

TTS Timeline. The speaker will also take you through a history of synthetic speech and how it has developed over time, covering:
- Formant TTS – formant synthesis is rule-based. The acoustic characteristics of speech are extracted from a voice recording, then programmed as rules for recreating the voice as digital audio.
- Concatenative TTS – concatenative synthesis uses a voice recording that’s broken down into tiny chunks. Those chunks are then reconstructed from a database to form the synthesised speech. Whilst the result is somewhat improved, it still lacks some humanity.
- Parametric TTS – parametric synthesis uses a statistical model. The parameters of speech are extracted from a voice recording, then Hidden Markov Models (HMM) modify the parameters based on the text to be spoken.
- Neural TTS – the most up-to-date synthetic speech available, neural synthesis uses artificial neural networks trained on huge amounts of voice data. The neural networks learn the relationship between text and speech, and the result is voices that sound very close to human speech.
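
To make the concatenative approach in the timeline above a little more concrete, here is a minimal TypeScript sketch. It is an illustration only and is not taken from the talk: the unit labels, the sample values and the names `units` and `concatenate` are all made up, and a real engine would also choose between many candidate chunks and smooth the joins between them.

```typescript
// Hypothetical unit database: each label maps to a short run of audio samples.
// In a real system the chunks would be cut from a voice recording; the values
// here are invented so that the example stays self-contained.
const units = new Map<string, Float32Array>([
  ["he", new Float32Array([0.1, 0.2])],
  ["el", new Float32Array([0.3])],
  ["lo", new Float32Array([0.4, 0.5, 0.6])],
]);

// Look up each requested chunk in the database and join the samples end to end.
function concatenate(labels: string[], db: Map<string, Float32Array>): Float32Array {
  const chunks = labels.map((label) => {
    const chunk = db.get(label);
    if (chunk === undefined) {
      throw new Error(`No recorded unit for "${label}"`);
    }
    return chunk;
  });

  // Allocate one buffer big enough for every chunk, then copy them in order.
  const total = chunks.reduce((sum, chunk) => sum + chunk.length, 0);
  const output = new Float32Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    output.set(chunk, offset);
    offset += chunk.length;
  }
  return output;
}

// Three chunks joined into one stream of six samples.
const speech = concatenate(["he", "el", "lo"], units);
```

Because the output can only ever contain chunks that were recorded, the voice is limited to whatever is in the database, which is one way to understand why this generation of TTS can still sound mechanical.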

Prosody and emotion within TTS

Prosody is the human element that is applied in synthetic speech. It covers the variables that differentiate one voice from another, such as pitch, volume etc. Prosody can also be used to alter the tone of a particular word or phrase to change the emphasis, and therefore the meaning behind a phrase or sentence. Léonie expands on this within the video.
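
One widely used way to describe prosody to a speech synthesiser is the W3C's Speech Synthesis Markup Language (SSML), which has `prosody` and `emphasis` elements. This summary does not say which tools the talk demonstrates, so treat the following TypeScript as a rough sketch under that assumption. The helper names (`prosody`, `emphasis`, `speak`, `escapeXml`) are hypothetical; only the element and attribute names come from SSML.

```typescript
// Settings for SSML's <prosody> element. The attribute names are SSML's;
// this interface is a hypothetical convenience for the example.
interface ProsodyOptions {
  pitch?: string;  // e.g. "low", "high", "+10%"
  rate?: string;   // e.g. "slow", "fast"
  volume?: string; // e.g. "soft", "loud"
}

// Escape characters that have a special meaning in XML.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap text in a <prosody> element, writing only the attributes that were set.
function prosody(text: string, options: ProsodyOptions): string {
  const pairs: [string, string | undefined][] = [
    ["pitch", options.pitch],
    ["rate", options.rate],
    ["volume", options.volume],
  ];
  let attributes = "";
  for (const [attribute, value] of pairs) {
    if (value !== undefined) {
      attributes += ` ${attribute}="${escapeXml(value)}"`;
    }
  }
  return `<prosody${attributes}>${escapeXml(text)}</prosody>`;
}

// Wrap text in an <emphasis> element.
function emphasis(text: string, level: "strong" | "moderate" | "reduced" = "strong"): string {
  return `<emphasis level="${level}">${escapeXml(text)}</emphasis>`;
}

// Join fragments into one SSML document. Plain strings passed here are
// assumed to be XML-safe already.
function speak(...parts: string[]): string {
  return `<speak>${parts.join(" ")}</speak>`;
}

// The same words with the stress on a different word each time.
const onSpeaker = speak(emphasis("I"), "never said that.");
const onDenial = speak("I", emphasis("never"), "said that.");

// A slower, lower delivery of a whole phrase.
const slowAndLow = speak(prosody("Welcome back.", { pitch: "low", rate: "slow" }));
```

The two emphasis examples contain identical words, yet a listener would take a different meaning from each, which is the effect described above. How closely a given voice follows these hints varies from one speech engine to another.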

Utilising emotion in synthetic speech is also important to brand voice. The talk covers the ways that it can be used, though the technology is currently in its early stages.

Finding your voice

Léonie then delves into the things that you will want to consider as you establish your own brand voice using synthetic speech. Whilst this is a brief list, the speaker provides a full breakdown of each point within the WebExpo video.

Gender
Accent
Prosody
Emotion

But what are we missing?

Whilst we can trace the successful development of synthetic speech over time, we believe there is still room for improvement, something that Léonie covers in greater detail in the recording below.

The belief is that we are missing a vital piece of the voice puzzle: there is currently no way to alter the way in which content, such as our own blogs or your own articles, is read aloud by web readers. In essence, you cannot alter the tone, gender, prosody or any of the other elements that are covered in the video. This means that if your content is being read aloud to a consumer, as opposed to them sitting and reading the material themselves, there is no way for the key brand voice to truly shine through, which does the content a great injustice.

So, if you’re interested in establishing your own brand voice using synthetic speech, or if you also think that we need to close the gap that still very much exists, then take some time to watch the recording in full.