
Voice Technology That's More Human Than Human

Leor Grebler

- Last Updated: December 2, 2024


The phone rings and you answer. It's your mother. In a very direct way, she tells you she wants a photo book of your last trip together for her birthday, along with a card with kittens on the front, and hydrangeas. Write neatly, she says. The phone then disconnects.

Okay, that actually wasn't your mother. That was your mother's bot, who was doing her a favor and calling to remind you of her birthday. All she had done was answer "yes" to her own AI assistant's question, "Would you like to do something for your birthday next month?"

While this is a benign scenario of a slightly runaway AI, we're likely going to be dealing with questions like this in the next two to five years, as AI's abilities in areas we thought were human-only domains surpass our own to such a degree that we'll need these tools to augment our own abilities just to keep up.

First, it was Deep Blue defeating Kasparov at chess, then it was Watson beating Jennings at Jeopardy!. Now Go is gone. Poker is also un-winnable. While these examples were great exhibitions, we're going to see the results of these losses encroach on our day-to-day lives.

It's likely that the augmentation of our own knowledge through AI tools in language and voice processing is going to be necessary to keep up with the flow of information. As AI philosophers such as Sam Harris have speculated, there might end up being an AI gap similar to a wage gap, where those with access to super AIs accumulate more capability.

Speech Recognition

Over the past year, voice and language technologies have started to hit new milestones. Microsoft announced last year that its speech transcription error rate had dropped below that of human transcribers. Since then, tens of millions of new voice-first devices have flooded the market, capturing thousands of hours of sample audio to learn from every day.
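
"Human parity" claims like Microsoft's are usually stated in terms of word error rate (WER): the word substitutions, insertions, and deletions a recognizer makes relative to a reference transcript. As a rough illustration (not from the original announcement), here's a minimal sketch of how WER is computed:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("turn on the kitchen lights", "turn on a kitchen light"))  # 0.4
```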

While speech recognition for simple commands is getting better, the improvements will trickle down to other areas of speech recognition such as noise suppression, far-field voice interaction, and even multi-speaker interaction. The result of these improvements is that low-cost microphones can be embedded into more products with less need for powerful digital signal processing chips.

Speech recognition will also push beyond human error rates in most languages, and then extend to the edges of speech: understanding what we say when we sing, whisper, shout, talk in a different accent or like a duck, or even belch out our words.

Taken to an extreme, our devices will be able to understand us when we're talking over each other, in a crowded room, and be able to process all of these interactions concurrently.

More Insights Through Voice

We're starting to see new technologies become available through voice that will soon surpass humans, the first being emotion recognition. Our devices will be able to assess our moods very quickly and, soon, determine them more accurately than we can determine each other's moods. The error rate will drop below our own, and we might end up using these tools to determine whether the people around us are truthful.
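
Under the hood, emotion recognition from voice typically starts with acoustic features - pitch, energy, timbre - fed into a trained classifier. A minimal sketch of the feature-extraction step, assuming the librosa library and an illustrative audio file (not from the original post):

```python
import numpy as np
import librosa

def acoustic_features(path: str) -> np.ndarray:
    """Extract a small feature vector (energy, pitch statistics, MFCC means) from a voice clip."""
    y, sr = librosa.load(path, sr=16000)
    rms = librosa.feature.rms(y=y).mean()                             # overall loudness / energy
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)                     # rough pitch track
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)  # coarse timbre summary
    return np.concatenate([[rms, f0.mean(), f0.std()], mfccs])

# A real system would feed vectors like this into a classifier (e.g. logistic
# regression or a small neural network) trained on mood-labeled recordings.
features = acoustic_features("sample_utterance.wav")  # hypothetical file name
```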

Our devices, by comparing our voice against millions of others, will also become our own Henry Higgins - knowing where we're from based on how we speak, or even what languages and cultures we've been exposed to.

Do we have any particular accents or traits? Are there certain idioms we use that are similar to others'? Expressions? Language quirks? We may soon be able to tell to what degree someone speaks like a Bostonian versus a New Yorker versus a Minnesotan.

While the basics like age, gender, and even ethnicity can be pulled out of speech samples, some companies are now looking at health assessment based on voice. We know what it sounds like when someone has a head cold or is losing their voice from coughing - but what about early signs of things like throat cancer, or more benign conditions like low hydration?

The timbre of your voice could even reveal mental stability or trauma.

The Beating of That Hideous Heart

The same microphones that are used to analyze voice could also be tuned to listen for other environmental features or biological events. How many times did you go to the bathroom today (yes, TMI)? Super hearing might even be able to pick up your heartbeat from across the room.

And with heart rate come other insights like heart rate variability, respiration rate, and even sounds of frustration. How lightly are you stepping? What's your activity during the day? Are you slamming the door? All of this insight can be gleaned and documented, then compared against millions of others to provide actionable insights.

Learning from Our Words

There can also be analysis based on our actual spoken words. Already, there's the ability to analyze large swaths of text for sentiment and personality traits, but what if this could be done at such a large scale that the accuracy increases? What if our machines learn which buttons they can push to influence us?

Think of a massive A/B test to measure the psychology of influence. One test group gets a particular mood in its responses and the other group gets something different. Does it affect compliance or mood? What if this is done over and over until eventually we have a machine that knows how to influence us better than we can influence each other? Can the lessons from this learning be used to help us improve our own communication and influence?
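
Mechanically, such an experiment reduces to a standard two-group comparison. A toy sketch with made-up numbers (the response styles, sample sizes, and compliance counts are assumptions for illustration only):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-proportion z-test: did response style B change compliance versus style A?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers: 5,000 users per arm, "neutral" vs. "cheerful" assistant replies.
z, p = two_proportion_z(success_a=2100, n_a=5000, success_b=2265, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")  # at this scale, even a ~3-point lift is detectable
```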

Such analysis could also produce results for assessments like Myers-Briggs and tailor recommendations to us accordingly.

You Look Old

While voice interaction is just one way systems will glean information about us, the other is through visual interaction and analysis. Already, systems like Microsoft's Cognitive Services can assess gender, age, and mood. This will likely hit human levels of accuracy over the next two years and then surpass us. There will be no poker face you can use to deceive the machine.
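
For a sense of how accessible this already is, services of this kind expose face-attribute estimates over a simple REST call. A rough sketch against Microsoft's Face API as it was offered at the time (the endpoint region, key, and available attributes are assumptions and have since changed):

```python
import requests

# Placeholder endpoint and key; the attributes offered have changed over time.
ENDPOINT = "https://westus.api.cognitive.microsoft.com/face/v1.0/detect"
KEY = "<your-subscription-key>"

def assess_face(image_url: str) -> list:
    """Ask the Face API for age, gender, and emotion estimates for an image."""
    resp = requests.post(
        ENDPOINT,
        params={"returnFaceAttributes": "age,gender,emotion"},
        headers={"Ocp-Apim-Subscription-Key": KEY,
                 "Content-Type": "application/json"},
        json={"url": image_url},
    )
    resp.raise_for_status()
    return resp.json()  # one entry per detected face, each with a faceAttributes object

faces = assess_face("https://example.com/portrait.jpg")  # hypothetical image URL
```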

This technology might become so lightweight that it gets embedded in camera chips, processing and providing this information along with the video stream. We might also see the same technology come up in Skype, FaceTime, or other conference call tools. You might get a message from your AI assistant saying something like "let's talk about John after you're done with your call," and then the assistant proceeds to warn you about John's latest mood changes so you can intervene.

These visual analysis tools will make the other analyses even more accurate and better than our own abilities. Speech recognition could also be improved with the addition of lip reading, similar to HAL's abilities. We've already seen this technology demonstrated with high accuracy; LipNet is one example.

https://www.youtube.com/watch?v=fa5QGremQf8

Putting It All Together

Where will the real opportunities exist when machines are better at knowing us than we are at knowing each other? With all of the variables available in an interaction, the big challenge will be formulating a proper response to the input. Think of a massive matrix of responses based on mood, time, history, activities, gender, preferences, etc. There will be no way to pre-program responses for every scenario. So what's a machine to do?

The first opportunity that developers can start working on is natural language generation (NLG). NLG will allow information from many sources to be spoken back in natural language, as opposed to just a reading of values. This will give AI assistants much more information to draw on when responding to us.
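
At its simplest, NLG means rendering structured values as a sentence rather than reading them out one by one. A minimal template-based sketch (the data fields and phrasing rules are illustrative assumptions):

```python
def render_weather(readings: dict) -> str:
    """Turn raw sensor values into a spoken-style sentence instead of a list of numbers."""
    temp = readings["temperature_c"]
    trend = "warming up" if readings["delta_c"] > 0 else "cooling down"
    feel = "muggy" if readings["humidity_pct"] > 70 else "comfortable"
    return (f"It's {temp:.0f} degrees and {feel} right now, "
            f"and it's been {trend} over the last hour.")

print(render_weather({"temperature_c": 24.3, "delta_c": 1.2, "humidity_pct": 78}))
# -> "It's 24 degrees and muggy right now, and it's been warming up over the last hour."
```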

Tailoring responses will include adjusting the prosody of text-to-speech synthesis. The lowest-hanging fruit is to start by mimicking the cadence, rate, and accent of the user. Then, the system can slowly shift the mood back toward one intended to influence the user.
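
Most text-to-speech engines accept SSML, whose prosody element controls rate and pitch, so mimicking a user's cadence can start as simply as echoing their measured speaking rate back into the markup. A small sketch (the mapping from words per minute to an SSML rate percentage is an assumption):

```python
def to_ssml(text: str, user_words_per_min: float, baseline_wpm: float = 150.0) -> str:
    """Wrap a reply in SSML, roughly matching the user's speaking rate."""
    rate_pct = int(round(100 * user_words_per_min / baseline_wpm))
    rate_pct = max(70, min(130, rate_pct))  # keep the voice within a natural-sounding range
    return f'<speak><prosody rate="{rate_pct}%">{text}</prosody></speak>'

print(to_ssml("Sure, I can add that to your list.", user_words_per_min=180))
# -> <speak><prosody rate="120%">Sure, I can add that to your list.</prosody></speak>
```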

A spookier application is for text-to-speech engines to mimic us completely. How would it feel to hear your own voice, in robot form, responding to you? Future versions of speech synthesis will require fewer audio samples to create new voices. This might mean that AIs can make calls on our behalf to get things done for us... if we permit it.

VaynerMedia recently put together a composite voice of all of its staff, called Vee, and we'll see lots of similar ideas over the coming years:

https://www.youtube.com/watch?v=Q1dyz5-M2yk

Warnings

With the power of new AIs to influence us during our interactions, they will likely end up knowing us better than we know each other, so we will need to make sure we can control a few things:

  1. Alignment of values. Ensure that the AI wants us to be better as *we* define better, not as it does.
  2. Transparency. Can we know not only what information our devices collect on us, but also what their intents are and how they reach certain conclusions?
  3. Authenticity. If we give permission to an AI to make calls impersonating us, how can we make sure it won't go rogue? How can we prevent others from abusing or hijacking our AI?

After all, maybe your mother knew her AI was calling... maybe she didn't?
