Thinking About Adding Voice Capabilities to Your Product? Consider This

Consider these trends and implications before you integrate voice into your product.


Over the next five years, we’re likely to see dozens of devices in our homes embedded with microphones. Today, the Echo, Home, and HomePod are gaining prominence, and third-party hardware makers such as iHome and iDevices have announced products that will embed Alexa and Google Assistant. The embedding of these AI assistants has spilled beyond the smart speaker to many different connected devices.

Hardware makers and brands that haven’t yet implemented voice interaction in their products are asking themselves whether they too should be introducing this capability. The calculation comes down to whether voice is a “passing fad” or whether adding it will provide enough incremental value, relative to its cost, to maintain market share or the price of the product.

Beyond the added value of enabling customers to talk to their products, these brands may also consider the additional opportunities that come from the same technology that enables voice, such as gathering useful analytics on the environment, detecting sound events, or even unifying different AI assistants.

While voice interaction today conjures up standalone products that are similar to the Echo, we’re seeing the shrinking of these devices and the embedding of voice into new products. Light switches, ceiling lights and fans, thermostats, and other surface mounted electronics have been announced or will soon come to market with voice interaction. In addition to these, common appliances like microwaves, ranges, fridges, washing machines, and alarm clocks are all targets for integrating voice.

Today, in considering implementing voice and its supporting technologies (microphones, digital signal processing chips, and software), hardware makers can plan for at least three possible applications:

  1. Controlling the device. This usually involves short commands such as “turn on” or “turn off” or “volume up”. Such interactions might not require the device to have an Internet connection or even an application processor to manage the speech recognition. The device can run in a lower power mode using dedicated hardware.
  2. Integration with an AI assistant. This scenario creates an endpoint for Alexa Voice Service, Google Assistant, or potentially another end-to-end voice assistant.
  3. Gathering analytics or sensing the environment. Microphones can capture much more than just voice commands – it’s possible to identify sounds in the environment or even just to measure the sound level and create rules around volume.

Driving the ability to add voice interaction are advances in four areas: local speech recognition, far field digital signal processing, new speech recognition APIs and voice services, and microphone technology.
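To make the third application in the list above more concrete, here is a minimal sketch of measuring the sound level of a short audio frame and applying a rule to it. It assumes 16-bit PCM samples, and the function names and the -20 and -50 dBFS thresholds are illustrative choices rather than values from any particular product.

```python
import math
from typing import Sequence

FULL_SCALE = 32768.0  # magnitude of the largest 16-bit PCM sample (assumed format)


def level_dbfs(samples: Sequence[int]) -> float:
    """Return the RMS level of one audio frame in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square / (FULL_SCALE * FULL_SCALE))


def volume_rule(
    samples: Sequence[int],
    loud_dbfs: float = -20.0,   # hypothetical threshold
    quiet_dbfs: float = -50.0,  # hypothetical threshold
) -> str:
    """Classify a frame as 'loud', 'quiet', or 'normal' from its RMS level."""
    level = level_dbfs(samples)
    if level >= loud_dbfs:
        return "loud"
    if level <= quiet_dbfs:
        return "quiet"
    return "normal"


if __name__ == "__main__":
    frame = [1000] * 160  # 10 ms of audio at 16 kHz, constant amplitude
    print(round(level_dbfs(frame), 1), volume_rule(frame))
```

A product would likely smooth the level over several frames before acting on it, so that a single door slam does not trigger a rule meant for sustained noise.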

Local speech recognition can take the form of wake word technology, which listens continuously for a specific phrase and starts voice recognition when it hears it. It can also take the form of larger-vocabulary interaction.

While only a handful of companies have been working in this area over the past decade, there are now at least a dozen companies that have released their own technologies. Some of these companies require engagements to train the wake word or local commands while others offer self-serve portals for uploading training data.

Of these areas, digital signal processing (DSP) for far field interaction has seen some of the largest strides. DSPs are necessary to pick up voice at a distance, cancel out ambient noise, separate the voice signal so it can be understood for speech recognition, and ensure that the device can still hear you and respond even when music is playing from it (acoustic echo cancellation, which makes “barge in” possible).

Over the past five years, the cost of dedicated DSP chips has come down drastically and is now a third of what it was. Driving this has been a mushrooming of software and hardware providers. Over the next two years, it’s likely that this cost will be driven down even further, with software-only solutions and technology being given away for free by AI assistant companies for use in conjunction with their services.

Companies thinking about adding voice today need to assess whether the cost of far field technology might fall further if they wait for the DSP code to be given away for free. The drawback is that taking the free solution might limit their flexibility in which applications they can implement on their product and how much they’ll be able to differentiate it from other endpoints for a particular AI assistant.

Driving voice further toward ubiquity are new services for voice interaction. Alexa and Google Assistant are provided for free, but customizable assistants are also coming down in cost and increasing in accuracy. The ability to create a customizable and branded AI assistant around a particular domain can offset the risk of diluting a brand if it’s offered alongside a popular AI assistant. Amazon Lex and Google Cloud APIs make it possible to create AI assistants, albeit limited ones, with a unique voice.

Lower-cost and more accurate microphone technology also makes it more tempting to add voice to hardware. MEMS mics have been dropping in price while gaining built-in processing capabilities. It’s also possible to place them in more orientations, and there are more options for microphone packages.

Some of the potential drawbacks that manufacturers might need to consider include an increase in bill of materials costs, additional risk and liability with security, and potential erosion of their brand power. Going down the path of voice means investing in a product line that will include this capability in later versions of the product. Tooling, lead times, and certification will all be affected by the decision to add voice.

Companies looking to add voice might also need to consider the potential legal ramifications of doing so. Companies that don’t take precautions to prevent hacking of the device can open themselves up to liabilities. The addition of physical mute buttons and visual indicators of listening are important to prevent tampering. It’s also critical that any information sent from the device is encrypted and that proper authentication is put in place to prevent others from hijacking users’ devices.
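As a rough illustration of the authentication point, the sketch below signs each command sent to a device with a shared secret and rejects anything that is tampered with or stale. The function names, message layout, and 30-second freshness window are hypothetical, and this covers authentication only: encrypting the traffic (for example with TLS) would be handled separately.

```python
import hashlib
import hmac
import json
import time
from typing import Optional

MAX_AGE_SECONDS = 30  # hypothetical freshness window to limit replayed commands


def sign_command(secret: bytes, command: str, now: Optional[float] = None) -> dict:
    """Wrap a command with a timestamp and an HMAC-SHA256 tag."""
    issued_at = time.time() if now is None else now
    payload = json.dumps({"command": command, "issued_at": issued_at}, sort_keys=True)
    tag = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_command(secret: bytes, message: dict, now: Optional[float] = None) -> Optional[str]:
    """Return the command if the tag is valid and the message is fresh, else None."""
    expected = hmac.new(
        secret, message["payload"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # compare_digest avoids leaking how many characters matched via timing.
    if not hmac.compare_digest(expected, message["tag"]):
        return None
    body = json.loads(message["payload"])
    current = time.time() if now is None else now
    if abs(current - body["issued_at"]) > MAX_AGE_SECONDS:
        return None
    return body["command"]


if __name__ == "__main__":
    secret = b"per-device-secret"  # placeholder; a real device needs a provisioned key
    message = sign_command(secret, "turn on")
    print(verify_command(secret, message))
```

A shared secret is the simplest option to show here; a production design might prefer per-device certificates so that compromising one unit does not expose the others.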

The other consideration for companies is how adding voice can affect their brand. If the company is not adding their own branded voice assistant, they might need to consider whether a user calling their device “Alexa” will erode their brand power.

If a company does go down the path of adding voice interaction, there are corollary benefits and opportunities that they can consider.

When multiple devices can listen to an individual, they can help localize a user or multiple users, determine presence, and even run their own beam-forming algorithm to eliminate noise from an audio signal.
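One of the simplest beam-forming techniques is delay-and-sum: shift each microphone's signal so the target voice lines up across channels, then average them so that uncorrelated noise partly cancels. The sketch below assumes the channels share a synchronized sample clock and that the per-channel delays (in whole samples) are already known, both of which are significant assumptions when the microphones sit in separate devices. The function name is illustrative.

```python
from typing import List, Sequence


def delay_and_sum(
    channels: Sequence[Sequence[float]], delays: Sequence[int]
) -> List[float]:
    """Align each channel by its delay in samples and average the result.

    delays[i] is how many samples later the target sound arrives at
    microphone i compared with the earliest microphone (so it is >= 0).
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    if not channels:
        return []
    # Only keep the span where every shifted channel still has samples.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        return []
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]


if __name__ == "__main__":
    voice = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
    near_mic = voice + [0.0, 0.0]  # hears the voice first
    far_mic = [0.0, 0.0] + voice   # hears it two samples later
    print(delay_and_sum([near_mic, far_mic], [0, 2]))
```

In practice the delays would have to be estimated, for instance by cross-correlating the channels, and fractional delays would need interpolation.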

Another benefit is the ability to pass a session from device to device. Imagine if you could initiate a conversation with Alexa on one device and continue it on another as you walked around your home. This application is possible when multiple devices run the same integration.

With microphones, it’s also possible to better understand how users are interacting with a piece of hardware. Are they actively using the device? How often? What types of requests are they making of it? With voice interaction, it becomes possible to discover which requests will enable greater engagement with end users.

When it comes to deciding whether to add voice, manufacturers should look at voice as they would other inputs – as part of a “complete breakfast” that could include touch, gesture, and other control interfaces. Beyond making it easier for users to control the product, voice can also be a way of opening up the silos of the various AI assistants by unifying them in a single device, and of winning over users by freeing their information from any one AI assistant.