

"Alexa, How Do
You Work?"

Villanova experts shed light on the inner workings of voice assistant technology and how it has evolved

BY COLLEEN DONNELLY

Voice assistants go by many names—Alexa, Siri, Bixby and Google Assistant, to name just a few. And they take on many forms—as a built-in feature of an ever-increasing list of devices, including smartphones, computers, tablets, smart speakers, gaming devices, TV remotes and even vehicles.

As the availability and capability of conversational artificial intelligence (AI) have grown over the past decade, more and more users have come to rely on voice assistants to search for information, execute tasks on their behalf and quickly answer any number of questions. In 2022, an estimated 142 million Americans—nearly half the country’s population—used voice assistants at least once a month. So how exactly do they work?

Villanova Magazine turned to four experts on the Villanova faculty to explain the technology behind voice assistants, how far these tools have come, where they are headed and the implications for users’ privacy.

MEET THE EXPERTS

Stephen Andriole, PhD

Thomas G. Labrecque Endowed Chair in Business, Villanova School of Business

Dr. Andriole teaches artificial intelligence, machine learning and generative AI, and his areas of research include automation, digital transformation and business technology strategy. An industry and government consultant on all aspects of digital technology, Dr. Andriole is also a go-to media source for all things related to the future of technology and business. He is the author of the recent The Digital Playbook: How to Win the Strategic Technology Game.


Grant Berry, PhD

Assistant Professor of Spanish Linguistics, College of Liberal Arts and Sciences

Before joining the faculty at Villanova in 2020, Dr. Berry worked with Amazon Alexa as a language engineer, technical program manager and applied scientist, developing new features for Alexa and launching Alexa in new languages. As director of the Language Use and Variation Lab in the Cognitive Science Program, he continues to research and investigate the relationship between linguistics, language technology optimization and machine learning.

Xue Qin, PhD

Assistant Professor of Computing Sciences, College of Liberal Arts and Sciences


Earlier this year, Dr. Qin received a grant from the National Science Foundation to investigate how test code can be adapted to develop voice assistant features in mobile applications. She teaches courses in applied machine learning and in algorithms and data structures. Her research interests focus on software engineering, privacy and security, and natural language processing.

Brett Frischmann, JD

Charles Widger Endowed University Professor in Law, Business and Economics, Charles Widger School of Law

A renowned scholar in intellectual property and internet law, Professor Frischmann is a leading source on issues related to surveillance, technology policy and intellectual property. He teaches interdisciplinary courses at the intersection of law, economics, business, ethics and technology and is also an affiliate scholar at Stanford Law School’s Center for Internet and Society.

The Rise of the Voice Assistant

A voice assistant is a digital application or device that interprets and responds to spoken commands or questions from users.

“The technology has evolved rapidly and become increasingly sophisticated,” says Xue Qin, PhD, assistant professor of Computing Sciences. “The popularity of voice assistants has skyrocketed in recent years, with users able to perform tasks faster and more efficiently through simple voice commands.”

These improvements in conversational AI have led to expanded applications for voice assistants, including:

Accessibility (providing an alternative method of accessing information and performing tasks for those with visual impairments or limitations in mobility)
Information retrieval (playing music, making a phone call, checking news and weather)
Home automation (turning on lights, adjusting the temperature)
Shopping (adding items to a shopping list, purchasing services and goods online)
Health and wellness (guided meditations, exercise routines and reminders to take medication)

Voice assistants now have the ability to lend hands-free support in nearly every aspect of daily life—but it didn’t happen overnight.

“The concept of a disembodied voice that can interact with and complete tasks for the device owner isn’t new,” says Grant Berry, PhD, director of the Language Use and Variation Lab and an assistant professor of Spanish Linguistics. “Whether you’re talking about Rosie the Robot on The Jetsons or the computer on Star Trek: The Next Generation, voice assistants have been part of popular culture for decades. It’s only in the last 15 years that they’ve moved from science fiction to reality.”

The technology behind voice assistants has been in development much longer. Decades before Alexa, Siri and Google Assistant became household names, there was IBM Shoebox—the very first digital speech recognition tool. Released in 1961, it was able to recognize 16 words and digits. It’s a far cry from the capabilities of today’s voice assistants, but it laid a solid foundation for the technology now available.

The real turning point for voice assistants came in 2011 when Apple added Siri to the iPhone 4s. For the first time, millions of people had access to a voice assistant right in the palm of their hand. “Since then, functionality has improved, integrations with third-party apps have increased and applications in various industries have broadened,” Dr. Berry says.

Just Say the Word

Voice assistants don’t typically take very long to fulfill a spoken command—but there are quite a number of steps and software components at work to make that happen, namely automated speech recognition and natural language processing technologies.

Natural language processing is an umbrella term for two key areas: natural language understanding (hearing) and natural language generation (speaking).

Dr. Qin explains the key points of the process like this: The voice assistant listens via a microphone for its wake word (a phrase that lets the device know a request is coming); “translates” the user’s spoken command or question to text through automated speech recognition and natural language understanding; performs tasks by executing predesigned code; and talks back by using AI technology called neural text-to-speech (a form of natural language generation).
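The steps Dr. Qin describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any commercial assistant is built: typed text stands in for audio, so speech recognition and speech synthesis are left out, and the wake word, the `Intent` class, the keyword rules and the handler names are all hypothetical.

```python
from dataclasses import dataclass

WAKE_WORD = "alexa"  # assumption: one hypothetical wake word


@dataclass
class Intent:
    name: str
    slots: dict


def detect_wake_word(transcript: str) -> bool:
    """Return True when the utterance starts with the wake word."""
    return transcript.lower().startswith(WAKE_WORD)


def understand(transcript: str) -> Intent:
    """Toy language understanding: keyword rules, not a trained model."""
    text = transcript.lower()
    if "weather" in text:
        return Intent("get_weather", {})
    if "light" in text and "off" in text:
        return Intent("lights_off", {})
    return Intent("unknown", {})


def execute(intent: Intent) -> str:
    """Run the predesigned handler for the intent and return the reply text."""
    handlers = {
        "get_weather": lambda slots: "Here is today's forecast.",
        "lights_off": lambda slots: "Okay, the light is off.",
    }
    handler = handlers.get(intent.name)
    if handler is None:
        return "Sorry, I didn't catch that."
    return handler(intent.slots)


def handle_utterance(transcript: str):
    """Wake word -> understanding -> execution -> reply text.

    Returns None when the assistant was not addressed.
    """
    if not detect_wake_word(transcript):
        return None
    command = transcript[len(WAKE_WORD):].strip(" ,")
    return execute(understand(command))


if __name__ == "__main__":
    print(handle_utterance("Alexa, turn off the light"))
```

In a real device, the reply string would be handed to a text-to-speech engine rather than printed.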

Dr. Berry has an insider’s knowledge of the inner workings of voice assistants. Before joining the faculty at Villanova in 2020, he employed his skills in linguistics and understanding language variation as a language engineer, technical program manager and applied scientist for Amazon Alexa. In addition to supporting the launch of Alexa in Hindi, Portuguese and Arabic, Dr. Berry worked on household-related applications of the technology and interfaces with smart home devices.

“Even a simple command like ‘turn off the light’ requires a lot of different levels of understanding that we may often take for granted,” says Dr. Berry. The voice assistant has to understand:

What a light is
What “off” is (and alternately, what “on” is)
That “turn” is a verb that can mean “to rotate,” but also means “to initialize” when it’s paired with “on”
Which light the user is referring to
Regional variations of “turn off the light” (e.g., “turn out the light,” “shut the light,” and “close the light”)

“When it comes to developing these programs, the language engineers have to think about what users are intending and all of the different possible ways they could get to that intention, and that’s not a trivial task,” Dr. Berry says. The voice assistant’s natural language program has to be robust enough to filter out background noise and support quite a bit of variation in voices—including languages, dialects, accents, ages, genders, regional phrasing, pitch and volume.
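To give a sense of what supporting variation involves, here is a small Python sketch that maps the regional phrasings listed above to a single “light off” request. It is a hand-written illustration only: the patterns and function names are hypothetical, and production systems typically learn such mappings from large amounts of data rather than from a short list of rules.

```python
import re

# Hypothetical phrase patterns covering the regional variants mentioned above.
LIGHT_OFF_PATTERNS = [
    r"\bturn (?:off|out) the lights?\b",
    r"\bturn the lights? (?:off|out)\b",
    r"\b(?:shut|close) (?:off )?the lights?\b",
]


def normalize(utterance: str) -> str:
    """Lowercase, drop punctuation and collapse repeated whitespace."""
    text = utterance.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def is_light_off_request(utterance: str) -> bool:
    """Return True if any known phrasing of 'light off' appears in the utterance."""
    text = normalize(utterance)
    return any(re.search(pattern, text) for pattern in LIGHT_OFF_PATTERNS)
```

Even this toy version shows why the task is not trivial: every new phrasing, dialect or word order needs to be anticipated, and the sketch still says nothing about which light the user means.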

Once the voice assistant understands what the user is asking for, it uses speech-to-text conversion software to enter that request into the system.

If it doesn’t understand the question or it needs more information to fulfill the task, the voice assistant will formulate follow-up questions and use text-to-speech software to ask for clarifications or more specifics. “Neural text-to-speech technology gives each of these assistants a voice, synthetic speech created from millions of curated training examples,” Dr. Berry explains.

With the necessary information gathered, the voice assistant then answers the question by searching the internet or executes the task by connecting to built-in applications, like a calendar or clock, or authorized third-party applications, like subscription-based streaming services or even bank accounts.

Smarter Every Day

Through each interaction, the voice assistant becomes better at understanding requests and providing more accurate responses. Sometimes a voice assistant will ask, “Did I answer your question?” or “Was that what you were looking for?”

“When I say, ‘yes,’ that’s machine learning in action—I’m training the voice assistant,” explains Stephen Andriole, PhD, Thomas G. Labrecque Endowed Chair in Business. “The voice assistant continues to expand its knowledge base with data it’s getting from me and millions of other users.”

And it’s not just learning more about speech recognition—it’s also learning more about the user. “It starts to get to know me personally, how I like to be communicated with, how I like to be addressed,” Dr. Andriole says. “What’s going to happen with these interfaces is that they will begin to interpret my intentions based on the questions I ask or the requests I make, and they will become anticipatory.”

For instance, let’s say a user asks every weekend about the weather at a local fishing spot, as well as surf reports and tide tables. After several interactions, the voice assistant may correlate these requests with the user’s intention: to go fishing on the weekends at a particular location. And it may offer to package this information into a report that it delivers every Saturday morning.

“That’s the kind of voice interaction that’s prompted by data that’s collected over a period of time,” Dr. Andriole says. Many users already experience this type of interaction if their voice assistant is integrated with a shopping application. It’s watching what the user buys and when, and is able to infer when they’re likely running low on something. Then it prompts the user: “I noticed that you may be running low on oatmeal. Would you like me to order that for you?”
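One simple way to picture this kind of inference is to look at how often an item has been bought in the past. The Python sketch below is an assumption-laden illustration, not a description of any retailer’s actual method: it averages the gaps between past purchases and suggests a reorder once that much time has passed again. The function names and the sample dates are hypothetical.

```python
from datetime import date


def average_interval_days(purchase_dates):
    """Mean number of days between consecutive purchases, or None if fewer than two."""
    ordered = sorted(purchase_dates)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def should_suggest_reorder(purchase_dates, today):
    """Suggest a reorder once the usual gap since the last purchase has elapsed."""
    interval = average_interval_days(purchase_dates)
    if interval is None:
        return False  # not enough history to infer a pattern
    days_since_last = (today - max(purchase_dates)).days
    return days_since_last >= interval


if __name__ == "__main__":
    # Hypothetical history: oatmeal bought every two weeks.
    oatmeal = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 1, 29)]
    if should_suggest_reorder(oatmeal, today=date(2024, 2, 12)):
        print("I noticed that you may be running low on oatmeal.")
```

A real system would likely weigh many more signals, such as quantity purchased and household size, but the principle is the same: the prompt is driven by data collected over time.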

“Phase one is the voice assistant providing simple answers to my questions, and phase two is when it starts to use that data to interpret and understand my intention and purpose—that’s when it becomes more valuable to me,” Dr. Andriole says. “It’s akin to Netflix profiling me in terms of the films and TV shows I like—it’s watching what I’m watching to make more helpful, useful recommendations for me.”

There is, however, a trade-off involved: convenience at the expense of privacy. “How much of their privacy will users be willing to sacrifice for this kind of convenience?” Dr. Andriole asks. “That depends almost entirely on how useful and helpful it is for them.” ■