Communications of the ACM
Speech Interfaces from an Evolutionary Perspective


What makes humans different from all other animals? There are few better answers than our ability to recognize and produce speech. Speech is so much a part of being human that people with IQ scores as low as 50 or brains as small as 400 grams fully comprehend and speak at a competent level [11].

Similarly, the human head and brain are uniquely evolved to produce speech [4, 11]. Compared to other primates, humans have remarkably well-developed and controllable muscles around the lips and cheeks. Indeed, more of the motor cortex (particularly Broca's area) is devoted to vocalization than to any other function, in sharp contrast to every other animal, including primates. Only homo sapiens can use the tongue, cheeks, and lips, together with the teeth, to produce 14 phonemes per second [11]; even the Neanderthals, who are known for having large brains and elaborate cultural and social behaviors, could not sustain speech due to the structure of their breathing apparatus [4].

Modern humans are also exquisitely tuned for speech recognition. Infants as young as one day old show relatively greater left hemisphere electrical activity to speech sounds and relatively greater right hemisphere activity to non-speech sounds [11]; by 22 days old, infants exhibit the adult tendency for right-ear (and left-brain-hemisphere) dominance for spoken sounds (regardless of language) and left-ear (and right-brain-hemisphere) dominance for music and other sounds [11].

Why is the fundamental and uniquely human propensity for speech so important in the design of computerized speech systems? The straightforward answer is that the best approach for creating systems that comprehend or produce speech is to model human physiology and cognition. Understand how humans process verbal input, the logic goes, and one has an extraordinarily effective and accurate speech-recognition system. Similarly, model the human vocal apparatus, and one has a system that can fluently produce any language, from the clicks of the !Kung San, to the tonal languages of China, to the enormous variety of English pronunciations. This approach to the problem of designing speech systems has emerged as the dominant paradigm for design, especially in speech-output systems.


Social Aspects of Speech

The more provocative and less-explored consequences of the uniqueness of human speech come from a socio-evolutionary, rather than an engineering, orientation. For homo sapiens, the most important predators, prey, and mating opportunities are other humans. Having encountered a definitive marker of humanness, such as speech, a person's optimal strategy is to preserve limited cognitive capacity and leap to the conclusion that one has encountered a person. Hence, when humans hear speech or find themselves speaking, they immediately change their processing strategy from "Is this a person?" to "What should I know about this person?" and "How should I respond?"

Throughout evolutionary history, this strategy has been optimal. Because nothing else could produce or process speech, anyone pondering the question of "human or not" missed opportunities and encountered problems that were effectively responded to by those quick-accepting people who won the right to be our ancestors. Few decision rules would have greater influence on humans today than one that has been highly accurate and important in 100,000 years of homo sapiens evolution: Speech equals human [4].

For better or for worse, the late twentieth century renders this formula invalid. Computer-based voice interfaces for input and output have become commonplace. Voice-input and voice-output systems have moved beyond such simple tasks as hearing pre-recorded wake-up calls and telling a toy car to go "left" or "right"; they are now used to order stocks, manage personal appointments, control one's car, dictate text into a word processor, entertain children, and perform a host of other tasks. Graphical user interfaces are being complemented, and in some cases replaced, by voice user interfaces. The trend toward "ubiquitous computing" necessitates a voice component for hands-free and eyes-free computing, small displays, the disabled, and non-literate populations, including children.

What does the human brain do when confronted by speech technologies? Because human-directed responses are automatic and unconscious [10] and require continuous reminders to be extinguished, humans do not have the wherewithal to overcome their fundamental instincts concerning speech. Thus, individuals behave toward and make attributions about voice systems using the same rules and heuristics they would normally apply to other humans.


Linking Evolutionary Psychology and Speech Interfaces

It may be difficult to accept this claim—and even more difficult to feel comfortable designing voice-based computer systems according to it—because it seems to violate our fundamental intuitions. For instance, in the first author's experiments with thousands of participants, no one has ever said, "Of course I treat the computer like a person. After all, it talks to and understands me." Hence, reliance on people's conscious beliefs requires the conclusion that evolution is irrelevant, voice is merely a technological feat, and voice-based computers are fundamentally machines.

To establish that speech always leads to responses associated with human "interactants," we must take an indirect approach. Specifically, we adapt the prescription of The Media Equation's four steps [10]:

  1. Identify what is known about how the brain evolved to produce and understand speech.
  2. Draw conclusions about how individuals respond to other people because of the first step.
  3. Have experimental participants interact with a speech system, rather than a person, and determine whether the participants exhibit the same attitudes and behaviors described in the second step.
  4. Draw out implications for design and highlight open questions.

In what follows here, we discuss a wide range of research on user interactions with voice interfaces in terms of evolutionary psychology and this four-step methodology. Of necessity, the review is not exhaustive; our goal is to highlight the range of possibilities.

Evolutionary principle 1. A primary goal of speaking is to be understood. While words may have evolved to aid thinking, spoken language evolved to facilitate interaction with other humans [11]. Although people are evolved to comprehend speech, we all have experienced interactions in which another person had difficulty understanding what we were trying to say. This "at-risk" listener may not be proficient in the particular language we are using, may have difficulty hearing, or may be distracted, among other inhibitors. Given the desire to be understood and the exquisite control humans have over their speech output, humans use "hyperarticulate" speech (such as increased pauses and elongated words) when encountering people with comprehension difficulties.

In a beautiful set of studies, Sharon Oviatt and her colleagues at the Oregon Graduate Institute of Science and Technology [9] demonstrated that computers with poor recognition elicit the same adaptive behaviors. They presented users with a system that varied in its recognition failure rate. When the system failed to recognize an utterance, users were instructed to repeat the utterance until they were understood. Consistent with an evolutionary perspective, participants exhibited the same adaptations humans use with non-comprehending human listeners. Specifically, when the system failed, participants not only changed the duration and amplitude of their speech; during high error rates, they also exhibited fewer "disfluencies," such as "um" and "uh," and hyperclear phonological features, such as more precise diction [9].




Design implication. Speech recognition systems that learn may not improve steadily, no matter how effective their adaptive algorithms are.

Imagine a system that starts out with a high error rate. The user produces hyperarticulate speech, which the system then begins to adapt to, reducing its error rate. The user, recognizing that the system's recognition is improving, then adopts less hyperarticulate speech. This change in communication style temporarily increases the error rate (because the user is changing his or her speech patterns), leading the system to adapt to the new speech patterns. Although the system and user eventually stabilize, there are inevitable fits and starts, as the sketch below illustrates.
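
A minimal simulation of this co-adaptation loop is sketched here. All rates and constants are hypothetical, chosen only to make the feedback dynamic visible; it is not a model of any particular recognizer.

```python
# Toy simulation of user-system co-adaptation in speech recognition.
# All rates and constants are hypothetical; the point is the feedback
# loop, not the specific numbers.

def simulate(rounds=20):
    error_rate = 0.40          # the system starts out recognizing poorly
    hyperarticulation = 0.0    # 0 = normal speech, 1 = fully hyperarticulate
    tuned_to = 0.0             # the speech style the system is adapted to

    for t in range(rounds):
        # Users hyperarticulate in proportion to the errors they experience.
        hyperarticulation += 0.5 * (error_rate - hyperarticulation)

        # The system slowly adapts toward whatever style it is hearing now.
        tuned_to += 0.3 * (hyperarticulation - tuned_to)

        # Errors grow with the mismatch between the current speech style and
        # the style the system is tuned to, on top of a small error floor.
        error_rate = 0.05 + 0.8 * abs(hyperarticulation - tuned_to)

        print(f"round {t:2d}  error={error_rate:.2f}  "
              f"hyperarticulation={hyperarticulation:.2f}")

if __name__ == "__main__":
    simulate()   # the error rate dips, rebounds, and only gradually settles
```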

Open question. Will other markers of at-risk listeners, such as the computer's accent in a speech input/output system, lead human users to hyperarticulate in systematic ways?

Evolutionary principle 2. Humans use speech only when their interactant is physically close. Unlike in many other animals, the human voice does not carry very well. Thus, when one uses speech, evolution would suggest that the other person is physically proximate.1 Physical proximity is important, because when another person, especially a stranger, is close, he or she presents the greatest potential for harm (and opportunity); hence, one is cautious in one's interactions.

To determine whether speech input to a computer would elicit feelings of proximity and thereby elicit caution, a 1997 study by Nass and Gorin at Stanford University had participants provide responses to a series of questions for which the socially desirable answer and the truthful answer conflicted (such as "I never lie"). Consistent with the idea that speech suggests social presence, participants provided significantly more socially appropriate and cautious responses to the computer when the input modality was voice as compared to mouse, keyboard, or handwriting. (Handwriting allowed us to rule out novelty of the technology as an alternative explanation for these results).

Design implication. Criticism from the computer, regardless of the mode of output, is much more disturbing in speech-input systems than in systems using other forms of input.

Design implication. Social errors by the computer, regardless of the mode of output, are much more consequential for speech input compared to other forms of input. Negative comments and behaviors are processed more deeply and lead to greater arousal than positive comments [10]. When combined with presence, these negative behaviors seem yet more threatening and consequential.

Evolutionary principle 3. Humans distinguish voices according to gender. For purposes of reproduction, the second most important determination (after knowing something is a person) is a person's gender. Gender (and sex) is so important that instead of being encoded in terms of the canonical representation of voices, the distinction of male and female in speech is accomplished via unusually detailed and complex auditory psycho-physical processes involving fundamental frequency, formant frequencies, breathiness, and other features [5].
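
To make the role of these acoustic cues concrete, consider the crudest possible version of this judgment, a single threshold on fundamental frequency. The sketch below is a deliberate caricature: human listeners integrate formant frequencies, breathiness, and other features [5], and the 165-Hz threshold is an assumption chosen only for illustration.

```python
# Naive illustration: guessing a speaker's apparent gender from average
# fundamental frequency alone. Human listeners use far richer cues [5];
# the 165 Hz threshold is an arbitrary midpoint chosen for illustration.

def crude_gender_guess(f0_hz: float) -> str:
    """Return a coarse label from average fundamental frequency in Hz."""
    return "likely female-sounding" if f0_hz >= 165.0 else "likely male-sounding"

for f0 in (115.0, 145.0, 180.0, 220.0):
    print(f"F0 = {f0:5.1f} Hz -> {crude_gender_guess(f0)}")
```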

Gender is, of course, irrelevant to computers; indeed, it would be difficult to imagine anything more sexless than a computer screen. However, a recent study challenged this idea [8]. During the experiment, participants were tutored by a (prerecorded male or female) voice-based computer on a stereotypically "male" topic ("computers and technology") and a stereotypically "female" topic ("love and relationships"). The evaluator computer, using a different prerecorded female or male voice, positively evaluated the tutor's performance. Consistent with the research on gender-stereotyping, participants exhibited the following stereotypes toward computers:

  • The male-voiced computer received more credit for praise than did the female-voiced computer;
  • Praise from the male-voiced computer was more convincing than praise from the female-voiced computer; and
  • The female-voiced tutor computer was rated significantly more informative about love and relationships and less informative about technology than the male-voiced tutor computer [8].

In post-experiment debriefings, all participants indicated that male-voiced computers were no different from female-voiced computers and that engaging in gender-stereotyping with respect to computers is ludicrous.

Another study used a more subtle manipulation of gender [10]. Participants viewed depictions of six women whose voices were electronically altered to sound "feminine" (low frequencies cut off, the volume of high frequencies boosted, and overall volume lowered) or "masculine" (the opposite). The voice cues elicited numerous gender stereotypes, even though all the faces were female. The women with masculine-sounding voices were perceived as having significantly more drive, willpower, reasoning skills, persuasiveness, learning ability, and extroversion than those with feminine-sounding voices [10].
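
A rough sketch of this kind of spectral manipulation, using SciPy, appears below; the cutoff frequencies and gain factors are guesses for illustration and are not the settings used in the study reported in [10].

```python
# Sketch of the spectral manipulation described above, using SciPy.
# Cutoff frequencies and gains are illustrative guesses, not the values
# used in the study reported in [10].
import numpy as np
from scipy.signal import butter, sosfilt

def feminize(signal: np.ndarray, sample_rate: int) -> np.ndarray:
    """Cut low frequencies, emphasize highs, and reduce overall level."""
    # 4th-order Butterworth high-pass at an assumed 250 Hz cutoff.
    sos = butter(4, 250.0, btype="highpass", fs=sample_rate, output="sos")
    filtered = sosfilt(sos, signal)

    # Crude high-frequency emphasis: add back a boosted copy of the content
    # above an assumed 2 kHz, then lower the overall volume.
    sos_hi = butter(2, 2000.0, btype="highpass", fs=sample_rate, output="sos")
    emphasized = filtered + 0.5 * sosfilt(sos_hi, filtered)
    return 0.7 * emphasized   # lowered volume

def masculinize(signal: np.ndarray, sample_rate: int) -> np.ndarray:
    """The opposite manipulation: keep lows, soften highs, raise the level."""
    sos = butter(4, 3000.0, btype="lowpass", fs=sample_rate, output="sos")
    return 1.2 * sosfilt(sos, signal)
```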

Design implication. The choice of a particular recorded voice is highly consequential; "casting" is as important in speech interfaces as it is in traditional media.

Design implication. Gender stereotypes should be taken into account, though not necessarily conformed to, when designing interfaces.

Open question. What other demographic characteristics would be recognized by users, and what stereotypes would be elicited by these demographic characteristics?

Evolutionary principle 4. Humans have a very broad definition of "speech"—extending to computer-synthesized speech. Human languages seem remarkably diverse. Many languages have sounds that native speakers of other languages cannot distinguish or produce. For example, native English speakers find it difficult to discriminate the multiple tones associated with the same Chinese syllable, such as /ma/. Conversely, the word "shibboleth," a marker of belonging, comes from the Old Testament, where Jephthah used it to distinguish the fleeing Ephraimites (who could not pronounce "sh") from the Gileadites (who could).

With that said, the human auditory apparatus is focused primarily on distinguishing speech from all other sounds. For example, "dichotic listening tasks" (playing different information simultaneously to each ear) have demonstrated that the right ear (hence the left hemisphere) has an advantage in processing not only one's native language, but also nonsense syllables, speech in foreign languages, and even speech played backward.

Given this broad definition of and special attention to speech sounds, it is not surprising that humans do quite well processing even the very poor quality text-to-speech output produced by computers [7]. However, the bizarre pronunciations and cadences of even the best text-to-speech engines seem to be a constant reminder that a speaking computer is not human. The more interesting question is whether people overlook these constant reminders of non-humanness and process the same secondary characteristics, such as gender, emotion, and personality, associated with evolutionarily relevant vocal patterns among human speakers.

Gender. Consistent with the evolutionary perspective we've outlined here, as well as the third principle, a recent study demonstrated that even unambiguously synthesized speech elicits "gendered" responses [3]. Participants in the study worked with a computer employing either a synthesized female voice (F0=220Hz) or a synthesized male voice (F0=115Hz). They were confronted with a series of choice dilemmas; for each, the computer would argue for one option over the other. Following gender stereotypes, participants accepted the "male" computer's suggestions significantly more often than those of its "female" counterpart. Remarkably, there were also social identification effects, including a significant crossover interaction with respect to trustworthiness and attractiveness, such that people preferred their own (clearly non-human) gender [3].
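
As a purely illustrative sketch, the two voice conditions could be specified with the study's reported baseline pitches using SSML prosody markup (which postdates the study); the surrounding code and the wording of the spoken text are placeholders.

```python
# Hypothetical sketch: casting the "male" and "female" synthetic voices with
# the baseline pitches reported in [3], expressed as SSML prosody markup.
# SSML postdates the study; this only illustrates how a designer might pin
# a synthetic voice's baseline F0 today.

SSML_TEMPLATE = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US">'
    '<prosody pitch="{pitch_hz}Hz">{text}</prosody>'
    '</speak>'
)

def cast_voice(text: str, persona: str) -> str:
    """Wrap text in SSML using the F0 values reported in the study [3]."""
    pitch_hz = {"male": 115, "female": 220}[persona]
    return SSML_TEMPLATE.format(pitch_hz=pitch_hz, text=text)

print(cast_voice("I would choose the first option.", "male"))
print(cast_voice("I would choose the first option.", "female"))
```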

Emotion. If gender is the fundamental morphological aspect of humans, then emotion may be the fundamental affective aspect. Indeed, when people are asked what distinguishes humans from computers, "emotion" leads the list. Recognition of emotion in speech is localized in the right hemisphere, in contrast to the processing of words, which is left-hemisphere dominant [2].

Studies have demonstrated that both major methods of text-to-speech synthesis are capable of carrying emotional tone. Concatenated speech realizes emotion by having the reader record the source text with a particular emotional tone; chopping the words into phonemes and reassembling the speech segments nonetheless retains the original emotional sense. Systems based on acoustic modeling, on the other hand, can manipulate the same parameters that are relevant to detecting emotion in human speech. For example, Janet Cahn [1], a student at the MIT Media Lab, enabled listeners to discriminate anger, disgust, fear, joy, sadness, and surprise by varying text-to-speech parameters, such as pitch, timing, voice quality, and articulation.
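
A sketch of the acoustic-modeling approach might map emotion labels to prosody adjustments along the dimensions Cahn manipulated; the specific values below are invented for illustration and are not the settings reported in [1].

```python
# Illustrative mapping from emotion labels to prosody adjustments, in the
# spirit of the acoustic-modeling approach described above. The numbers are
# invented placeholders, not the settings reported by Cahn [1].

EMOTION_PROSODY = {
    #            relative pitch   rate scale    articulation style
    "anger":    {"pitch": +0.10, "rate": 1.15, "articulation": "tense"},
    "disgust":  {"pitch": -0.05, "rate": 0.90, "articulation": "tense"},
    "fear":     {"pitch": +0.25, "rate": 1.25, "articulation": "precise"},
    "joy":      {"pitch": +0.20, "rate": 1.10, "articulation": "normal"},
    "sadness":  {"pitch": -0.15, "rate": 0.80, "articulation": "slurred"},
    "surprise": {"pitch": +0.30, "rate": 1.05, "articulation": "precise"},
}

def prosody_for(emotion: str) -> dict:
    """Look up the adjustment set for an emotion, defaulting to neutral."""
    return EMOTION_PROSODY.get(
        emotion, {"pitch": 0.0, "rate": 1.0, "articulation": "normal"}
    )

print(prosody_for("sadness"))   # lower pitch, slower rate, slurred diction
```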




Personality. Gender and emotion are primitive constructs found in animals, as well as in people. To truly demonstrate the power of evolutionary psychology to predict user attitudes and behaviors with respect to synthesized speech, one study [7] turned to a complex and high-order construct: personality. That research suggested that humans use at least four cues to detect extroversion or introversion in another person's voice: fundamental frequency, frequency range, amplitude, and speech rate. These parameters were manipulated in the context of a book-selling Web site that presented book descriptions and asked participants for their opinions of the books. The descriptions were presented by either an "introverted" or an "extroverted" voice to both extroverted and introverted participants. Consistent with the well-documented finding that people are attracted to others similar to themselves, participants who heard a voice matching their personalities were more likely to buy the book and trusted the book descriptions and the reviewer more. They also liked listening to the voice and the reviews more than did those who heard a mismatched voice [7].

Interestingly, the match between a person's voice and the text-to-speech voice was irrelevant; it was the participants' actual personalities that were definitive [7]. Thus, similarity lies in the cognitive construct of personality, rather than in a simple decomposition and matching of features of the voice.
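
These four cues suggest a simple matching sketch like the one below; the preset values and the 0.5 threshold are assumptions for illustration, not the parameter settings manipulated in [7].

```python
# Sketch of similarity-attraction matching on the four voice cues named in
# [7]. The preset values and threshold are assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class VoiceSettings:
    f0_hz: float          # fundamental frequency (baseline pitch)
    f0_range_hz: float    # pitch range
    amplitude_db: float   # relative loudness
    words_per_min: float  # speech rate

# Two hypothetical text-to-speech presets: the "extroverted" voice is assumed
# to be higher-pitched, wider-ranging, louder, and faster than the
# "introverted" one.
EXTROVERT_VOICE = VoiceSettings(f0_hz=140, f0_range_hz=80,
                                amplitude_db=0.0, words_per_min=200)
INTROVERT_VOICE = VoiceSettings(f0_hz=110, f0_range_hz=30,
                                amplitude_db=-6.0, words_per_min=150)

def pick_voice(user_extroversion: float) -> VoiceSettings:
    """Match the voice to the user's self-reported extroversion (0.0-1.0)."""
    return EXTROVERT_VOICE if user_extroversion >= 0.5 else INTROVERT_VOICE

print(pick_voice(0.8))   # an extroverted user gets the extroverted preset
```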

Design implication. The choice of a particular synthesized voice is a casting issue.

Design implication. Because adding voices is inexpensive and clearly valuable, systems using text-to-speech synthesis should include multiple voices when possible, varying them by gender, emotion, and personality, among other characteristics.

Open question. Should the particular text-to-speech voice be assigned to the user, or should the user be able to select the voice from an array of options?

On the one hand, there is a general design philosophy in human-computer interface design that "choice is good," and cognitive dissonance theory suggests that people tend to like the voices they choose (although there is also some evidence for "buyer's remorse"). On the other hand, there is no empirical evidence that people select voices that match their own personalities, fit gender stereotypes, or express the appropriate emotional tone, all of which would tend to increase satisfaction.

Open question. What are the consequences of synthesized speech systems that refer to themselves as "I"?

While humans are clearly willing to assign social characteristics to synthesized speech, there is no evidence as to whether the use of the term "I," the Cartesian claim of personhood, will seem natural or be jarring, reminding users, by contrast, that they are not dealing with a real live person.

Evolutionary principle 5. Humans are very effective at discriminating among voices. One of our most remarkable abilities is the "cocktail party" skill; confronted with an array of simultaneous voices, humans "tune into" and process one voice, even if the person is moving throughout the room. Voice discrimination also allows us to successfully listen to the radio and interact with others over a speaker phone. This ability is part of the evolutionarily advantageous strategy of being able to quickly determine who and how many people one has encountered. This skill is so important that the brain uses both featural analysis (in the left hemisphere) and pattern recognition (in the right hemisphere) for distinguishing one unfamiliar voice from another [12].

Because computers are not constrained by the acoustic structures defining each person, a single computer can produce a multitude of voices. Thus, it would seem illogical for humans to respond to different voices on the computer as if they were different "people." However, this idea is challenged by a 1993 study that had users work on a voice-based task with a computer [10]. After they completed the task, study participants were presented with a voice-based questionnaire asking about the performance of the first task. Consistent with the rules of politeness and the idea that different voices mark different social actors, study participants provided significantly more positive feedback when the voice asked about "itself" than when a different voice asked the questions [10]. Participants denied being influenced by the variety of voices.

Just as one computer can have multiple voices, multiple computers can have the same voice. In a study exploring whether voice would be a unifying, as well as differentiating, feature of interfaces, participants were presented with a voice-based computer-tutoring session and a separate voice-based session evaluating the tutor's performance [10]. The two sessions were either on the same or different computers and used the same voice or different voices. Consistent with voice as a marker of identity, study participants thought the computer did better when it was praised in a different voice than when it was praised in the same voice (objectivity versus boasting) [10]. These effects held regardless of whether the voices were on the same or different computers. When a computer has a voice, the box becomes invisible.

Design implication. Voice selection, which can be used to distinguish and/or integrate tasks, is particularly useful when one wishes to suggest that a certain task or object is a "specialist" [10].

Design implication. A narrator voice can credibly praise other parts of a Web site or application.

Open question. Will familiarity with a computer-based voice influence users' processing of that voice? This question is important because we process familiar human voices differently from unfamiliar voices [12].

Evolutionary principle 6. Faces and voices are processed holistically. One of the most influential discoveries in our understanding of how visual speech is processed is the McGurk Effect [4], which demonstrates that syllables "sound" different if the lips suggest a different sound. Humans tend to integrate voices and faces when they are combined. People "hear lips and see voices" [4].

Just as computers can synthesize human speech, tools are now available for synthesizing human faces. Like their speech counterparts, these faces are clearly artificial, but they can provide basic facial expressions and very good lip synchronization.

If synthesized faces are merely an enhancing addition to speech-output systems, their inclusion should have only simple, additive effects, regardless of the type of voice. However, if the evolved tendency to cognitively integrate speech and faces applies to computer-based interfaces, there should be significant interaction between the characteristics of speech and the characteristics of the face.

A recent study paired a synthesized face with synthesized speech (or recorded human speech) and compared these interfaces to voice-only interfaces without faces [6]. Participants were asked a series of personal questions, some open-ended (such as "What do you dislike about your physical appearance?") and some true/false (such as "I never break the law"); they entered their answers via keyboard and mouse. Consistent with the holistic processing model, there were significant crossover interactions; synthesized speech with a synthesized face and recorded speech without a face elicited significantly greater depth and breadth of disclosure, greater impression management, and deeper thought than synthesized speech alone and recorded speech with a synthesized face, respectively [6]. These findings suggest that people may be concerned about consistency between faces and voices.

Design implication. Making each modality as "human-like" as possible does not necessarily lead to a desirable (or even human-like) experience. Although recorded natural speech is clearly more human than text-to-speech output, when accompanied by a synthetic face, an interface actually becomes less social and less appealing. Thus, consistency between the different modalities of an interface may be more important than maximizing each modality independently.

Open question. Which implementation of each possible modality—face, body, motion—is optimal for systems using speech synthesis? Recorded speech? Speech input?


Final Thoughts

Traditionally, when discussion of interface design implicates the human brain, one expects the conversation to focus on cognitive issues, such as recognition rates, extent of in-grammar utterances, and prosody versus clarity. We've sought to provide initial evidence that a detailed understanding of how the human brain evolved for speech production and recognition leads to a wide range of experimental results and design implications involving novel and important social aspects of speech interfaces (though much more is known about speech and the human brain, as well as their numerous consequences, than we presented here). Ironically, it is those of us most interested in the "soft" side of speech interfaces who would benefit most from understanding the "hardware" carried around in every person's head.


References

1. Cahn, J. The generation of affect in synthesized speech. J. Americ. Voice I/O Soc. 8 (1990), 1–19.

2. Denes, G., Caldognetto, E., Semenza, C., Vagges, K., and Zettin, M. Discrimination and identification of emotions in human voice by brain-damaged subjects. Acta Neurolog. Scandin. 69 (1984), 154–162.

3. Lee, E.-J., Nass, C., and Brave, S. Can computer-generated speech have gender? An experimental test of gender stereotypes. In Proceedings of CHI 2000 (The Hague, The Netherlands, Apr. 1–6). ACM Press, New York, 2000.

4. Massaro, D. Perceiving Talking Faces: From Speech Perception to a Behavioral Principle. MIT Press, Cambridge, Mass., 1997.

5. Mullenix, J., Johnson, K., Topcu-Durgun, M., and Farnsworth, L. The perceptual representation of voice gender. J. Acoust. Soc. Amer. 98, 6 (1995), 3080–3095.

6. Nass, C. and Gong, L. Maximized modality or constrained consistency? In Proceedings of AVSP'99 (Santa Cruz, Calif., Aug. 7–9). University of California, Santa Cruz, 1999, 1–5.

7. Nass, C. and Lee, K. Does computer-generated speech manifest personality? An experimental test of similarity-attraction. In Proceedings of CHI 2000 (The Hague, The Netherlands, Apr. 1–6). ACM Press, New York, 2000.

8. Nass, C., Moon, Y., and Green, N. Are computers gender-neutral? Gender stereotypic responses to computers. J. Appl. Soc. Psych. 27, 10 (1997), 864–876.

9. Oviatt, S., MacEachern, M., and Levow, G.-A. Predicting hyperarticulate speech during human-computer error resolution. Speech Commun. 24, 2 (1998), 87–110.

10. Reeves, B. and Nass, C. The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge University Press/CSLI, New York, 1996.

11. Slobin, D. Psycholinguistics, 2nd Ed. Scott, Foresman and Co., Glenview, Ill., 1979.

12. Van Lancker, D. and Kreiman, J. Voice discrimination and recognition are separate abilities. Neuropsych. 25, 5 (1987), 829–834.


Authors

Clifford Nass ([email protected]) is a professor in the Department of Communication at Stanford University, Stanford, CA.

Li Gong ([email protected]) is a Ph.D. student in the Department of Communication of Stanford University, Stanford, CA.


Footnotes

1. Thus, it is reasonable to describe a conversation on a telephone as "reaching out and touching someone."


©2000 ACM  0002-0782/00/0900  $5.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2000 ACM, Inc.


 
