The computation speed and control methods needed to portray 3D virtual humans suitable for interactive applications have improved dramatically in recent years. Real-time virtual humans show increasingly complex features along the dimensions of appearance, function, time, autonomy, and individuality. The virtual human architecture we've been developing at the University of Pennsylvania is representative of an emerging generation of such architectures and includes low-level motor skills, a mid-level parallel automata controller, and a high-level conceptual representation for driving virtual humans through complex tasks. The architecture, called Jack, provides a level of abstraction generic enough to encompass natural-language instruction representation as well as direct links from those instructions to animation control.
Only 50 years ago, computers could barely compute useful mathematical functions. About 25 years ago, enthusiastic computer researchers were predicting that game-playing machines and autonomous robots performing such surrogate functions as mining gold on asteroids were in our future. Today's truth lies somewhere in between. We have balanced our expectations of complete machine autonomy with a more rational view that machines should assist people in accomplishing meaningful, difficult, and often enormously complex tasks. When such tasks involve human interaction with the physical world, computational representations of the human body (virtual humans) can be used to escape the constraints of presence, safety, and even physicality.
Why are real-time virtual humans so difficult to construct? After all, anyone who can watch a movie can see marvelous synthetic animals, characters, and people. But they are typically created for a single scene or movie and are neither autonomous nor meant to engage in interactive communication with real people. What makes a virtual human human is not just a well-executed exterior design, but movements, reactions, self-motivated decision making, and interactions that appear "natural," appropriate, and contextually sensitive. Virtual humans designed to be able to communicate with real people need uniquely human abilities to show us their actions, intentions, and feelings, building a bridge of empathy and understanding. Researchers in virtual human characters seek methods to create digital people that share our human time frame as they act, communicate, and serve our applications.
Still, many interactive and real-time applications already involve the portrayal of virtual humans, including:
Along with general industry-driven improvements in the underlying computer and graphical display technologies, virtual humans will enable quantum leaps in applications normally requiring personal and live human participation. The emerging MPEG-4 specification, for example, includes face- and body-animation parameters for real-time display synthesis.
Building models of virtual humans involves application-dependent notions of fidelity. For example, fidelity to human size, physical abilities, and joint and strength limits is essential in such applications as design evaluation. And in games, training, and military simulations, temporal fidelity in real-time behavior is even more important. Appreciating that different applications require different sorts of virtual fidelity prompts a number of questions about what makes a virtual human "right": What do you want to do with it? What should it look like? What characteristics are important to the application's success? What type of interaction is most appropriate?
Different models of virtual-human development provide different gradations of fidelity; some are quite advanced in a particular narrow area but are more limited for other desirable features. In a general way, we can characterize the state of virtual-human modeling along at least five dimensions (appearance, function, time, autonomy, and individuality), each of which can be refined progressively from simpler to more complex features.
Different applications require human models that individually customize these dimensions (see Table 1). A model tuned for one application may be inadequate for another. And many research and development efforts concentrate on refining one or more dimensions deeper into their special features. One challenge for commercial efforts is the construction of virtual human models with enough parameters to effectively support several application areas.
At the University of Pennsylvania, we have been researching and developing virtual human figures for more than 25 years [2]. Our framework is comprehensive and representative of a broad multiapplication approach to real-time virtual humans. The foundation for this research is Jack, our software system for creating, sizing, manipulating, and animating virtual humans. Our philosophy has yielded a particular virtual-human development model that pushes the five dimensions of virtual-human performance toward the more complex features. Here, we focus on the related architecture, which supports enhanced functions and autonomy, including control through textual (and eventually spoken) human natural-language instructions.
Other universities pursuing virtual human development include: the computer graphics laboratory at the Swiss Federal Institute of Technology in Lausanne, Georgia Institute of Technology, Massachusetts Institute of Technology Media Lab, New York University, the University of Geneva, the University of Southern California, and the University of Toronto. Companies include: ATR Japan, Credo, Engineering Animation, Extempo, Kinetix, Microsoft, Motion Factory, Phillips, Sony, and many others [3, 12].
Building a virtual human model that admits control from sources other than direct animator manipulations requires an architecture that supports higher-level expressions of movement. Although layered architectures for autonomous beings are not new, we have found that a particular set of architectural levels seems to provide efficient localization of control for both graphics and language requirements. A description of our multilevel architecture starts with typical graphics models and articulation structures, and includes various motor skills for endowing virtual humans with useful abilities. The higher architectural levels organize these skills with parallel automata, use a conceptual representation to describe the actions a virtual human can perform, and finally create links between natural language and action animation.
Graphical models. A typical virtual human model design consists of a geometric skin and an articulated skeleton. Usually modeled with polygons to optimize graphical display speed, a human body can be crafted manually or shaped more automatically from body segments digitized by laser scanners. The surface may be rigid or, more realistically, deformable during movement. Deformation demands additional modeling and computational loads. Clothes are desirable, though today, loose garments have to be animated offline due to computational complexity.
The skeletal structure is usually a hierarchy of joint rotation transformations. The body is moved by changing the joint angles and the body's global position and orientation. In sophisticated models, joint angle changes induce geometric modifications that keep joint surfaces smooth and mimic human musculature within a character's particular body segment (see Figure 1).
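To make the structure concrete, here is a minimal C++ sketch of such a skeleton; the types and names (Mat3, Joint, poseSkeleton) are invented for illustration, not Jack's actual data structures. Each joint stores a rotation relative to its parent, and composing those rotations down the tree places every joint, and the skin attached to it, in world space.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Minimal math types; a real system would use a proper linear-algebra library.
struct Vec3 { float x = 0, y = 0, z = 0; };

struct Mat3 {
    // Row-major rotation matrix, identity by default.
    float m[3][3] = { {1, 0, 0}, {0, 1, 0}, {0, 0, 1} };
};

Mat3 multiply(const Mat3& a, const Mat3& b) {
    Mat3 r;
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j) {
            r.m[i][j] = 0;
            for (int k = 0; k < 3; ++k) r.m[i][j] += a.m[i][k] * b.m[k][j];
        }
    return r;
}

Vec3 rotate(const Mat3& a, const Vec3& v) {
    return { a.m[0][0]*v.x + a.m[0][1]*v.y + a.m[0][2]*v.z,
             a.m[1][0]*v.x + a.m[1][1]*v.y + a.m[1][2]*v.z,
             a.m[2][0]*v.x + a.m[2][1]*v.y + a.m[2][2]*v.z };
}

// One joint in the articulated skeleton: a rotation relative to its parent
// plus a fixed offset (the bone) expressed in the parent's frame.
struct Joint {
    std::string name;             // e.g. "shoulder", "elbow"
    Vec3 offset;                  // bone vector in the parent's frame
    Mat3 rotation;                // joint angles, changed by motor skills
    std::vector<Joint> children;  // requires C++17 for the incomplete element type
};

// Compose transforms down the hierarchy to place every joint in world space.
void poseSkeleton(const Joint& j, const Mat3& parentRot, const Vec3& parentPos) {
    Vec3 bone = rotate(parentRot, j.offset);
    Vec3 worldPos = { parentPos.x + bone.x, parentPos.y + bone.y, parentPos.z + bone.z };
    Mat3 worldRot = multiply(parentRot, j.rotation);
    std::printf("%s at (%.2f, %.2f, %.2f)\n", j.name.c_str(), worldPos.x, worldPos.y, worldPos.z);
    for (const Joint& c : j.children) poseSkeleton(c, worldRot, worldPos);
}
```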
Real-time virtual humans controlled by real humans are called "avatars." Their joint angles and other location parameters are sensed by magnetic, optical, and video methods and converted to joint rotations and body pose. For movements not based on live performance, computer programs have to generate the right sequences and combinations of parameters to create the desired actions. Procedures for changing joint angles and body position are called motion generators, or motor skills.
Motor skills. Virtual human motor skills include:
Numerous methods help create each of these movements, but we want to allow several of them to be executed simultaneously. A virtual human should be able to walk, talk, and chew gum at the same time. Simultaneous execution also leads to the next level of our architecture's organization: parallel automata.
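One plausible way to organize this, sketched below with hypothetical class names rather than Jack's actual interfaces, is to treat each motor skill as an object that advances its own state every frame and writes the joint angles it owns; several skills can then be ticked side by side as long as they drive different joints (or are blended where they overlap).

```cpp
#include <memory>
#include <vector>

// A motor skill is a procedure that updates joint angles each frame.
// The names here (MotorSkill, WalkSkill, GestureSkill) are illustrative only.
class MotorSkill {
public:
    virtual ~MotorSkill() = default;
    virtual void update(float dt) = 0;   // advance the skill by dt seconds
    virtual bool finished() const = 0;
};

class WalkSkill : public MotorSkill {
    float distanceLeft;
public:
    explicit WalkSkill(float meters) : distanceLeft(meters) {}
    void update(float dt) override {
        distanceLeft -= 1.2f * dt;       // would drive leg and pelvis joints here
    }
    bool finished() const override { return distanceLeft <= 0; }
};

class GestureSkill : public MotorSkill {
    float timeLeft;
public:
    explicit GestureSkill(float seconds) : timeLeft(seconds) {}
    void update(float dt) override {
        timeLeft -= dt;                  // would drive arm and hand joints here
    }
    bool finished() const override { return timeLeft <= 0; }
};

// Run every active skill once per frame; walking and gesturing proceed together.
void tick(std::vector<std::unique_ptr<MotorSkill>>& active, float dt) {
    for (auto& skill : active)
        if (!skill->finished()) skill->update(dt);
}
```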
Parallel transition networks. Almost 20 years ago, we realized that human animation would require some model of parallel movement execution. But it wasn't until about 10 years ago that graphical workstations finally became powerful enough to support functional implementations of simulated parallelism. Our parallel programming model for virtual humans is called Parallel Transition Networks, or PaT-Nets. Other human animation systems, including Motion Factory's Motivate and New York University's Improv [9], have adopted similar paradigms with alternative syntactic structures. In general, network nodes represent processes, while the arcs connecting the nodes carry predicates, conditions, rules, and other functions that trigger transitions to other process nodes. Synchronization across processes or networks is achieved through message passing or global-variable blackboards that let one process know the state of another.
The benefits of PaT-Nets derive not only from their parallel organization and execution of low-level motion generators, but from their conditional structure. Traditional animation tools use linear timelines on which actions are placed and ordered. A PaT-Net provides a nonlinear animation model, since movements can be triggered, modified, and stopped by transitions to other nodes. This type of nonlinear animation is a crucial step toward autonomous behavior, since conditional execution enables a virtual human's reactivity and decision making.
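The C++ sketch below conveys the flavor of such a network using invented names; it is not the PaT-Net implementation. Nodes hold a process, such as a motor-skill invocation, and the arcs leaving a node carry condition predicates tested each tick to decide whether to transition.

```cpp
#include <functional>
#include <string>
#include <vector>

struct NetNode;

// An arc: a predicate plus the node to jump to when the predicate becomes true.
struct Arc {
    std::function<bool()> condition;
    NetNode* target;
};

// A node: a process run while the node is current, plus its outgoing arcs.
struct NetNode {
    std::string name;
    std::function<void(float)> process;  // e.g., invoke a motor skill each tick
    std::vector<Arc> arcs;
};

// One network among many running "in parallel": each tick, run the current
// node's process, then follow the first arc whose condition holds.
struct Net {
    NetNode* current = nullptr;
    void tick(float dt) {
        if (!current) return;
        if (current->process) current->process(dt);
        for (const Arc& a : current->arcs)
            if (a.condition && a.condition()) { current = a.target; break; }
    }
};

// Several nets (locomotion, gaze, speech, ...) are stepped together each frame,
// coordinating through shared state (a "blackboard") or messages.
void step(std::vector<Net>& nets, float dt) {
    for (Net& n : nets) n.tick(dt);
}
```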
Providing a virtual human with humanlike reactions and decision-making skills is more complicated than just controlling its joint motions from captured or synthesized data. Simulated humanlike actions and decisions are how we convince the viewer of the character's skill and intelligence in negotiating its environment, interacting with its spatial situation, and engaging other agents. This level of performance requires significant investment in action models that allow conditional execution. We have programmed a number of experimental systems to show how the PaT-Net architecture can be applied, including the game "Hide and Seek," two-person animated conversation [3], simulated emergency medical care [4], and the multiuser virtual world JackMOO [10].
PaT-Nets are effective but must be hand-coded in C++. No matter what artificial language we invent to describe human actions, it is not likely to represent exactly the way people conceptualize a particular situation. We therefore need a higher-level representation to capture additional information, parameters, and aspects of human action. We create such representations by incorporating natural-language semantics into our parameterized action representation.
Conceptual action representation. Even with a powerful set of motion generators and PaT-Nets to invoke them, we still have to provide effective and easily learned user interfaces to control, manipulate, and animate virtual humans. Interactive point-and-click tools (such as Maya from Alias | Wavefront, 3D StudioMax from Autodesk, and SoftImage from Avid), though usable and effective, require specialized training and animation skills and are fundamentally designed for off-line production. Such interfaces disconnect the human participant's instructions and actions from the avatar through a narrow communication channel of hand motions. A programming language or scripting interface, while powerful, is yet another off-line method requiring specialized programming expertise.
A relatively unexplored option is a natural-language-based interface, especially for expressing the intentions behind a character's motions. Perhaps not surprisingly, instructions for real people are given in natural language, augmented with graphical diagrams and, occasionally, animations. Recipes, instruction manuals, and interpersonal conversations can therefore use language as their medium for conveying process and action.
We are not advocating that animators throw away their tools, only that natural language offers a communication medium we all know and can use to formulate instructions for activating the behavior of virtual human characters. Some aspects of some actions are certainly difficult to express in natural language, but the availability of a language interpreter can bring the virtual human interface more in line with real interpersonal communication modes. Our goal is to build smart avatars that understand what we tell them to do in the same way humans follow instructions. These smart avatars have to be able to process a natural-language instruction into a conceptual representation that can be used to control their actions. This representation is called a parameterized action representation, or PAR (see Figure 2).
The PAR has to specify the agent of the action, as well as any relevant objects and information about paths, locations, manners, and purposes for a particular action. There are linguistic constraints on how this information can be conveyed by the language; agents and objects tend to be verb arguments, paths are often prepositional phrases, and manners and purposes might be in additional clauses [8]. A parser maps the components of an instruction into the parameters or variables of the PAR, which is then linked directly to PaT-Nets executing the specified movement generators.
Natural language often describes actions at a high level, leaving out many of the details that have to be specified for animation, as discussed in a similar approach in [7]. We use the example "Walk to the door and turn the handle slowly" to illustrate the function of the PAR. When the PAR system processes this instruction, it finds nothing explicit in the linguistic representation about grasping the handle or about which direction it will have to be turned, yet this information is necessary for the action's actual visible performance. The PAR has to include information about applicability and preparatory and terminating conditions in order to fill in these gaps. It also has to be parameterized, because other details of the action depend on the PAR's participants, including agents, objects, and other attributes.
The representation of the "handle" object lists the actions that object can perform and what state changes they cause. The number of steps it will take to get to the door depends on the agent's size and starting location. Some of the parameters in a PAR template are shown in Figure 3 and are defined in the following ways:
A PAR takes one of two forms: an uninstantiated PAR (UPAR) or an instantiated PAR (IPAR). We store all UPARs, which contain default applicability conditions, preconditions, and execution steps, in a hierarchical database called the Actionary. Multiple entries are allowed, in the same way verbs have multiple contextual meanings. An IPAR is a UPAR instantiated with specific information on agent, physical object(s), manner, terminating conditions, and more. Any new information in an IPAR overrides the corresponding UPAR default. An IPAR can be created by the parser (one IPAR for each new instruction) or dynamically during execution, as in Figure 2.
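A compressed C++ sketch of that relationship follows; the field names are ours, not the published PAR schema. The UPAR found in the Actionary supplies defaults, and the IPAR built by the parser for an instruction such as "turn the handle slowly" overrides them with the agent, objects, and manner it extracted.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// Uninstantiated PAR: the template stored in the Actionary for one action sense.
struct UPAR {
    std::string action;                         // e.g. "turn"
    std::vector<std::string> applicability;     // conditions that must hold
    std::vector<std::string> preparatory;       // actions that achieve preconditions
    std::vector<std::string> executionSteps;    // default expansion into sub-actions
};

// Instantiated PAR: a UPAR plus the specifics parsed from one instruction.
struct IPAR {
    const UPAR* templatePar;                    // the Actionary entry chosen
    std::string agent;                          // e.g. "agent1"
    std::vector<std::string> objects;           // e.g. { "door_handle" }
    std::optional<std::string> manner;          // e.g. "slowly"
    std::optional<std::string> terminatingCond; // overrides the template default
};

// The Actionary maps action words to (possibly several) UPAR senses.
using Actionary = std::multimap<std::string, UPAR>;

// The parser would build one IPAR per instruction, choosing a UPAR sense by
// context; the IPAR is then handed to PaT-Nets for execution.
IPAR instantiate(const Actionary& dict, const std::string& verb,
                 const std::string& agent, const std::vector<std::string>& objs) {
    auto it = dict.find(verb);                  // first matching sense, for brevity
    return IPAR{ it != dict.end() ? &it->second : nullptr, agent, objs, {}, {} };
}
```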
A language interpreter promotes a language-centered view of action execution, augmented and elaborated by parameters modifying lower-level motion synthesis. Although textual instructions can describe and trigger actions, details need not be communicated explicitly. The smart avatar PAR architecture interprets instruction semantics with motion generality and context sensitivity. In a prototype implementation of this architecture, called Jack's MOOse Lodge [10], four smart avatars are controlled by simple imperative instructions (see Figure 4). One agent, the waiter, is completely autonomous, serving drinks to the seated avatars when their glasses need filling. Another application runs a military checkpoint (see Figure 5).
Given this architecture, do we see the emergence of realistic humanlike movements, actions, and decisions? Yes and no. We see complex activities and interactions. But we also know we're not fooling anyone into thinking that these virtual humans are real. Some of this inability to mimic real human movements and interactions perfectly has to do with graphical appearance and motion details; real humans readily identify synthetic movements. Motion captured from live performances is much more natural, but more difficult to alter and parameterize for reuse in other contexts.
One promising approach to natural movement is through a deeper look into physiological and cognitive models of behavior. For example, we have built a visual attention system for the virtual human that uses known perceptual and cognitive parameters to drive the movement of our characters' eyes (see Terzopoulos's "Artificial Life for Computer Graphics" in this issue). Visual attention is based on a queue of tasks and exogenous events that can occur arbitrarily [1]. Since attention is a resource, task performance degrades naturally as the environment becomes cluttered.
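A rough sketch of that queuing idea, in our own terms rather than the published model: deliberate gaze tasks and exogenous events compete for a fixed per-interval budget of fixations, so in a cluttered scene some requests simply wait and task performance degrades.

```cpp
#include <deque>
#include <string>

// Attention targets: deliberate task-related glances and exogenous events
// (something moved, something appeared) that preempt them.
struct GazeRequest {
    std::string target;
    bool exogenous;      // unplanned events jump the queue
};

class AttentionQueue {
    std::deque<GazeRequest> pending;
public:
    void post(const GazeRequest& r) {
        if (r.exogenous) pending.push_front(r);   // interrupts current tasks
        else             pending.push_back(r);
    }
    // Attention is a limited resource: only 'budget' fixations per interval,
    // so in a cluttered environment some requests wait and performance drops.
    void service(int budget) {
        while (budget-- > 0 && !pending.empty()) {
            // aim the eyes at pending.front().target here
            pending.pop_front();
        }
    }
};
```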
Another approach is to observe human movement and understand the qualitative parameters that shape performance. In the real world, the shaping of performance is a physical process; in our simulated worlds, assuming we choose the right controls, it may be modeled kinematically. That's why we implemented an interpretation of Laban's effort notation, which characterizes the qualitative rather than the quantitative aspects of movement, to create a parameterization of agent manner [1]. The effort elements, weight, space, time, and flow, can be combined and phrased to vary the performance of a given gesture.
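The sketch below, our own simplification rather than the published effort implementation, shows how the four effort elements might be carried as a compact parameter set that scales kinematic qualities of a gesture.

```cpp
// The four Laban effort elements, each ranging over [-1, +1]
// (e.g. weight: light ... strong; time: sustained ... sudden).
struct Effort {
    float weight = 0;   // light <-> strong
    float space  = 0;   // indirect <-> direct
    float time   = 0;   // sustained <-> sudden
    float flow   = 0;   // free <-> bound
};

// Illustrative mapping only: efforts modulate kinematic qualities of a gesture,
// such as how fast it plays and how tightly it follows its path.
struct GestureParams {
    float speedScale;
    float pathDirectness;
    float tension;
};

GestureParams applyEffort(const Effort& e) {
    return GestureParams{
        1.0f + 0.5f * e.time,        // sudden movements play faster
        0.5f + 0.5f * e.space,       // direct movements take straighter paths
        0.5f + 0.5f * e.flow         // bound movements look more controlled
    };
}
```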
Within five years, virtual humans will have individual personalities, emotional states, and live conversations [11]. They will have roles, gender, culture, and situational awareness. They will have reactive, proactive, and decision-making behaviors for action execution [6]. But to do these things, they will need individualized perceptions of context. They will have to understand language so real humans can communicate with them as if they were real.
The future holds great promise for the virtual humans populating our virtual worlds. They will provide economic benefits by helping designers build more human-centered vehicles, equipment, assembly lines, manufacturing plants, and interactive systems. Virtual humans will enhance the presentation of information through training aids, virtual experiences, teaching, and mentoring. They will help save lives by providing surrogates for medical training, surgical planning, and remote telemedicine. They will be our avatars on the Internet, portraying ourselves to others as we are, or perhaps as we wish to be. And they may help turn cyberspace into a real community.
1. Badler, N., Chi, D., and Chopra, S. Virtual human animation based on movement observation and cognitive behavior models. In Proceedings of the Computer Animation Conference (Geneva, Switzerland, May 8-10). IEEE Computer Society, Los Alamitos, Calif., 1999, pp. 128-137.
2. Badler, N., Phillips, C., and Webber, B. Simulating Humans: Computer Graphics Animation and Control. Oxford University Press, New York, 1993; see www.cis.upenn.edu/~badler/book/book.html.
3. Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, W., Douville, B., Prevost, S., and Stone, M. Animated conversation: Rule-based generation of facial expression, gesture and spoken intonation for multiple conversational agents. In Proceedings of Computer Graphics, Annual Conf. Series (Orlando, Fla., July 24-29). ACM Press, New York, 1994, pp. 413-420.
4. Chi, D., Webber, B., Clarke, J., and Badler, N. Casualty modeling for real-time medical training. Presence 5, 4 (Fall 1995), 359-366.
5. Earnshaw, R., Magnenat-Thalmann, N., Terzopoulos, D., and Thalmann, D. Computer animation for virtual humans. IEEE Comput. Graph. Appl. 18, 5 (Sept.-Oct. 1998), 20-23.
6. Johnson, W., and Rickel, J. Steve: An animated pedagogical agent for procedural training in virtual environments. SIGART Bulletin 8, 1-4 (Fall 1997), 16-21.
7. Narayanan, S. Talking the talk is like walking the walk. In Proceedings of the 19th Annual Conference of the Cognitive Science Society (Palo Alto, Calif., Aug. 7-10, 1997).
8. Palmer, M., Rosenzweig, J., and Schuler, W. Capturing motion verb generalizations with synchronous TAG. In Predicative Forms in NLP: Text, Speech, and Language Technology Series, P. St. Dizier, Ed. Kluwer Press, Dordrecht, The Netherlands, 1998.
9. Perlin, K., and Goldberg, A. Improv: A system for scripting interactive actors in virtual worlds. In Proceedings of ACM Computer Graphics, Annual Conference Series (New Orleans, Aug. 4-9). ACM Press, New York, 1996, pp. 205-216.
10. Shi, J., Smith, T., Granieri, J., and Badler, N. Smart avatars in JackMOO. In Proceedings of the IEEE Virtual Reality'99 Conference (Houston, Mar. 13-17). IEEE Computer Society Press, Los Alamitos, Calif., 1999, pp. 156-163.
11. Thorisson, K. Real-time decision making in multimodal face-to-face communication. In Proceedings of the 2nd International Conference on Autonomous Agents (Minneapolis-St. Paul, May 10-13). ACM Press, New York, 1998, pp. 16-23.
12. Wilcox, S. Web Developer.com Guide to 3D Avatars. John Wiley & Sons, New York, 1998.
This research is supported by the U.S. Air Force through Delivery Orders #8 and #17 on F41624-97-D-5002; Office of Naval Research (through the University of Houston) K-5-55043/3916-1552793, DURIP N0001497-1-0396, and AASERTs N00014-97-1-0603 and N0014-97-1-0605; Army Research Lab HRED DAAL01-97-M-0198; DARPA SB-MDA-97-2951001; NSF IRI95-04372; NASA NRA NAG 5-3990; National Institute of Standards and Technology 60 NANB6D0149 and 60 NANB7D0058; Engineering Animation, Inc., SERI, Korea, and JustSystem, Inc., Japan.