Four articles, published across the March through May issues of Communications, highlight how people are the unique source of the adaptive capacity essential to incident response in modern Internet-facing software systems. While it is reasonable for the software engineering and operations communities to focus on the intricacies of technology, far less attention is given to the intricacies of how people do their work. Ultimately, it is human performance that makes modern business-critical systems robust and resilient.
As business-critical software systems become more successful, they necessarily increase in complexity. Ironically, this complexity makes these systems inherently messy so that surprising incidents are part and parcel of the capability to provide services at larger scales and speeds.13 Studies in resilience engineering2,12 reveal that people produce resilient performance in messy systems by doing the cognitive work of anomaly response; coordinating joint activity during events that threaten service outages; and revising their models of how the system actually works and malfunctions using lessons learned from incidents. People's resilient performance compensates for the messiness of systems, despite constant change.
Thus, incidents that threaten service outages are endemic as an emergent side effect of the increasing complexity of the interdependencies required to provide valuable services at scale. Incidents will continue to present challenges that require resilient performance, regardless of past reliability statistics. It is the cognitive work, coordination across roles, and adaptive capacity of people that resolve anomalies as they threaten to grow into service outages.4 To be more specific: modern business-critical systems work as well as they do because of the adaptive capabilities of people; and without the cognitive work that people engage in with each other, all software systems eventually fail (some with increasingly catastrophic impact, given the criticality of the services they provide).1
Richard Cook connects human performance to software tooling through his insightful "Above the Line/Below the Line" diagram.5 Cook points out that discussions focused solely on the technology miss what is actually going on in the operations of Internet-facing applications. Figure 1 in Cook's article reveals the cognitive work and joint activity that go on above the line and places the technology and tooling for development and operations below the line. The "line" here is the line of representation. No one can directly inspect or influence the processes running below the line; all understanding and action are mediated through representations.
Below the line are the facilities engineers use to develop, change, update, and operate software that enables valuable services. This includes all the components needed to create the value that businesses provide to customers: the technology stack, code repositories, data sources, and a host of tools for testing, monitoring, deployment, and performance measurement, as well as the various ways of delivering these services.
The above-the-line area in the diagram includes the people who are engaged in keeping the system running and extending its functionality. They are the ones preparing to deploy new code, monitoring system activities, and re-architecting the system. These people ask questions such as: What's it doing now? Why is it doing this? What's it going to do next? This cognitive work—observing, inferring, anticipating, planning, and intervening, among other activities—is done by interacting, not with the things themselves, but with representations of them. Interestingly, some representations (for example, dashboards) are designed by (and for) software engineers and other stakeholders.
Notice that all the above-the-line actors have mental models of what lies below the line. These models vary with people's roles and experience, as well as with their individual perspectives and knowledge, so no two actors' models are the same. This is because there are general limits on the fidelity of models of complex, highly interconnected systems.11 This is true of modern software systems and is demonstrated by studies of incident response; a common statement heard during incidents or in the postmortem meetings afterward is, "I didn't know it worked that way."12 Cook's concept and diagram reframe how Internet-facing systems function and are taken up by the other articles in the set.
Systems are developed and operate with finite resources, and they function in a constantly changing environment. Plans, procedures, automation, and roles are inherently limited; they cannot encompass all the activities, events, and demands these systems encounter. Systems operate under multiple pressures and virtually always in degraded mode.13
The adaptive capacity of complex systems resides in people. It is people who adapt to meet the inevitable challenges, pressures, trade-offs, resource scarcity, and surprises that occur. A slang term from World War II captures both the state of the system and the acceptance of the people who made things work: SNAFU (situation normal, all fouled up). With this term, soldiers were acknowledging that this is the usual status and their jobs were to make the flawed and balky parts work. If SNAFU is normal, then SNAFU catching is essential—resilient performance depends on the ability to adapt outside of standard plans, which inevitably break down.
However technologically facilitated, SNAFU catching is a fundamentally human capability that is essential for viability in a world of change and surprise. Some people in some roles provide the essential adaptive capacity for SNAFU catching, though the catching itself may be local, invisible to distant perspectives, or even conducted out of organizational view.9
Surprises in complex systems are inevitable. Resilience engineering enhances the adaptive capacity needed for response to surprises. A system with adaptive capacity is poised to adapt. It has some readiness to change how it currently works—its models, plans, processes, behaviors—when it confronts anomalies and surprises.11 Adaptive capacity is the potential to modify plans so that they continue to fit changing situations. NASA's Mission Control Center in Houston is a positive case study for this capability, especially how Space Shuttle mission controllers developed skill at handling anomalies, expecting that the next anomaly they would experience was unlikely to match any of the ones from the past that they had practiced or experienced.10
IT-based companies exist in a pressurized world where technology, competitors, and stakeholders change. Their success requires scaling and transforming infrastructure to accommodate increasing demand and build new products. These factors add complexity (for example, having to cope with incident response involving third-party software dependencies) and produce surprising anomalies.1,12 Knowing they will experience anomalies, IT-based companies, organizations, and governments need to be fluent at change and poised to adapt.13
Marisa Grayson describes her results from examining a key function above the line by studying the cognitive work of anomaly response as people respond to an evolving incident.6 Grayson focuses on the general function of hypothesis exploration during anomaly response.14 Hypothesis exploration begins with recognition of an anomaly (that is, a difference between what is observed and the observer's expectations). Those expectations are derived from the observer's model of the system and the specific context of operations. Anomaly recognition in large, interconnected, and partially autonomous systems is particularly difficult. Sensemaking is challenging when monitoring a continuous flow of changing data about events that might be relevant. This is the norm for many Internet-facing business systems: Data streams are wide and fast flowing; normal variability is high; alert overload is common; operations and observations, as well as technology, are highly distributed. To make matters worse, the representations typically available require long chains of inference rather than supporting direct visualization of anomalous behaviors.
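To make the notion of anomaly recognition concrete, here is a minimal sketch (our illustration, not drawn from Grayson's study) in which "expected" behavior is a rolling baseline over a hypothetical latency stream and an anomaly is a large departure from that expectation. Real monitoring pipelines are far richer, but the core comparison of observations against a model of what should be happening is the same.

```python
# Minimal sketch (illustrative only): anomaly recognition framed as a mismatch
# between observed values and the observer's expectations, where "expectations"
# are a rolling baseline over a hypothetical request-latency stream.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=30, threshold=3.0):
    """Yield (index, value) pairs where a sample deviates sharply from the baseline."""
    history = deque(maxlen=window)
    for i, value in enumerate(samples):
        if len(history) >= window:
            expected, spread = mean(history), stdev(history)
            if spread > 0 and abs(value - expected) / spread > threshold:
                yield i, value  # observed behavior departs from the current model
        history.append(value)

# Example: steady latency with one surprising spike.
latencies = [100 + (i % 5) for i in range(60)] + [400] + [100] * 10
print(list(detect_anomalies(latencies)))  # flags the spike at index 60
```

Even this toy example hints at why alert overload is so common: the window and threshold encode a model of "normal" that drifts out of date as the system underneath keeps changing.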
Grayson's results show how practitioners generate, revise, and test potential explanations that could account for the unexpected findings. She developed a method to diagram and visualize hypothesis exploration based on the above-the-line/below-the-line framework.
Her charts reveal the typical flow of exploration: multiple hypotheses are generated to account for the anomalies, and this set of hypotheses changes over time. As response teams converge on an assessment of the situation, they frequently revise which hypotheses are considered candidates and how much confidence they place in each. In her study, Grayson found that sometimes a hypothesis that had been considered confirmed was overturned as new evidence came to the fore.
In hindsight, people focus on the answer that resolved the incident. The quality of anomaly response, however, is directly related to the ability to generate and consider a wide range of hypotheses and to revise hypotheses as the situation changes over time—for example, when interventions to resolve problems end up producing additional unexpected behavior.
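As a loose illustration of that flow (a toy sketch, not Grayson's method or diagrams), the following keeps a set of candidate hypotheses whose relative weights are revised as evidence arrives, including demoting a hypothesis that had looked all but confirmed. The hypothesis names and pieces of evidence are invented for the example.

```python
# Toy illustration: a hypothesis set whose relative confidence is revised as
# evidence arrives during anomaly response.
hypotheses = {
    "bad deploy": 0.5,
    "database saturation": 0.3,
    "third-party API outage": 0.2,
}

def revise(hypotheses, evidence, fit):
    """Scale each hypothesis by how well it fits the new evidence, then renormalize."""
    for name, factor in fit.items():
        hypotheses[name] *= factor
    total = sum(hypotheses.values())
    for name in hypotheses:
        hypotheses[name] /= total
    print(f"after '{evidence}':", {k: round(v, 2) for k, v in hypotheses.items()})

# Rolling back the deploy does not clear the errors, so the leading hypothesis is demoted.
revise(hypotheses, "errors persist after rollback",
       {"bad deploy": 0.1, "database saturation": 1.0, "third-party API outage": 1.0})
revise(hypotheses, "vendor status page reports degradation",
       {"bad deploy": 0.5, "database saturation": 0.3, "third-party API outage": 1.0})
```

The point is not the arithmetic but the shape of the activity: several candidates are kept in play, and confidence is redistributed as the situation unfolds rather than locked in early.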
Laura Maguire's article8 expands the above-the-line frame by examining what coordination across multiple roles looks like when events threaten service outages, especially how people adapt to control the costs associated with coordinating the joint activities needed to resolve the situation. The value of coordination across roles and perspectives is well established as people wrestle with uncertainty and risk during an incident response. Handling anomalies in risky worlds such as space mission operation centers is one example.10 But studies of joint activity also reveal the costs of coordination can offset the benefits of involving multiple people and automation in situation management.7
However, this earlier research looked at anomaly response anchored in physical control rooms where responders were collocated in open workspaces. Internet-facing software systems are managed differently, as the norm is for responders to be physically distributed. People connect via ChatOps channels, unable to observe each other directly, and the cognitive costs of coordination are greater for geographically distributed groups. Maguire's article describes how this distributed, tool-mediated way of working both enables and constrains joint activity. For example, growth has led to third-party software dependencies that require coordination across organization (and company) boundaries during anomaly response.
In her research, Maguire asks the question: What do practitioners do to control the costs of coordination as they carry out anomaly response under uncertainty, risk, and pressure? Her results are based on studying how software engineers experience these "costs" across a set of incident response cases. They highlight the shortcomings of traditional ways of coordinating roles and managing the costs of coordination, such as designating an incident commander, disciplined procedure-following based on an incident command system, and efforts to use IT prosthetics such as bots. Maguire's work reveals how people adapt when the costs of coordination grow. Understanding these adaptations can help in designing effective tools, altering roles, and building organizational frameworks that enhance joint activity and reduce the costs of coordination during incident response.
There is a significant gap between how we imagine incidents occur (and are resolved) and how they actually occur.3 J. Paul Reed considers how organizations learn to close this gap (see his article on p. 58 of this issue). He broadens the perspective to reveal the factors that affect how learning from incidents can be narrow and reactive or broad and proactive. Broad, proactive learning keeps pace with change, continually recharging the sources of adaptive capacity that lead to resilient performance.
Reed's research highlights an important but often invisible driver of work above the line—the ways people capture lasting memories of past incidents and how these memories are used by those not present or involved with handling the incidents at the time. How do people come to understand what happened? How do they share attributions about why it happened? Why do some incidents attract more organizational attention than others?
Organizations usually reserve limited resources to study events that have resulted in (or come close to) significant service degradation. Social, organizational, and regulatory factors constrain what learning is possible from such events. In contrast, proactive learning about resilient performance and adaptive capacities focuses on how cognitive work usually goes well despite all of the difficulties, limited resources, trade-offs, and surprises. The data and analyses in previous reports illustrate the potential insights to be gained from in-depth examination of the cognitive work of incident response.2,12
In this piece, as in the others in the set, a theme repeats: incidents are opportunities to update and revise models of the ways organizations generate and sustain the adaptive capacities needed to handle surprising challenges as IT systems grow and operate at new scales. If you take the view that systems are up, working, and successful because of the adaptive capacity of the people in them, then incidents can be reframed as ongoing opportunities to revise mental models as the organization, technology, and infrastructure change, grow, and scale.4
Together, the four articles provide a sketch of what is happening above the line of representation, especially during incident response. These activities are essential to building, fielding, and revising the modern information technology on which our society increasingly depends. Understanding how people detect anomalies, work together resolving incidents, and learn from those experiences is essential for having more resilient systems in the future.
The intimate relationship between human expertise and the technological components of modern systems defies linear decomposition. As Cook shows, there is really only one system here; understanding how it works depends on an awareness of how people's capacity to adapt is sometimes facilitated and at other times frustrated by the technology. The articles by Grayson, Maguire, and Reed demonstrate how looking at incidents through the lens of cognitive work, joint activity, and proactive learning provides new insights about how this human-technology system really works. Incidents are challenges that reveal the system doesn't work the way it has been imagined. The experience of the incident and the post-incident inquiry offer learning opportunities that highlight where mental models need revision.
The articles go further, though. Together they highlight how everyone's mental models of Internet-facing software systems are in need of significant revision. Human cognitive, collaborative, and adaptive performance is central to software engineering and operations. As the scale and complexity of the software systems necessary to provide critical services continue to increase, what goes on above the line will remain central to all stories of growth, success, precariousness, and breakdown.
Understanding, supporting, and sustaining the capabilities above the line require all stakeholders to be able to continuously update and revise their models of how the system is messy and yet usually manages to work. When organizations value openness to continually reexamining how the system really works, they can follow the tangible paths these articles provide to learn how to learn from incidents.
1. Allspaw, J. Human factors and ergonomics practice in web engineering and operations: Navigating a critical yet opaque sea of automation. Human Factors and Ergonomics in Practice. S. Shorrock and C. Williams, eds. CRC Press (Taylor & Francis), Boca Raton, FL, 2016, 313—322.
2. Allspaw, J. Trade-offs under pressure: Heuristics and observations of teams resolving Internet service outages. Master's thesis. Lund University, Lund, Sweden, 2015.
3. Allspaw, J. Incidents as we imagine them versus how they actually are. PagerDuty Summit 2018. YouTube; https://www.youtube.com/watch?v=8DtzmV1jiyQ.
4. Allspaw, J. and Cook, R.I. SRE cognitive work. Seeking SRE: Conversations about Running Production Systems at Scale. D. Blank-Edelman, ed. O'Reilly Media, 2018, 441—465.
5. Cook, R.I. Above the line, below the line. Comm. ACM 63, 3 (Mar. 2020), 43—46.
6. Grayson, M.R. Cognitive work of hypothesis exploration during anomaly response. Comm. ACM 63, 4 (Apr. 2020), 97—103.
7. Klein, G., Feltovich, P.J., Bradshaw, J.M. and Woods, D.D. Common ground and coordination in joint activity. Organizational Simulation. W. Rouse and K. Boff, eds. Wiley, 2005, 139—184.
8. Maguire, L.M.D. Managing the hidden costs of coordination. Comm. ACM 63, 4 (Apr. 2020), 90—96.
9. Perry, S.J. and Wears, R.L. Underground adaptations: Cases from health care. Cognition, Technology & Work 14, 3 (2012), 253—260; doi.org/10.1007/s10111-011-0207-2.
10. Watts-Perotti, J. and Woods, D.D. Cooperative advocacy: a strategy for integrating diverse perspectives in anomaly response. Computer Supported Cooperative Work: The Journal of Collaborative Computing 18, 2 (2009), 175—98.
11. Woods, D.D. Four concepts of resilience and the implications for resilience engineering. Reliability Engineering and Systems Safety 141 (2015), 5—9; doi:10.1016/j.ress.2015.03.018.
12. Woods, D.D. Stella Report from the SNAFUcatchers Workshop on Coping with Complexity, 2017; https://snafucatchers.github.io/.
13. Woods, D.D. Resilience is a verb. IRGC Resource Guide on Resilience (vol. 2): Domains of Resilience for Complex Interconnected Systems. B.D. Trump, M.-V. Florin, and I. Linkov, eds. EPFL International Risk Governance Center, Lausanne, Switzerland, 2018. https://www.researchgate.net/publication/329035477_Resilience_is_a_Verb.
14. Woods, D.D. and Hollnagel, E. Joint Cognitive Systems: Patterns in Cognitive Systems Engineering. CRC Press (Taylor & Francis), Boca Raton, FL, 2006.