Distributed Cognition and Joint Activity in Computer System Administration

This is a paper by Paul P. Maglio, Eser Kandogan, and Eben Haber from the IBM Almaden Research Center that analyzes an architecture change. It’s interesting for a number of reasons. One is that it shows you don’t need an incident or an outage in order to do some analysis and learn from it. Learning from normal work, and from what was ultimately successful, is valuable as well. The authors perform their analysis through the lenses of distributed cognition and joint activity.

In some respects this paper also reminds me of John Allspaw’s thesis, in that it contains transcripts of what happened as well as a graph of utterances over time.

Overview of the event

Here’s a broad overview of the event, which I’ll talk about in a bit more detail later on:

Admin needs to add another “player server” to handle load.
- This requires setting up the server, configuring the “maestro” instance, and changing firewall rules.
Admin receives instructions by email on how to add the player instance, but they’re generic. He has to fill in the details that are specific to his use case, especially port numbers.
Before the researchers started observing, Admin sent a request to the network team to add 2 firewall rules.
- Allow 7137 from an external server to an internal server, and
- Allow 7236 the other way, from internal to external.
Once the firewall rules were in place, Admin started working off the instructions.
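
Because the direction of each rule matters later in the event, it may help to see the two requested rules written out as data. This is a minimal Python sketch, not anything from the paper: the Rule type, the is_allowed helper, and the host labels "external" and "internal" are hypothetical names used only for illustration, and it assumes a default-deny firewall that matches on source, destination, and destination port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One directional allow rule (hypothetical model, not a real firewall API)."""
    src: str
    dst: str
    port: int


# The two rules Admin requested, as described in the overview above.
RULES = [
    Rule(src="external", dst="internal", port=7137),
    Rule(src="internal", dst="external", port=7236),
]


def is_allowed(rules, src, dst, port):
    """Return True if some rule permits a new connection from src to dst on port."""
    return any(r.src == src and r.dst == dst and r.port == port for r in rules)


if __name__ == "__main__":
    for src, dst, port in [
        ("external", "internal", 7137),
        ("internal", "external", 7236),
        ("external", "internal", 7236),
    ]:
        print(src, "->", dst, port, is_allowed(RULES, src, dst, port))
```

Under these assumptions each port is open in one direction only, so a new connection on 7236 from the external side to the internal side would not be permitted by either rule.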

Overall the work to be done may sound simplistic, but that is one of the advantages of examining “normal” work: it reveals the complexities within it. In this case, there is a centralized “maestro” server which interacts with “player” servers on the other side of the firewall. For load balancing purposes another player server needs to be added, so that there will then be 1 maestro and 2 player servers. As simple as this may sound, this event will last just over 2 hours.

We see most of the event through the eyes of Admin, the person responsible for making the change, but there are several other people involved as well:

The project architect (Archi)
Technical support for the product (Tech)
Admin’s colleague who had access to the same systems (Colle)

Unnamed mentioned participants include a customer relationship manager, project executive, a product developer, and Admin and Colle’s manager.

The event

Admin starts off this event by going into his email to retrieve the instructions he’d received, which included commands and guidance on how to add the new server.

As is likely familiar, the instructions aren’t at all customized for his specific use but contain commands with placeholders like “internal port” that he must replace. As we’ll see, a theme of this event, and a property of working with distributed cognition, is that translation is occurring. This is only the first and perhaps smallest instance of it.

He does so, and receives no error, so he goes to the next step and receives:

Cannot reach server: Error 1231A

This is confusing in a number of ways. It doesn’t say which direction failed or which server is reporting the issue. This is where the problems start. Though the authors divide the event into 3 distinct episodes, I’ll highlight the 2 that are the most important.
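
A message like this doesn’t say which side failed, so one way to narrow it down is to test the path yourself from each side. Here is a minimal Python sketch, assuming the servers speak plain TCP. The can_connect helper is a hypothetical name of mine, not something from the paper or the product, and it only uses the standard library’s socket.create_connection.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable or unknown hosts.
        return False
```

A True result only tells you that something accepted a connection on that port from wherever you ran the check. A False result can’t distinguish between a firewall dropping the traffic and nothing listening at all, so it’s worth running from both sides of the firewall and on the server itself.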

Also during the event, Admin is getting asked for updates by both the customer relationship manager and the “project executive”.

While Admin was working with Archi and Tech, Admin was the only one who had direct access to the system and its configuration. As a result, all information passed through Admin and was mediated by various system representations he’d provide as they asked questions.

Colle, on the other hand, could access the system directly, so he could work by himself. His primary communication with Admin was reporting his findings and making suggestions. They too had to communicate through various system representations, though.

Episode 1 “Do you have the manual?”

This episode took place after about an hour of work between Admin and Archi. At this point Archi suggests calling support. Admin seems pretty wary of this and mentions that in his experience, support’s solution to things is to just “reinstall, reinstall.”

I think this is interesting, because it potentially speaks to not going in with an open mind, or to having already drawn a conclusion. On the other hand, I of course understand wanting to prioritize things that seem like they would be effective. In any case, Admin begins an IM session with support.

About 45 minutes into the chat session Tech asked to confirm that the “listen port 7234 or 7237 is listening?”. As the authors point out, this is the right question, as it points to exactly what the issue is. 7234 was the default listening port for the first instance, and 7137 (probably mistyped as 7237) was the new instance’s listen port. At this point, Admin at first mentions (to himself) that this is in fact the problem, but then decides no, “that should be fine.”

The authors speculate that Admin may have fixated on the 7234 bit and filtered out other information when replying to Tech:

It is listening 7234… Is it okay that it listens on the same port of the default instance?

At this point, Tech seems to have realized the issue:

Tech: Don't think so.
Tech: Do you have the manual?
Tech: I'm trying to find it... working from home today.

Unfortunately, when Admin sees this, he says to Archi on the phone:

You got to be kidding me! Oh God, this support guy is asking me for the manual.

Tech continues to try and get a response from Admin, but Admin stops responding shortly thereafter, saying to himself “This guy is totally useless.”

Analysis of this episode

We can see here that how people coordinate is of course important. In this case in particular, we can look at how the information flowed. Between Admin and Tech, the information was altered by Admin’s understanding (correct or otherwise).

We can use our understanding of common ground and joint activity to notice, as the authors do, that Tech asking about the manual can be viewed as an attempt at beginning a joint activity (or “joint project” as they call it), which Admin ultimately declines to participate in.

It’s possible to further analyze the breakdown in the joint activity, and the authors do so. I won’t cover it here in detail, because I don’t think it adds a lot to our understanding, as much of it is speculation about Tech’s intentions. I think that kind of speculation can be valuable when we analyze our own events, with people we have worked with or have access to, but given this sort of thirdhand information I don’t think we can learn a whole lot from it here.

Episode 2 “What are you talking about?”

This episode takes place between Admin and his “close colleague,” Colle. Colle had been told by the customer relationship manager to help Admin, and it’s ultimately their work together that solved the problem.

This episode highlights that working remotely may actually make some forms of analysis easier. In this case Colle goes to Admin’s office at about an hour into the event, which is captured because of the recordings by the researchers.

Recordings like this are of course unlikely to be easily available to many of us, but as more and more teams are distributed, they are also less and less needed to capture these interactions.

For the most part, Colle worked from his office, adjacent to Admin’s. Unlike our other characters, Colle had direct access to the servers and used his own computer. This changes the interaction. It’s no longer filtered by Admin’s understanding, though much of the discussion is still taking place through representations of the system state.

About 2 hours into the event, Colle discovers that the maestro server was trying to communicate with the player server on port 7137 and IMs Admin.

Colle: We were supposed to use 7236. Unconfigure that instance and ...
Admin: Can’t specify a return port... you only specify one port

Colle then goes on to explain how he arrived at this conclusion, by pasting commands that he ran to test this and confirm.

The authors describe what happens next as “the exchange became more heated,” though I think perhaps that is an understatement.

Colle: You specified the wrong port.
Admin: No, I didn't.
Colle: You did it wrong. Yes, you did. You need to put in 7236.
Admin: we just didn't tell to go both ways. The other port has nothing to do with this.
Colle: Well, all I know is what I see in the conf file
Admin: we thought that was the return port. That is not a return port.
Colle: there currently is no listener on <internal server> on 7137. So use 7236. DO IT!

At this point Admin calls Colle on the phone to discuss. I think this pattern is fairly familiar to many of us. When text communication seems to be breaking down we often look for a higher bandwidth and/or more synchronous form of communication.

Analysis of this episode

I don’t want to judge Admin too harshly without knowing more about the situation. I can’t help but notice, though, how multiple people around Admin seem to have to convince him that they are right, instead of him perhaps leading with an assumption that they may be, or even that they don’t need to be for their input to be valuable.

When the authors analyze this episode they talk a lot about the different communication mediums. As I mentioned above, I think that’s relevant too. I also think that something important is being overlooked here.

Communication mediums play a part based on their inherent nature (synchronous or asynchronous, carrying tone or not), but they may also be used, for better or worse, to make up for a skill or experience gap.

Admin, obviously frustrated by the exchange, called Colle on the phone [2:03:45]:

Admin: What are you talking about? 7236?
Colle: Yeah?
Admin: We thought that it came in on 7137 and went back on 7236, but we were wrong, that 7236 is like an HTTPS listener port or something?
Colle: It will still come in on 7135 to talk to maestro server apparently...
Admin: right?
Colle: What's happening is it's actually trying to make a request back, um, through the 72... well actually trying to make it back through the 7137 to the instance...
Colle: ..and it's not happening.
Admin: I know. I know that. But I can't tell it to...
Colle: .. just create it with the 7236. Trust me.
Admin: Why? That port's not, that's going the wrong, that's only one way too.
Colle: Trust me.
Admin: It’s only one way. Do you understand what I am saying?
Colle: Cause it's the maestro talking back to the player server instance.
Admin: Yeah, but how does the player instance talk to maestro to make some kind of request?
Colle: 7135 is the standard port it uses in all cases. So we had it wrong. Our assumption on how it works was incorrect.
Admin: All right, all right.
Colle: If it doesn't work you can beat me up after
Admin: I want to right now. (Laughter on both sides).

All that aside, there are some other important things that we can get from this episode. As mentioned above, this episode is different from the others in that Colle doesn’t need all the information to pass through Admin. Colle can examine the system state directly and at one point even remarks on this, saying “all I know is what I see in the conf file”.

This is another example of the translation as well as transformation that is occurring throughout. There’s a level of translation we saw earlier, which may be substituting some number for what I think “internal port” means, or the reverse, but there are also both humans and machines transforming artifacts like the conf file. The conf file becomes an artifact used to communicate, as each of them can reference it.

This means that unlike before, Admin and Colle can have different views into the system state and different views into the problem.

We can see this again as an attempt to start a joint activity, where Colle wants Admin to run his commands, confirm his testing, and ultimately make the changes suggested. As before though, at least at first, Admin declined to take up the activity.

The authors tell us that Colle mentioning that they had been misunderstanding all along, and also joking, was not actually about establishing common ground around the state of the system. Instead it was to establish a different joint activity that would be successful in getting Admin to follow his directions.

Colle found that rather than debugging Admin’s knowledge of the system state (repeatedly explaining what the port settings should be), he had to debug Admin’s model of the system itself (explicitly stating “our assumption is wrong” about the direction of the ports).

This helps us see that the discussion can be thought of in multiple layers: discussing the system state itself, or discussing the understanding of the system state.

A thing I don’t really like here is that Admin has to be convinced so much by Colle. Just from this event it seems like a bit of a pattern. Why should everyone have to cajole him into trying something, especially when he comes to them for help? It’s not like he has an answer. Colle even showed the commands and testing that he did, so he even provided evidence!

This sort of situation is an example of how practicing certain ways of working or communication or practicing with the medium could be useful.

Takeaways

There are multiple levels of translation and transformation occurring.
It’s possible to analyze things that aren’t incidents, learning from normal work is important.
The medium used in communication can affect its nature (e.g. phone vs text vs in person).
Some communication is performed through system representations.
Who has access to what part of a system and how will affect how they view and solve problems.
There are multiple levels of communication as well, including communication about the state of a system as well as communication about the current understanding of the state of the system.
Joint activity can be proposed and refused (perhaps without even realizing it).
