Kode Vicious

Debugging on Live Systems

By George V. Neville-Neil
Communications of the ACM, December 2011, Vol. 54 No. 12, Pages 32-33
10.1145/2043174.2043186

Dear KV

I have been trying to debug a problem on a system at work, but the control freaks that run our production systems don't want to give me access to the systems on which the bug always occurs. I have not been able to reproduce the problem in the test environment on my desktop, but every day the bug happens on several production systems. I am at the point of thinking about getting a key logger so I can steal the passwords necessary to get onto the production systems and finally see the problem "in the wild." I have never worked for such a bunch of fascists in my entire career.

Locked Down and Out

Dear Locked

First of all, while most companies are inherently nondemocratic, few of them are fascist. Fascism went out of style sometime around 1945 and really hasn't made a comeback since. Secondly, I do sympathize: no one should be prevented from fixing a bug simply because of lack of access to the appropriate systems.

What many programmers and technical people fail to comprehend is that, as a colleague recently put it, "access implies responsibility." This is why the sudo program has the warning, stolen from the Spider-Man comics: "With great power comes great responsibility."

Debugging a program or a system can, and often does, have negative side effects, either by slowing down the system or changing the results of some calculation in an unintended fashion. The people who run your production systems are right to be wary of letting any random programmer loose in their domain. If you break something, it is likely to come down on their heads, and they will have to fix it while you stand there glumly repeating, "Well, it wasn't supposed to do that!"

Your best bet is to try setting up a production system outside of the production environment first, as a test machine. I am surprised by how many companies work without such staging machines, going directly from the developers' desktops to their production environments. If the bug won't happen without real workloads, then it is time to get a machine in the production environment sufficiently isolated so that it can be given a workload without destroying the machines that are doing productive work.

By now you might have noticed that this advice is less technical and more about social engineering. Programmers must be willing to work with the people who have to keep systems up 24 hours a day, 7 days a week, if they want to be trusted enough to be able to debug live or near-live systems.

Two final thoughts: using a keyboard logger is not a way to gain trust, and telling someone in a public column that you're thinking about it is as dumb as tweeting your murder plans.

Dear KV

A program I have just been handed at work keeps crashing, and each time I look at it in the debugger and examine various bits of memory I see the pattern 0xdeadc0de in different parts of allocated memory. Is this a joke? Do you think that my coworkers are hazing me?

0xDead Tired of this Code

Dear 0xDead

It is common practice for programmers to set memory to an easily recognizable value when they are trying to debug memory-smash bugs. You might think they would clear all the bytes in the buffer to be 0x00, but that does not help if some piece of code is writing NULL bytes all over your buffers. Using a known pattern such as 0xdeadc0de makes it easier to find these problems in a debugger. As you have seen, you print a buffer and you see the pattern. If instead you saw, say, 0xde00c0de, you would know that someone had written a NULL byte in the middle of your memory. Maybe you wanted that, maybe you didn't, but now, at least, you can clearly see it. For extra cleverness points you can set a watchpoint (if it is supported by your hardware) that stops the program if some variable or part of memory does not equal 0xdeadc0de. I tend to set buffers I am debugging to be all 0x69, because if I see that number, then I know it is my own personal bit of work.
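The fill-and-check idea can be sketched in a few lines. This is only an illustration, assuming a Python bytearray as a stand-in for a C buffer; the names poison and find_damage are hypothetical helpers made up for this example, not part of any debugger or allocator.

```python
import struct

PATTERN = 0xDEADC0DE          # fill value discussed in the column
WORD = struct.Struct(">I")    # 4-byte word, big-endian so a dump reads de ad c0 de


def poison(buf: bytearray) -> None:
    """Fill buf with the pattern; its length must be a multiple of 4."""
    if len(buf) % WORD.size:
        raise ValueError("buffer length must be a multiple of 4")
    for off in range(0, len(buf), WORD.size):
        WORD.pack_into(buf, off, PATTERN)


def find_damage(buf: bytearray) -> list:
    """Return (offset, value) for every word that no longer holds the pattern."""
    damaged = []
    for off in range(0, len(buf), WORD.size):
        (value,) = WORD.unpack_from(buf, off)
        if value != PATTERN:
            damaged.append((off, value))
    return damaged


if __name__ == "__main__":
    buf = bytearray(16)
    poison(buf)
    buf[5] = 0x00  # simulate a stray zero-byte write into the second word
    for off, value in find_damage(buf):
        print(f"offset {off}: 0x{value:08x}")  # offset 4: 0xde00c0de
```

A buffer that had been cleared to zero would show nothing unusual after the same stray write, which is the whole argument for using a recognizable pattern.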

For programmers who deal with network packets, a known pattern has another advantage. Most people write code on systems based on the Intel x86 architecture, which is known in network parlance as a little-endian system. A little-endian system stores the most significant byte of a multibyte word last. Network protocols are big endian, which is the opposite of how x86 processors store data in memory. All network programmers know the C macros htonl(), ntohl(), htons(), and ntohs(), which do the proper swapping of host-to-network endianness and back. A good way to debug a network protocol is to transmit data such as 0xdeadc0de in the packets and then make sure it does not look like 0xdec0adde when it arrives in your program's memory. Using this trick makes it easier to figure out where you might have left out a byte-swapping macro.
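To see where 0xdec0adde comes from, here is a small sketch, assuming Python's struct module in place of the C macros. The function names are hypothetical and exist only for this example; from_wire_missing_swap imitates a little-endian host that reads the four bytes straight off the wire without calling ntohl().

```python
import struct

MARKER = 0xDEADC0DE  # recognizable test value placed in the packet


def to_wire(value: int) -> bytes:
    """Pack a 32-bit value in network (big-endian) byte order."""
    return struct.pack("!I", value)


def from_wire_correct(data: bytes) -> int:
    """Unpack as network byte order, the equivalent of applying ntohl()."""
    return struct.unpack("!I", data)[0]


def from_wire_missing_swap(data: bytes) -> int:
    """Unpack as little-endian, as an x86 host would without ntohl()."""
    return struct.unpack("<I", data)[0]


if __name__ == "__main__":
    packet = to_wire(MARKER)
    print(packet.hex())                                 # deadc0de
    print(f"0x{from_wire_correct(packet):08x}")         # 0xdeadc0de
    print(f"0x{from_wire_missing_swap(packet):08x}")    # 0xdec0adde
```

The explicit "!" and "<" format prefixes make the result the same on any host, so the sketch shows the mistake even when run on a big-endian machine.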

So, much as I would like to think your coworkers are hazing you, it is far more likely they are trying to be helpful.

Author

George V. Neville-Neil ([email protected]) is the proprietor of Neville-Neil Consulting and a member of the ACM Queue editorial board. He works on networking and operating systems code for fun and profit, teaches courses on various programming-related subjects, and encourages your comments, quips, and code snips pertaining to his Communications column.
