Dear KV,
We have been working with a third-party vendor that supplies a critical component of one of our systems. Because of supply-chain issues, they are trying to "upgrade" us to a newer version of this component, and they say it is a drop-in replacement for the old one. They keep saying this component should be seen as a black box, but in our testing, we found many differences between the original and the updated part. These are not just simple bugs but significant technology changes that underlie the system. It would be nice to treat this component as a drop-in replacement and not worry about this, but what I have seen thus far does not inspire confidence. I do see their point that the API is the same, but I somehow do not think this is sufficient. When is a component truly drop-in and when should I be more paranoid?
Dropped In and Out
Dear Dropped,
Your letter brings up two thoughts: one about recent events and one about the eternal question, "When should a black box be transparent?" While we all know the pandemic has caused incredible amounts of death and destruction to the planet, and the past two years have brought unprecedented attention to the formerly very boring area of supply chains, the sun comes up and the world still spins, which is to say the world has not ended, yet. Supply-chain issues are both real and the world's latest excuse for everything. It is as if children were telling their teachers, "The supply chain ate my homework."
At this point, KV is quite skeptical when a vendor's first excuse is supply-chain issues. Of course, that skepticism will not help unless you have a second supplier for whatever you are buying, which you can use to bludgeon your errant vendor.
Another eternal question, "When is a replacement not a replacement?" is one that will plague us in technology forever. The number of people who believe they can treat whatever they are providing as an opaque box with a fixed API is, unfortunately, legion. This belief comes from the physical world, in which a box is a box, and a brick is a brick, and why would you care if your brick is made from a different material anyway?
Here you see the problem: The metaphor breaks down in the physical world as quickly as it would in the realm of software and hardware. Two bricks may both be red, and therefore present an identical look and feel to the external user, but if they are made of different materials, then they have different qualities—for example, in strength, but let's also consider something less obvious, such as their weight. The number of bricks that can be stacked on top of each other to build a wall depends on their weight, as well as their strength. If you use heavy but weak bricks, well, you can imagine how this goes, and if you cannot, try it—just do not tell your health-insurance plan KV suggested this. And let's say you do not build the wall out of weak and heavy bricks, but years later you replace some damaged bricks with newer, heavier, and weaker bricks. The key here is you would not want to stand near that wall.
A topic KV keeps coming back to is the malleability of software. I keep returning to this because it is this malleability that often results in the catastrophic failures of software and systems engineering. You mentioned you saw timing problems with the new component. I can imagine few situations more treacherous than a change in the timing of a critical component. Timing bugs are already some of the most difficult to track down and fix, and if the timing is off in a critical component, that is likely to affect the system, so good luck debugging that. Those who wish to stand on the "API as a contract" quicksand are welcome to do so, but I am not willing to throw them a rope.
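To make the timing worry concrete, here is a minimal Python sketch of how one might compare call latency between the old and new parts before trusting the swap. Everything in it is hypothetical: `latency_profile`, `within_budget`, the 10 percent tolerance, and the idea that the component can be exercised as a plain callable are assumptions for illustration, not anything supplied by the vendor or the letter.

```python
import statistics
import time


def latency_profile(component, inputs, repeats=50, clock=time.perf_counter):
    """Return (median, worst) per-call latency for `component` over `inputs`.

    `component` is any callable standing in for the part under test
    (an assumption of this sketch). `clock` is injectable so the
    measurement logic can be checked without real time passing.
    """
    samples = []
    for _ in range(repeats):
        for value in inputs:
            start = clock()
            component(value)
            samples.append(clock() - start)
    return statistics.median(samples), max(samples)


def within_budget(old_profile, new_profile, tolerance=0.10):
    """True if the new part's median and worst-case latency both stay
    within `tolerance` (a made-up 10 percent default) of the old part's."""
    old_median, old_worst = old_profile
    new_median, new_worst = new_profile
    return (new_median <= old_median * (1 + tolerance)
            and new_worst <= old_worst * (1 + tolerance))
```

Wall-clock measurements like these are noisy, so a pass is weak evidence at best, while a failure is a good reason to go back to the vendor with harder questions. The worst case is tracked alongside the median because a critical component that is usually fast but occasionally slow is exactly the kind of change an unchanged API will never reveal.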
The correct answer in these cases is to ask the vendor for as much information as possible to reduce the risk in accepting this so-called replacement. First, ask for the test plans and test output so you can understand whether they tested the component in a way that relates to your use case. Just because they tested the thing does not mean they tested all the parts your product cares about. In fact, it is unlikely they did. They may have tested just the parts that connect back to the API, rather than the edge cases that would come up when a component is changed in your system.
Second, ask for a complete readout of the differences between the old and new parts. For hardware, this means the underlying technology (for example, the old part was 90nm and the new one is 45nm), and any voltage changes, as well as the internals. I have seen replacement parts that put whole CPU cores into what were once fixed-function pieces of digital electronics, which is utterly insane, but someone, somewhere, is getting praised for adding "flexibility" to the product rather than being admonished for increasing risk.
Lastly, ensure you have a second supplier for any component you deem critical. This ought to go without saying, but the fact that I am saying it tells you how often it has been an issue: I have seen a lot of people whose product was completely destroyed by an upgrade.
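The first request, testing against your own use case, can be sketched as a differential test: feed the old and new parts the same inputs, including the edge cases your product actually hits, and list every disagreement. What follows is a minimal Python sketch under stated assumptions. `behavioral_diffs`, `old_scale`, and `new_scale` are invented stand-ins, and the saturate-versus-wrap difference is an illustrative example rather than anything from the letter.

```python
def behavioral_diffs(old, new, cases):
    """Run both implementations on every case and return the disagreements.

    Each entry is (case, old_outcome, new_outcome). An outcome records
    either the returned value or the name of the exception raised, so a
    change in failure behavior counts as a difference too.
    """
    def outcome(fn, case):
        try:
            return ("ok", fn(case))
        except Exception as exc:
            return ("raised", type(exc).__name__)

    diffs = []
    for case in cases:
        before, after = outcome(old, case), outcome(new, case)
        if before != after:
            diffs.append((case, before, after))
    return diffs


# Hypothetical components with an identical API: double an 8-bit value.
def old_scale(x):
    return min(x * 2, 255)   # old part saturates at 255


def new_scale(x):
    return (x * 2) & 0xFF    # new part silently wraps around
```

In this made-up case the two parts agree on every input from 0 to 127, so a vendor test plan that covers only the nominal range would pass cleanly. Only when the case list includes the values your system actually produces, such as 128 and above, does the difference show up.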
Oh, and you did ask when to be paranoid. I mean, clearly the answer is, always.
KV
The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.