Subject: Failure, science and software testing
From: npdoty@ischool.berkeley.edu
Date: 1/07/2010 06:44:00 PM To: Brian (Microsoft), Vignesh (Microsoft), Mubarak (Microsoft), Tracy (Microsoft), Jolie (Microsoft), Ben Cohen Bcc: https://bcc.npdoty.name/

Have you guys read this Wired article on failure and science?

I thought it was really reminiscent of the constant failures we run into as programmers, and the particular challenge of being a software tester.

The main premise is that scientists get so comfortable with accepted theory and the status quo that they don't recognize that failures might be breakthroughs instead of just mistakes in their own equipment or experimental method. There's certainly some value there -- the author gives the example of the static in a sensitive radio telescope finally being accepted as cosmic background radiation rather than a problem with the dish, and he cites some serious ethnographic research on scientists and how they make discoveries. But the article makes it sound like the solution is simply to be skeptical, to assume that every unexpected experimental result is a potential new discovery.

But anyone who's taken high school physics knows that this assumption, that it must be your fault and not the theory's, isn't just some elitist fallacy; it's born of experience. Of the hundred times that the results of your high school physics experiment didn't match what theory predicts, how many times was it because the theory was wrong? Zero, of course; it's always a screw-up in your experimental set-up (at least it was in high school, and I bet the percentage doesn't change that much once you're a professional).

As programmers, we learn this lesson even more often. The first rule of programming, after all, is that it's always your fault. This isn't dogma; every one of us has learned it from the same quintessential, endlessly repeated experience: we write a piece of code, it doesn't work, and we assume it must be a problem with the operating system or the compiler or the other guy's code -- that the computer simply isn't doing what we told it to do -- until we actually look at our own code and the documentation and discover the mundane truth: we'd just made another stupid mistake.

And that's the real trick of software testing: it can be tempting, particularly at first, to file a bug every time something doesn't work. Young, confident software testers go to their dev several times a day saying "I found a bug," only to realize that they hadn't called the function with the correct parameters. But this lesson of experience quickly leads to the opposite problem: having become so accustomed to being the cause of our own problems (like any programmer), we just fidget until we get the software to work, unconsciously working around bugs that we should be filing.

So I think the real answer, both to the scientific problem and the software-testing one, isn't mere undying skepticism, but knowing which failures are probably your fault and which ones aren't. And a lot of the techniques that experimental scientists and software testers use are the same: the first step for both is reproducing the failure. Lehrer's article also suggests talking to someone who isn't intimately familiar with the experiment, and I think we software testers often come to understand an unexpected result only when we try to explain the bizarre situation to a tester from another team. "Encourage diversity" is also on his list, and I think the Test Apprentice Program at Microsoft was a darn good example of that in action -- being the only non-CS majors on our teams, we often found different bugs.

Maybe experimental science could even learn something from software testers. I thought one of the more valuable things we got from learning test-driven development was that a test wasn't good unless you'd seen it fail. If you've only ever seen a test pass, then how do you know that it really tests what you claim it tests? That must be harder for physicists (they can't briefly turn off a particular universal parameter to ensure that the experiment fails under those conditions), but the same sort of counterfactual thinking (rather than just writing a test and being happy when it turns green, or running an experiment and assuming that the result confirms the theory) seems important to me.
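To make that concrete, here's a minimal sketch in Python; every name in it is my own invention. Before trusting a test, run it once against a version of the code you know is wrong and make sure it actually fails:

    # see_it_fail.py -- a made-up example of "a test isn't good until you've seen it fail"

    def title_case(s):
        """The implementation we actually want to trust."""
        return " ".join(word.capitalize() for word in s.split(" "))

    def title_case_broken(s):
        """A deliberately wrong version, kept only to prove the test can fail."""
        return s

    def check_title_case(fn):
        assert fn("hello world") == "Hello World"

    if __name__ == "__main__":
        check_title_case(title_case)             # expect: passes silently
        try:
            check_title_case(title_case_broken)  # expect: raises AssertionError
        except AssertionError:
            print("good: the test really does fail when the code is wrong")
        else:
            print("bad: the test passed against broken code, so it proves nothing")

Physicists can't swap in a deliberately broken universe, but we can, and it's cheap insurance.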

Do we get a lot of good software testers from experimental science backgrounds? Maybe that's where we should be hiring from. Anyway, I highly recommend the Wired article, if only for the comfort that programmers aren't alone in the universe in having their experiments fail constantly.

Hope you're all doing well -- grad school is great, but, as you can see, I still miss software testing from time to time,
Nick


Subject: Institutional bugs
From: npdoty@gmail.com
Date: 2/02/2009 06:29:00 PM To: Brian (Microsoft), Jon (Google), Bob (Microsoft), Vignesh (Microsoft), Nick (Google), Mubarak (Microsoft), Tracy (Microsoft), Jolie (Microsoft) Bcc: https://bcc.npdoty.name/

Dear Google and Microsoft friends,

It's pretty exciting to see software testing come so prominently into the news twice in such a short time frame. I know that none of you can share any of the internal discussion you've heard on these topics, but I sure would have enjoyed watching the threads these events sparked. Are these issues getting talked about a lot outside of the groups immediately impacted?

Really, I've been able to see quite a bit just looking in from the outside. It's pretty neat to see the actual source code of the Zune leap year bug and hear about the exact wildcard problem in this weekend's Google badware bug -- it makes me feel like I'm not so far away from the industry after all. (Which isn't to say there isn't some advantage to knowing people on the inside: it was fun, when I was at Microsoft last month, to hear about how our friend on the Zune test team got a call at 7 AM, on a day when most people weren't expected at work, telling him he needed to be in the office immediately. That must have been a pretty intense day. ;)

I've heard conjecture (fueled by the short-lived rumor that StopBadware was somehow responsible rather than Google itself) that the mistake happened because Google got an updated list from StopBadware and just checked it in verbatim, rather than Google mistakenly adding the wildcard itself.

And it's similar to the discussion I saw around the Zune leap year issue. Speculation raged about how a Microsoft developer could make such a mistake, or how the Zune test team could miss it. Then, when it came out that it was actually a bug in Freescale Semiconductor's code, suddenly it made sense to everyone: only the Zune 30 had the problem, and the newer Zunes don't, because they no longer rely on that third-party vendor's code. And more significantly, it wasn't that Microsoft developed code with such a glaring hole, or that Google deployed a file with such an obvious error. It's as if we're comforted by thinking that Google and Microsoft weren't the responsible entities; that at least fits with our understanding of these software companies.

But neither of those explanations helps the Google customer or the Zune customer, nor should they be any solace to them. Microsoft and Google are just as responsible for code they ship that was originally written outside the company. And really, if anything, it's an opportunity for a Microsoft SDET and a Google QA engineer to get a promotion.

Sure, whatever Google engineer checked in the file should be getting a talking-to: wouldn't a single manual test have caught the issue? When you're making a change to code that'll be run as part of every Google search, shouldn't you at least have tested it once yourself? But it's much more an issue of why there wasn't an automated check-in test that prevented the change from going in at all. A single negative automated test case would have caught this, and relying on all your individual engineers never to make mistakes like this is foolish.
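I'm imagining something as simple as the following, sketched in Python. To be clear, the matcher, the file name, and the list of known-good sites are all my inventions; I have no idea what Google's actual check-in system looks like.

    # check_badware_list.py -- a toy pre-check-in gate for the blocklist
    from urllib.parse import urlparse

    KNOWN_GOOD = ["http://www.google.com/", "http://en.wikipedia.org/wiki/Main_Page"]

    def is_flagged(url, patterns):
        """Toy stand-in for the real matcher: a pattern flags a URL if it is a
        prefix of the host+path or of the path alone, so a bare "/" flags
        every URL, which is exactly the failure mode to guard against."""
        parsed = urlparse(url)
        return any(
            (parsed.netloc + parsed.path).startswith(p) or parsed.path.startswith(p)
            for p in patterns
        )

    def test_known_good_sites_not_flagged(patterns):
        # The negative test: no entry in the new list may flag a known-good site.
        for url in KNOWN_GOOD:
            assert not is_flagged(url, patterns), url + " would be blocked by this list"

    if __name__ == "__main__":
        with open("badware_urls.txt") as f:  # invented file name
            patterns = [line.strip() for line in f if line.strip()]
        test_known_good_sites_not_flagged(patterns)
        print("ok: the new list doesn't flag any known-good site")

Run that automatically on every attempted check-in, fail the submission when the assertion fires, and no single engineer's slip (or partner's sloppy list) can take down every search result.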

Also, I happen to think that the Zune leap year bug should have been caught by a developer's unit tests: shouldn't a unit test for a piece of leap year code include a case for the last day of a leap year? But even though that developer was at Freescale, a Microsoft SDET could still make some significant improvements to his product by proposing a policy of code reviews for partner code. Collaborations are inevitable, and it would be worse for the company to have the already frustrating Not-Invented-Here syndrome institutionalized as an official company practice under the name of quality assurance. Test plans and code reviews are just as valuable for partner code as for code written internally.
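To be concrete about the unit test I have in mind, here's a rough Python reconstruction of the date-conversion loop as it was described in the write-ups I saw (the original is C in the Freescale driver; the names, the day-numbering convention, and the fix are my approximations), along with the boundary case that froze the Zune 30:

    ORIGIN_YEAR = 1980  # the Zune epoch, per the published write-ups

    def is_leap_year(year):
        return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

    def year_from_days(days):
        """Convert days since Jan 1, 1980 (counting that day as day 1) to a year.

        As reported, the original loop took neither branch when exactly 366 days
        remained in a leap year (Dec 31, 2008) and spun forever; breaking out of
        the loop in that case is one way to fix it.
        """
        year = ORIGIN_YEAR
        while days > 365:
            if is_leap_year(year):
                if days > 366:
                    days -= 366
                    year += 1
                else:
                    break  # day 366 of a leap year: Dec 31 of the current year
            else:
                days -= 365
                year += 1
        return year

    def test_end_of_leap_year():
        # 1980 through 2007 is 28 years containing 7 leap years: 10,227 days.
        # Dec 31, 2008 is therefore day 10,227 + 366 = 10,593.
        assert year_from_days(10593) == 2008
        assert year_from_days(10594) == 2009  # Jan 1, 2009

    if __name__ == "__main__":
        test_end_of_leap_year()
        print("ok")

With the buggy version of the loop, that test doesn't fail so much as hang forever, which is all the more reason it should have been run before the code shipped.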

Of course I know that none of you can speak for either company any more than any single person could ever represent such a huge group of people, practices and institutions. For that matter, I have faith that both the Zune team and the Google Search team have already come to these conclusions and implemented something along these lines. But I'm curious what your thoughts are, since you might be able to bring this idea up as a reminder in your group and in the next group over, and maybe we can all have a little more discussion about it. And that's exactly the point: we expect Google not to make mistakes like this because we expect such a powerful single entity to be so consistent. But Google isn't a single entity -- any one engineer will make mistakes and any one partner will be unreliable. Because Google the institution is so powerful, though, it can be as perfect as we expect, not by being a single infallible entity, but by putting practices in place -- like a culture of quality assurance and a system of unit and check-in testing. In both of these high-profile cases, the issues were institutional bugs, not just code defects.

Perhaps that's all obvious to you guys; to someone just looking back on the software business, it seemed important.

Anyway, hope you're doing well and that you're enjoying software development. Grad school is pretty great, but I do miss being more intimately involved.

Nick