Sunday, July 29, 2007

Teaching Microsoft how to get the facts

"There are three kinds of lies: lies, damned lies and statistics."

Well, not much has changed since Mark Twain spoke these wise words. The recent survey by Associate Professor Alan D. MacCormack, best known for his article "Intellectual Property, Architecture, and the Management of Technological Transitions: Evidence from Microsoft Corporation", proves this once again. True, the very first page discloses Alan's association with the software giant by clearly stating that it concerns "a study funded by the Microsoft Corporation". Usually that is enough for us Linux zealots to write it off and concern ourselves with more important things. Still, eWeek found it important enough to mention. Knowing a thing or two about statistics and how to manipulate them, I thought I'd see how well the professor does.

In short, it is garbage. There are three basic flaws:
  • The sample taken is far too small to be representative;
  • The selection is flawed;
  • The interpretation of the questions and the responses is done by the researchers.

The sample is too small
At the time of writing, there were 42,909 projects listed on Freshmeat, of which about 72.5 percent carried a GPL or LGPL license. That is about 31,102 projects. For the sake of argument, let's assume they are all one-man projects, although that is certainly not true: a study from the Haas School of Business shows that 121 maintainers and 2,605 developers were working on the Linux kernel in 2000. For your information, that is kernel version 2.2.x.

So on the one hand we have 31,102 open source developers, and on the other 2,726 Linux kernel developers. What would you consider a reasonably representative sample? Ten percent? Less? One percent, which boils down to 311 open source developers and 27 Linux developers? No. According to our Harvard professor, only 34 open source developers (of which 7 are Linux developers) are required. Any high school kid will tell you that you can draw no significant conclusions from such a small sample.

If you don't believe me, I will allow Alan to speak for himself: "Based on the selection criteria for the developers, and the semi-structured approach we felt that the 34 interviews was more than sufficient to conduct exploratory research." and a bit further on: "The semi-structured approach we felt that the 34 interviews was more than sufficient to conduct exploratory research to identify the predominant developer opinions on the most critical issues."

Dear Alan, how many times do I have to explain to you that it doesn't matter how you feel about things, but how well you can prove them? That has been a sound scientific principle for the last two thousand years. You'll never make full professor that way!

And to prove that I know what I'm talking about, here is the proper way to determine a reasonable sample size. First you need three parameters:
  • Population size
  • Confidence level
  • Confidence interval

The confidence interval is the plus-or-minus figure usually reported in newspaper or television opinion poll results. For example, if you use a confidence interval of 5 and 47 percent of your sample picks an answer, you can be "sure" that if you had asked the question of the entire relevant population, between 42% (47-5) and 52% (47+5) would have picked that answer. The wider the confidence interval you are willing to accept, the more certain you can be that the whole population's answers would fall within that range.

The confidence level tells you how sure you can be. It is expressed as a percentage and represents how often the true percentage of the population who would pick an answer lies within the confidence interval. The 95% confidence level means you can be 95% certain; the 99% confidence level means you can be 99% certain. Most researchers use the 95% confidence level.

So, using a 95% confidence level and a confidence interval of 5, we need to interview 379 developers, not 34. See, Alan, proving your point isn't that hard.
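If you want to check that figure, here is a minimal sketch of the calculation in Python. It uses the standard sample size formula (Cochran's formula with the finite population correction); the function name and the conservative p = 0.5 assumption are mine, not anything from MacCormack's study:

    def sample_size(population, interval, z=1.96, p=0.5):
        """Required sample size for a given population and confidence
        interval (in percentage points), using Cochran's formula with
        the finite population correction. z = 1.96 corresponds to a 95%
        confidence level; p = 0.5 is the most conservative assumption
        about how the answers are distributed."""
        e = interval / 100.0                  # margin of error as a fraction
        n0 = z ** 2 * p * (1 - p) / e ** 2    # infinite-population estimate: 384.16
        return int(round(n0 / (1 + (n0 - 1) / population)))

    print(sample_size(31102, 5))  # 379 open source developers
    print(sample_size(2726, 5))   # 337 Linux kernel developers, for comparison

Note that even for the 2,726 kernel developers alone, a defensible sample is an order of magnitude larger than the 34 interviews the study settled for.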

The selection is flawed

Okay, maybe there were too many open source developers who didn't want to participate in this survey. Yell 'Microsoft sponsored' and most of us Linux bigots will say 'thank you' and hang up. So how many developers were actually contacted? I'd say that through the various mailing lists it shouldn't be a problem to contact at least 10,000 developers. No. Only 354 were contacted, and less than 10% responded. The others either used the 'bounce' option in KMail or flatly declined.

Now the real fun begins. Alan quotes Lakhani and Wolf, who did a structured, quantitative study of 684 developers from 287 distinct projects in 2005. They divided the community into four distinct clusters:

"In the first cluster, developers were most commonly motivated to contribute to open source because of a work need or because they were paid to contribute. All developers in the second cluster were motivated by non-work needs. Developers in the third cluster were most commonly motivated by intellectual stimulation or a desire to improve their skills. Finally, developers in the fourth cluster were most commonly motivated by a belief that they were obligated to give back in return for having used open source code or a belief that code should be free."


This study uses that classification, although the researchers found it appropriate to merge clusters two and three. Why?

"In this research, we combined clusters two and three because they have the same set of top motivations that are distinct from clusters one and four, and they both have intellectual stimulation as the second highest motivation."


Pardon me? First of all, I object to the classification itself. When I start programming I usually have an itch to scratch, either for work or for pleasure. When the project really begins to interest me, I start to explore uncharted ground, just for fun. And yes, since I've been using Linux and associated programs I feel obliged to give something back in return. That puts me in all four clusters. Second, when I have a problem to solve (say, changing the copyright message in all my private sources – wow, that's intellectually motivating!) my motivation puts me in cluster two. When I'm fiddling around with a tiny multitasking environment, I've suddenly shifted to cluster three. Since my motivations are quite different in those two cases, I see no reason to merge the clusters.

All four clusters are roughly the same size, each holding about a quarter of the developers. Depending on their answers, developers are assigned to one of these clusters:

"We assigned developers to one of three groups based on their response."


The distribution of these clusters is as follows:

  Cluster   % MacCormack   % Lakhani
  1             56             25
  2+3           24             56
  4             21             19

This shows that either Lakhani's research was flawed (as far as the distribution of clusters is concerned) or MacCormack's sample is not random. Using the "de Hond method" (reweighting the sample so its distribution matches the known population), this could be corrected, but MacCormack fails to do so.
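To make that correction concrete, here is a hypothetical Python sketch, assuming the "de Hond method" amounts to post-stratification weighting; the cluster shares come from the table above, and all names are mine:

    # Post-stratification: weight each respondent so that the sample's
    # cluster distribution matches the known population distribution.
    sample = {"1": 0.56, "2+3": 0.24, "4": 0.21}      # MacCormack's shares
    population = {"1": 0.25, "2+3": 0.56, "4": 0.19}  # Lakhani's shares

    weights = {c: population[c] / sample[c] for c in sample}
    for cluster in sorted(weights):
        print("cluster %s: weight %.2f" % (cluster, weights[cluster]))
    # cluster 1:   0.45 (overrepresented, scaled down)
    # cluster 2+3: 2.33 (underrepresented, scaled up)
    # cluster 4:   0.90

Of course, with only 34 respondents even a perfect reweighting cannot repair the sample size problem discussed above.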

Finally, Alan even admits that he made an atypical selection:

"We targeted developers for our research based on two criteria: the projects to which they had contributed, and their role on those projects."


Even if this was intentional, by doing so Alan severely limits the applicability and validity of his research. Moreover, criteria for the importance of the role of those interviewed are lacking: Table B just lists them as "developers", which could be anyone, even the guy who merely wrote the print routine. Remember those 2,726 Linux developers?

The interpretation is done by the researchers

This really makes me shiver. I don't think much of opinions to begin with. Opinions are like noses: everybody has one! I don't want a highway trooper to stop me just because he thought I was speeding; any judge will require at least some kind of measurement. MacCormack doesn't think that is necessary:

"Given the complexity of licensing implications, we felt the topic was not well suited for a structured / quantitative survey. Instead, we used a semi-structured document to facilitate discussion and conduct exploratory research to identify developers’ opinions on open source and proprietary software licensing issues."


In layman's terms this means "say what you think and we'll tell you what you mean":

"From the responses, we used an inductive approach to synthesize the developers’ responses into key themes. After defining these themes, we looked across responses to identify indicative phrases and responses of a pro or con position on each theme. We then compared each developer’s statements against these indicators to classify each developer as either pro or con on that theme. If a developer provided statements that were mixed (i.e. matched both pro and con indicators for a theme), we examined their responses to related questions. We used the broader context to assign them as pro or con on the theme."


MacCormack also fails to state why licensing implications are more complex than the war in Iraq, the greenhouse effect, or saving the whales. So what does MacCormack do? He feels, again. I have never seen such a sensitive professor. By the way, Lakhani had no problem using a structured, quantitative approach to survey his topic. Open source is obviously not as complex an issue as closed source.

I won't go so far as to say that MacCormack has deliberately manipulated his findings, but in any case his research is seriously flawed, IMHO. I lecture at colleges and universities too, and if a student of mine handed in a survey like this, he wouldn't make the grade, believe me. I advise Microsoft to buy its surveys somewhere else in the future.
