I stumbled across this analysis of the Linux Kernel which brought back “fond” memories of my market opportunity forecasting days.

In the analysis, the author, kripken, estimates that “at most, 60% of the Linux Kernel is GPLv2 code”. Read his methodology here, but I’ll summarize.

He wrote a program that scanned the license statements found at the beginning of source code files. The program then matched the license text against patterns to classify each file as GPLv2 and above, GPLv2 only, GPL version unspecified, or Other. The program tallied the size of each file in bytes, rather than counting files or lines, under each license. The results:
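To make the approach concrete, here is a minimal Python sketch of that kind of scan. It is not kripken's program: the regular expressions, the category rules, the 4096-byte header window and the names `classify` and `tally_bytes` are all assumptions made for illustration.

```python
import os
import re
from collections import Counter

# Hypothetical patterns, chosen for illustration only; the original
# analysis used its own (more thorough) set of patterns.
GPL_MENTION = re.compile(r"GNU General Public License|\bGPL\b", re.IGNORECASE)
OR_LATER = re.compile(r"any later version", re.IGNORECASE)
VERSION_2 = re.compile(r"version 2\b|GPL ?v?2\b", re.IGNORECASE)


def classify(header):
    """Place a file header into one of the four license buckets."""
    # Collapse whitespace and comment decoration so that a phrase split
    # across comment lines ("any later\n * version") still matches.
    text = re.sub(r"[\s*/#]+", " ", header)
    if not GPL_MENTION.search(text):
        return "Other"
    if OR_LATER.search(text):
        return "GPL 2 or above"
    if VERSION_2.search(text):
        return "GPL 2 only"
    return "GPL, version unspecified"


def tally_bytes(root, header_bytes=4096):
    """Sum file sizes (in bytes) per license bucket under root."""
    totals = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.isfile(path):
                continue  # skip broken symlinks and special files
            with open(path, "rb") as handle:
                header = handle.read(header_bytes).decode("utf-8", errors="replace")
            totals[classify(header)] += os.path.getsize(path)
    return totals
```

Calling `tally_bytes` on a source tree returns a byte count per bucket. A real run would need far more careful patterns than these, which is presumably where the hours went.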

License              | # Bytes     | % Bytes
GPL 2 or above       | 60,637,907  | 39%
GPL 2 only           | 32,215,150  | 21%
GPL, Ver unspecified | 19,773,264  | 13%
Other                | 43,762,840  | 28%
All Combined         | 156,389,161 | 100%

In a follow-up post, kripken compares his results with a much less thorough analysis that Linus did using:

[torvalds@g5 linux]$ git-ls-files '*.c' | wc -l
Result= 7978
[torvalds@g5 linux]$ git grep -l "any later version" '*.c' | wc -l
Result= 2720

Comparing the two, we see that Linus’s count puts 34% (2720 / 7978) of the kernel’s .c files at “GPL 2 or above”, while kripken estimated 39% of the kernel’s bytes. As kripken says himself, the two analyses point to a broadly similar result, but his took several hours, and Linus needed about 10 seconds.
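The arithmetic behind that comparison is easy to check. The short Python sketch below uses only the figures quoted above; the variable names are mine. Keep in mind that the two shares are not measured the same way: one counts bytes across all scanned files, the other counts .c files.

```python
# Byte totals per license, as reported in kripken's results.
bytes_by_license = {
    "GPL 2 or above": 60_637_907,
    "GPL 2 only": 32_215_150,
    "GPL, Ver unspecified": 19_773_264,
    "Other": 43_762_840,
}
total_bytes = sum(bytes_by_license.values())

# kripken: share of bytes licensed "GPL 2 or above".
kripken_share = bytes_by_license["GPL 2 or above"] / total_bytes

# Linus: share of .c files containing the phrase "any later version".
linus_share = 2720 / 7978

print(f"kripken (bytes): {kripken_share:.0%}")
print(f"Linus (files):   {linus_share:.0%}")
print(f"gap: {100 * (kripken_share - linus_share):.1f} percentage points")
```

The gap comes out at under five percentage points, which is the kind of difference that rarely changes a decision.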

So what did we learn?
I’m all for using “perfect” data and analysis to make decisions. But sometimes, actually most of the time, perfect data isn’t available. This can call into question any analysis that relies on the imperfect data. In my days of forecasting, I’d often explain to colleagues and execs that the right data wasn’t available, so here are the assumptions I’m making and their impact on the final results. Some would quickly “get it” and make a decision based on “the best data and analysis available within the timeframe at hand”. Others couldn’t get over the hurdle of using imperfect data to make decisions, and would attempt to find “the missing data”.

I remember discussing this with a manager at the time. He said something like:

“You’ll find that there’s very little you can tell a really good executive that he/she doesn’t already know or have a gut feeling for. These people probably got to where they are because they are able to combine disparate sources of imperfect data (e.g. a customer call, a conference pitch, talking with their friends, kids, neighbors, etc.) to spot trends before the rest of us can. As a result, they’re much more likely to accept analysis based on imperfect data. They’re more worried about acting on the best analysis available than about deliberating so long that the opportunity has passed.”

That’s one thing open source developers, projects and vendors seem to do really well: spot trends and make decisions without “all the data in the world”. This could be because they’re closer to the user, and because open source communities foster two-way dialogue between creators and users. Come to think of it, maybe open source actually allows for “better data” collection?