September 2009


The Coverity Scan Open Source Report found a 16 percent improvement in the quality of open source projects actively participating in the scan.  This is exactly the type of data that open source vendors and proponents want in their back pockets.  But is it accurate? Hold your tomatoes and let me explain.

Coverity used the number of lines of code (LOC) scanned before a defect is found to evaluate code quality, and code quality improvements, in a normalized fashion.  There are, however, two approaches to doing this calculation.

The first approach is to calculate the total LOC scanned across all projects and divide by the total number of defects found across all projects.  The result is the average number of LOC before a defect is identified.  This one-step approach is best for talking about the code quality of the overall code base, across projects.  It does not, however, let us determine which projects are producing higher quality code than others, or track results on a project-by-project basis over time.

The second approach is a two-step process.  First, calculate the LOC per defect on a project-by-project basis.  Then average those results to arrive at the LOC per defect across all the projects scanned.  The first step of this approach is good for understanding which projects are producing higher quality code and for measuring a project’s progress on the quality front over time.  However, averaging the LOC per defect across projects in step two disregards the size of each project’s code base.

Before we move on, keep in mind that a higher LOC per defect number is preferable regardless of which approach we use.  For example, a result of 999 LOC per defect is better than one of 100 LOC per defect.

As I’m sure you’d expect, or I wouldn’t be blogging about this, the two approaches produce different results because of how a project’s code base size is weighted.  In approach one, projects with larger code bases carry more weight than projects with smaller code bases.  In approach two, every project counts equally, so smaller projects have an outsized influence relative to their code size.

Here’s an example:

Project A: 2,000,000 LOC & 2,400 defects
Project B: 20,000 LOC & 20 defects
Project C: 21,000 LOC & 15 defects

Using the first approach, the LOC per defect would be: 838

[2,000,000 + 20,000 + 21,000] / [2,400 + 20 + 15]
= [2,041,000] / [2,435]
= 838

Using the second approach, the LOC per defect would be: 1,078

([2,000,000 / 2,400] + [20,000 / 20] + [21,000 / 15]) / 3
= ([833.33] + [1,000.00] + [1,400.00]) / 3
= 1,078

Notice the impact of the much smaller projects B and C on the overall results.
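
To make the difference concrete, here’s a minimal Python sketch of both calculations using the example figures above. This is my own illustration of the two approaches, not code from the Coverity report:

# Each project is (lines of code scanned, defects found) -- the example figures above.
projects = [
    (2_000_000, 2_400),  # Project A
    (20_000, 20),        # Project B
    (21_000, 15),        # Project C
]

# Approach one: pool all code and all defects, then divide once.
# Projects with larger code bases dominate the result.
pooled = sum(loc for loc, _ in projects) / sum(defects for _, defects in projects)

# Approach two: compute LOC per defect project by project, then average the ratios.
# Every project counts equally, regardless of its size.
ratios = [loc / defects for loc, defects in projects]
averaged = sum(ratios) / len(ratios)

print(f"Approach one: {pooled:,.0f} LOC per defect")    # ~838
print(f"Approach two: {averaged:,.0f} LOC per defect")  # ~1,078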

The Coverity report used approach two.  When I first saw the data, I instinctively used approach one.  Both approaches are statistically valid; which one you use comes down to what you’re testing for.  Coverity is interested in determining code quality at the individual project level, and the project leads who have submitted code to the scan care much more about their individual project’s results.  I was more interested in measuring code quality changes across the whole set of projects scanned.

Coverity reported that the code quality of the open source projects scanned had improved from 3,333 LOC per defect in 2006 to approximately 4,000 in 2009.  This led Coverity to claim a 16 percent overall improvement in the quality of open source projects actively participating in the scan.

Using approach one, I found that LOC per defect had worsened from 1,982 in 2008 to 1,560 in 2009.  That represents a 21 percent decline in the quality of the overall open source code base included in the scan.

2008: 55,000,000 LOC / 27,752 total defects found = 1,982 LOC per defect
2009: 60,000,000 LOC / 38,453 total defects found = 1,560 LOC per defect
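
Running the report’s totals through the same pooled (approach one) calculation, again my own arithmetic rather than Coverity’s, reproduces those figures and the decline:

loc_per_defect_2008 = 55_000_000 / 27_752  # ~1,982
loc_per_defect_2009 = 60_000_000 / 38_453  # ~1,560
decline = (loc_per_defect_2008 - loc_per_defect_2009) / loc_per_defect_2008
print(f"{decline:.0%} decline")  # ~21%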

Note that the 2006 data was not included in the 2009 report, or else I would have compared 2006 with 2009.

Readers can decide which of these two figures, a 16 percent improvement or 21 percent decline, to use for their purposes.  Both are valid interpretations of the data.

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Microsoft announced a new “Spark” program targeted at small web development shops with fewer than 10 employees. WebsiteSpark provides the following Microsoft development and production software licenses:

  • 3 licenses of Visual Studio 2008 Professional Edition
  • 1 license of Expression Studio 3 (which includes Expression Blend, Sketchflow, and Web)
  • 2 licenses of Expression Web 3
  • 4 processor licenses of Windows Web Server 2008 R2
  • 4 processor licenses of SQL Server 2008 Web Edition
  • DotNetPanel control panel (enabling easy remote/hosted management of your servers)

These licenses are provided at no cost for the first three years.  After this term, the web development company, or individual consultant for that matter, must decide whether to continue using the licenses for $999 or $199 per year.  There’s also the option of stopping use of the licenses altogether, but after three years of building skills on the Microsoft stack, I don’t see a significant portion of participants leaving the program.

To monetize the WebsiteSpark program, Microsoft will help participants find a hosting provider for the websites and web applications they develop for their end clients.  Hosting providers offering a Microsoft runtime stack pay software license fees to Microsoft.  Even if the web development company decides to leave the WebsiteSpark program after the three-year term, its clients, whose websites and web applications are already up and running, will continue to pay for hosting.  As a result, Microsoft will keep collecting license fees from the hosting providers.

Additionally, since only three Visual Studio licenses are included, Microsoft could also generate license revenue from the fourth through tenth employees at the web development company.

So who exactly should care about this program?  The target is probably early-stage web development companies or consultants just starting out.  These companies or consultants likely have .NET skills, but would prefer to see their business take off before paying for software licenses.  In other words, they are Microsoft’s customers to lose.  In the past, upfront cost considerations would have forced them to look at (L)AMP.

The response on ScottGu’s blog post announcing the program has been overwhelmingly positive.  Again, that’s because the targets are Microsoft-friendly ISVs and consultants who now have one less reason to look at (L)AMP.

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

I just read that Google is set to release and open source an Internet Explorer (IE) plug-in that allows IE to use Google Chrome’s HTML rendering and JavaScript engine. Ars Technica writes:

“Google hopes that delivering Chrome’s rendering engine in an IE plug-in will provide a pragmatic compromise for users who can’t upgrade. Web developers will be able to use an X-UA-Compatible meta tag to specify that their page should be displayed with the Chrome renderer plug-in instead of using Internet Explorer’s Trident engine. This approach will ensure that the Chrome engine is only used when it is supposed to and that it won’t disrupt the browser’s handling of legacy Web applications that require IE6 compatibility.”

Maybe I’m being too negative, but I’m wondering what user problem this plug-in truly solves.  Don’t get me wrong, I like Chrome, but it’s not hard to install and run two or more browsers on a machine.  Some companies do restrict installing software, be it a new browser or an IE plug-in.  However, the Google Chrome plug-in doesn’t address this issue, as Ars found out:

“We asked Google if it will be providing packages and tools to make it easier for IT departments to deploy the plug-in. It’s still much too early for that, Google explained, but it’s something that Google might explore when the project matures.”

I could see the value of the plug-in to an IT administrator who doesn’t want to support yet another entire browser.  However, without tools to deploy and manage the plug-in, Google faces a significant hurdle to IT adoption.  And really, who else is this plug-in targeted at if not enterprise users, and the IT administrators who provision and manage IT resources for them?  Home users who want to use Chrome would simply install Chrome, not a Chrome plug-in for IE.

The Google team working on the plug-in “cited the ubiquity of Flash as an example of how the plug-in strategy could have the potential to move the Web forward.”  Well, until Adobe AIR came out, the de facto way to interact with Flash was through a browser.  The de facto way to interact with a browser, on the other hand, is not through another browser.  I’m not sure I’d bet the plug-in’s success on its following Flash’s adoption path.

Bygones.

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Oracle reported fiscal Q1 2010 results yesterday.  Why should open source vendors care?  Well, there’s obviously a concern when the vendor you’re hoping will buy you has a bad quarter.  Ok, ok, all joking aside, the reason open source vendors, and specifically middleware vendors, should care becomes apparent when you compare Oracle’s Applications new license revenue with its Database and Middleware new license revenue.  (By the way, Oracle, a database is just another middleware category in most IT circles.)

In $ Millions                               Fiscal 1Q 2009   Fiscal 1Q 2010   % Change
Applications New License Revenue                       331              317        -4%
Database & Middleware New License Revenue              906              711       -22%

Source: Oracle 1Q2010 Financial release (Pg. 10 of 12)

Oracle’s Applications new license revenue declined by 4 percent year over year.  While not great, a 4 percent decline is understandable in today’s economy.  The real shocker is the 22 percent year-over-year decline in Oracle’s “Database & Middleware” new license revenue.  Clearly the economy plays a role in these results, but the 22 percent drop is roughly five times the decline in new Applications licenses.
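
For the curious, here’s the arithmetic behind those percentages, worked from the figures in the table above (my calculation, not Oracle’s):

apps_2009, apps_2010 = 331, 317    # Applications new license revenue, $M
dbmw_2009, dbmw_2010 = 906, 711    # Database & Middleware new license revenue, $M

apps_change = (apps_2010 - apps_2009) / apps_2009   # ~ -4.2%
dbmw_change = (dbmw_2010 - dbmw_2009) / dbmw_2009   # ~ -21.5%

print(f"Applications: {apps_change:.1%}")                     # -4.2%
print(f"Database & Middleware: {dbmw_change:.1%}")            # -21.5%
print(f"Relative decline: {dbmw_change / apps_change:.1f}x")  # ~5.1x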

Since enterprise applications rely on middleware and databases, what’s driving the 22 percent decline?

Well, first, customers who’ve just received their maintenance bills, after a 20 to 40 percent hike, are thinking twice about deploying new middleware and database workloads with their applications vendor.  I say “applications vendor” because both Oracle and SAP have hiked their maintenance rates.  As a result, customers are taking a second look at their combined applications and middleware spending.

It’s difficult, though not impossible, for an SAP or Oracle applications customer to vote with their wallet and buy applications from a vendor that isn’t raising prices by 20 to 40 percent.  It’s much easier for these customers to purchase middleware from someone other than Oracle or SAP.  Oracle’s revenue strongly suggests that this is exactly what’s occurring.  As RedMonk’s James Governor writes:

“Oracle and SAP – being an app duopoly doesn’t mean you can raise maintenance fees at will. customers are playing hard ball right back at you”

Now would be a great time for open source middleware vendors to target SAP and Oracle applications customers.  One could argue that open source middleware vendors are already doing this, as Oracle’s revenue suggests, and that’s probably true.  So too are commercial enterprise middleware vendors; I know that IBM is doing just this.  But keep in mind that SAP and Oracle applications projects take months, if not years, to implement, so there’s still time to insert open source into SAP and Oracle projects.  Go forth and prosper!  Or at least use the open source option to negotiate better overall prices from SAP and Oracle.

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

Having just completed the annual IBM Intellectual Property training, and while thinking more about the CodePlex Foundation, I saw the following Open World Forum conference track description:

“The growing use of Open Source and economics of outsourcing have made testing for intellectual property (IP) cleanliness and proper satisfaction of legal obligations an essential task for ensuring quality and market acceptability. Real or perceived IP issues can delay product cycles and derail entire projects or business transactions. “

Upon further digging, I realized that Protecode, a company I wrote about back in 2008, was playing a key role in this track.

It goes without saying that enterprises using open source code within their software development process should have policies in place to protect themselves.  Clearly there’s a risk of contaminating a custom enterprise application by misusing open source code.  But in most cases the enterprise is safeguarded unless the derivative work needs to be distributed outside the enterprise’s walls, and with applications delivered over the web, very few enterprises find the need to distribute their internally developed software.  However, whether the enterprise distributes the derivative work or not, there’s also a risk of patent infringement.

That’s where Protecode comes in with its three-pronged approach.

Enterprises can, and should, create policies requiring developers, whether on the enterprise’s payroll or contracted via consultants or off-shoring, to use open source code appropriately.  But policies can’t be the only line of defense; enterprises must also be able to verify, both retroactively and proactively, that the code their developers are writing is free of intellectual property concerns.  Protecode’s three products map to that progression:

  • Step one is analyzing existing software assets, with a product such as Protecode’s Enterprise IP Analyzer.
  • The interim step is testing IP ownership during builds, with a product such as Protecode’s Build IP Analyzer.
  • The real goal, though, should be validating IP on the fly, with a product such as Protecode’s Developer IP Assistant.

I wonder what portion of enterprises have analyzed their existing software assets to validate that they are in fact the rightful IP owners of the entirety of their internally developed software.  Or better yet, what portion of enterprises that have analyzed their software assets were surprised by the results!

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

MySpace announced that it’s open sourcing Qizmt, a MapReduce framework used by the MySpace Data Mining team. Unlike other leading MapReduce frameworks, which are typically implemented in C++ or Java, Qizmt was developed in C#.NET. MySpace’s Chief Operating Officer, Mike Jones, writes:

“This extends the rapid development nature of the .NET environment to the world of large scale data crunching and enables .NET developers to easily leverage their skill set to write MapReduce functions. Not only is Qizmt easy to use, but based on our internal benchmarks, we have shown its processing speeds to be competitive with the leading MapReduce open source projects on a lesser number of cores.”
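
For readers who haven’t written MapReduce code before, here’s a minimal single-process sketch of the map/shuffle/reduce pattern. It’s written in Python purely for illustration; it is not Qizmt’s actual C#.NET API:

from collections import defaultdict

# Map phase: emit (key, value) pairs from each input record.
def map_words(line):
    for word in line.split():
        yield word.lower(), 1

# Reduce phase: combine all values emitted for a given key.
def reduce_counts(word, counts):
    return word, sum(counts)

def run_mapreduce(records, mapper, reducer):
    # Shuffle: group intermediate values by key before reducing.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in groups.items()]

lines = ["MySpace open sources Qizmt", "Qizmt is a MapReduce framework"]
print(run_mapreduce(lines, map_words, reduce_counts))
# [('myspace', 1), ('open', 1), ('sources', 1), ('qizmt', 2), ('is', 1), ...]

A framework like Qizmt or Hadoop takes the same pattern and distributes the map, shuffle, and reduce work across many machines, which is where the large-scale data crunching comes from.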

Count me surprised by the claim that Qizmt performs comparably to the leading open source MapReduce projects, even while using fewer processing cores. I’d love to hear more about those performance benchmarks, but that’s another story.

Here’s why this story caught my attention:

“Many companies leverage Microsoft technologies in their BI platforms and Qizmt is a natural extension to these platforms. As companies deal with continued data growth and deeper analytics needs, Qizmt becomes a more integral part of BI both from a data processing and a data mining perspective.”

I couldn’t agree more. With the number of companies and ISVs that rely on .NET, Qizmt could become an important technology for .NET ISVs and customers. This is where CodePlex.org steps in. By helping Microsoft ISVs and customers get comfortable contributing their IP to Qizmt, CodePlex.org could help Qizmt mature much faster than is likely with MySpace simply hosting the project on Google Code, as is the case today.

For appearance’s sake, CodePlex.org may not want Qizmt to be the first project it shepherds. Qizmt’s strong .NET and Microsoft linkage will not go unnoticed by those of us watching how the CodePlex Foundation shifts from vision to execution. But here’s an important fact: we watchers don’t have skin in the CodePlex Foundation game, and likely won’t for some time, if ever. The CodePlex Foundation should start with an audience that could have skin in the game, namely .NET users. As the Foundation demonstrates its independence and value to the community, the Microsoft/.NET linkage will dissipate. But to get there, the CodePlex Foundation needs to show value to developers and projects soon.

What do you think?

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

With Hadoop World NYC just around the corner on October 2, 2009, I thought I’d share two pieces of news.

First, I’ve received a 25% discount code for readers thinking about attending Hadoop World. Hurry because the code expires on September 21st. Use: http://hadoop-world-nyc.eventbrite.com/?discount=hadoopworld_promotion_infoworld

Second, a Q&A with New York Times software engineer and Hadoop user Derek Gottfrid. Derek’s doing some very cool work with Hadoop and will be presenting at Hadoop World.

Question: What got you interested in Hadoop initially and how long have you been using Hadoop?

Gottfrid: I’ve been working with Hadoop for the last three years. Back in 2007, the New York Times decided to make all the public domain articles from 1851-1922 available free of charge in the form of images scanned from the original paper. That’s eleven million articles available as images in PDF format. The code to generate the PDFs was fairly straightforward, but to get it to run in parallel across multiple machines was an issue. As I wrote about in detail back then, I came across the MapReduce paper from Google. That, coupled with what I had learned about Hadoop, got me started on the road to tackle this huge data challenge.

Question: How do you use Hadoop at the NY Times and why has it been the best solution for what you’re trying to accomplish?

Gottfrid: We continue to use Hadoop as a one-time batch process for tremendous volumes of image data at the New York Times. We’ve also moved up the food chain and use Hadoop for traditional text analytics and web mining. It’s the most cost-effective solution for processing and analyzing large sets of data, such as user logs.

Question: How would you like to see Hadoop evolve? Or, what are the three features you’d most like to see in Hadoop?

Gottfrid: I’d like to see the Hadoop roadmap clarified as well as the individual subprojects to get rid of some of the weird interdependencies so we can get to a legitimate 1.0 release that solidifies the APIs.

Question: What can attendees expect to learn about Hadoop from your preso at Hadoop World?

Gottfrid: In my session, which I’ve titled “Counting, Clustering and other Data Tricks,” I’m planning to take attendees on the journey I’ve gone through at the New York Times, from using Hadoop for simple stuff like image processing to the more sophisticated web analytics use cases I’m working on today.

Question: What are you hoping or expecting to get out of Hadoop World?

Gottfrid: I attended the Hadoop Summit in the Silicon Valley, and now I’m interested to see what people in our eastern region are doing with Hadoop. I’m always open to learning new tricks and tips to better leverage the platform.

I’ll be at Hadoop World to find out how companies are using Hadoop today, and what use cases will pop up in the future.

Will you be there?

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”
