When Cloudera first launched the Cloudera Distribution for Hadoop I wrote: “I’ve typically been down on a support or services-based open source business. However, in the case of Cloudera, this model makes sense — for now.”

Since writing that in March, the string of open source vendors shifting away from selling support to selling products, sometimes under the guise of subscriptions, has marched onward.  I’m happy to report that Cloudera is following the product path with today’s beta release of Cloudera Desktop.

Simply put, Cloudera Desktop makes Hadoop easier to use and manage. This is how the press release describes the product, and after watching the demo, I couldn’t find a more apt description.

Data that enterprises collect daily is critical to business decisions.  However, there’s a problem.  Waiting for IT developers to write analysis algorithms to process the data is sometimes suboptimal, especially for smaller and less complex analysis jobs.  That’s where Cloudera Desktop steps in.

Cloudera Desktop is targeted not only at developers and administrators, but also at business analysts.  Opening up the power of Hadoop to non-developers increases its utility in the enterprise.  Clearly, business analysts, even those with scripting skills, will not be able to design algorithms for complex data analysis tasks, so the need for Hadoop developers and developer tools does not disappear.  But the broad-scale success of Hadoop and Cloudera in the enterprise rides on the coattails of business users, not developers.

To really hit it out of the park, Cloudera will have to make it even easier for business analysts to use Hadoop.  Pre-canned, business-focused scripts are a start.  Getting away from scripting altogether should be the long-term goal for the business user segment.  Let business users create analysis jobs by dragging and dropping artifacts, actions and complex algorithms created by developers onto a job creation pane.  Keep the scripts in the background so the business user can still customize the job’s behavior.  If Cloudera can pull this off and win the business user segment, it’ll be hard to beat in the enterprise Hadoop market.

Good luck to them.

Follow me on twitter at: SavioRodrigues

PS: I should state: “The postings on this site are my own and don’t necessarily represent IBM’s positions, strategies or opinions.”

With Hadoop World NYC just around the corner on October 2, 2009, I thought I’d share two pieces of news.

First, I’ve received a 25% discount code for readers thinking about attending Hadoop World. Hurry because the code expires on September 21st. Use: http://hadoop-world-nyc.eventbrite.com/?discount=hadoopworld_promotion_infoworld

Second, here’s a Q&A with New York Times software engineer and Hadoop user Derek Gottfrid. Derek’s doing some very cool work with Hadoop and will be presenting at Hadoop World.

Question: What got you interested in Hadoop initially and how long have you been using Hadoop?

Gottfrid: I’ve been working with Hadoop for the last three years. Back in 2007, the New York Times decided to make all the public domain articles from 1851-1922 available free of charge in the form of images scanned from the original paper. That’s eleven million articles available as images in PDF format. The code to generate the PDFs was fairly straightforward, but to get it to run in parallel across multiple machines was an issue. As I wrote about in detail back then, I came across the MapReduce paper from Google. That, coupled with what I had learned about Hadoop, got me started on the road to tackle this huge data challenge.

Question: How do you use Hadoop at the NY Times and why has it been the best solution for what you’re trying to accomplish?

Gottfrid: We continue to use Hadoop as a one-time batch process for tremendous volumes of image data at the New York Times. We’ve also moved up the food chain and use Hadoop for traditional text analytics and web mining. It’s the most cost-effective solution for processing and analyzing large sets of data, such as user logs.

Question: How would you like to see Hadoop evolve? Or, What are the 3 features you’d most like to see in Hadoop?

Gottfrid: I’d like to see the Hadoop roadmap clarified as well as the individual subprojects to get rid of some of the weird interdependencies so we can get to a legitimate 1.0 release that solidifies the APIs.

Question: What can attendees expect to learn about Hadoop from your presentation at Hadoop World?

Gottfrid: In my session, which I’ve titled “Counting, Clustering and other Data Tricks,” I’m planning to take attendees on the journey I’ve gone through at the New York Times, from using Hadoop for simple stuff like image processing to the more sophisticated web analytics use cases I’m working on today.

Question: What are you hoping or expecting to get out of Hadoop World?

Gottfrid: I attended the Hadoop Summit in Silicon Valley, and now I’m interested to see what people in our eastern region are doing with Hadoop. I’m always open to learning new tricks and tips to better leverage the platform.

I’ll be at Hadoop World to find out how companies are using Hadoop today, and what use cases will pop up in the future.

Will you be there?

Cloudera is clearly making a credible play to become the commercial brand associated with Apache Hadoop.  Not only did Hadoop founder Doug Cutting recently join Cloudera from Yahoo!, but Cloudera is also set to announce the inaugural Hadoop World Conference, scheduled for October 2nd in NYC.  The conference is being organized by Cloudera founder Christophe Bisciglia with sponsorship from Yahoo!, IBM, Intel, eHarmony and Booz Allen Hamilton.  The tentative agenda has presentations from, amongst others, Cloudera, Yahoo!, Facebook, IBM, Microsoft, eBay, Visa, About.com, NYTimes and JPMorgan Chase.

Christophe is quoted:

“Hadoop is changing the way that users manage and process ever-increasing volumes of data.  Hadoop World in New York City will showcase this powerful new open source technology, with special focus on how traditional enterprises use it to solve real business problems.”

It’s very smart to host this conference in NYC, where it’ll be easy for IT decision makers and developers from the financial, publishing, advertising and telecom industries to learn more about enterprise use of Hadoop.  In another smart move to drive enterprise adoption, Hadoop training for developers, administrators and managers will be available in conjunction with the conference.  Oh, and did I mention that Cloudera will be offering this training?  Smart move on Cloudera’s part to position themselves as the go-to Hadoop vendor when enterprises want to leverage Hadoop more extensively.

Looks like an interesting conference.  I might make my way over to the “managers” education session if I can swing the travel dates.

News from today’s Hadoop Summit ’09 got me thinking about the importance of open community again.  According to Joey Echeverria’s tweet from Hadoop Summit ’09, Yahoo! representatives feel:

“Yahoo!: Easier to take an open source project and add steam to it rather than write something from scratch. #hadoopsummit09”

What a difference a year and a half makes.  Back in December 2007, there were roughly 10 Yahoo! employees working on Hadoop, and only five or six outside contributors.  Hadoop founder Doug Cutting said at the time:

“It’s dominated by Yahoo, it would be great for the project to have a more balanced team.”

Today, Hadoop core has 185 contributors, only 30% of whom are Yahoo! employees.

Also, Cloudera, a commercial company aiming to bring Hadoop to the enterprise, has just contributed a new database tool for Hadoop.  The tool, SQOOP, enables users to directly import large database tables into Hadoop.  According to Cloudera founder Christophe Bisciglia:

“SQOOP is a tool that enterprise customers were demanding,” Bisciglia said. “Enterprises have lots of data in existing databases, and if you can’t give them a way to interact with that data, Hadoop isn’t as useful as it could be.”
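The core idea behind SQOOP, pulling rows out of a relational database and into flat files that Hadoop jobs can consume, can be sketched in a few lines of plain Python. This is only an illustration of the concept, not SQOOP itself: it uses SQLite as a stand-in database and writes to a local file, whereas SQOOP speaks JDBC and writes into HDFS, and the table and column names here are made up for the example.

```python
import sqlite3

def import_table(conn, table, out_path):
    # Dump every row of `table` as comma-delimited text, the kind of
    # flat file a Hadoop MapReduce job can take as its input
    with open(out_path, "w") as out:
        for row in conn.execute(f"SELECT * FROM {table}"):
            out.write(",".join(str(col) for col in row) + "\n")

# Hypothetical table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "acme"), (2, "globex")])
import_table(conn, "orders", "orders.txt")
```

The real tool adds what an enterprise actually needs on top of this: parallel extraction across mappers, type mapping, and direct loading into HDFS.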

Much like Kernel.org, Apache HTTPD, and Eclipse before it, a meritocratic, open community is unlocking opportunities for the ecosystem, which in turn is helping Hadoop evolve a lot faster than it could within any one vendor’s corporate walls.

Today, it appears that most of the contributing vendors are collaborators, with little, if any, head-to-head competition.  That will surely change over time.  But that’s a good thing. More vendors, more developers, more ideas, more innovation.

One can’t help but wonder what Google thinks of Hadoop’s progress.

Cloudera, an open source startup working to expand the use of Apache Hadoop, made two announcements today.  First, it has secured $5 million in Series A funding.  Second, it announced the availability of the Cloudera Distribution for Hadoop.

What’s Hadoop? It’s a platform for developing applications that can process vast datasets while scaling to the levels that companies like Google, Facebook and Yahoo require.  Hadoop is an Apache project that:

“implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.”
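
The MapReduce model the quote describes can be sketched in a few lines of plain Python. This is a toy, single-machine word count for illustration; Hadoop runs the same three phases (map, shuffle, reduce) in parallel across a cluster, with HDFS holding the input splits.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["Hadoop scales", "Hadoop processes data where it is located"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["hadoop"] == 2
```

The division of labor is the point: the framework owns the shuffle and the distribution, so the developer writes only the map and reduce functions.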

Cloudera sees a market for Hadoop in enterprise situations ranging from analyzing genome and protein data to oil and gas exploration and financial processing.  The Cloudera Distribution for Hadoop is open source and licensed under the Apache Software License 2.0.  Cloudera intends to drive revenue from support and implementation services.  I’ve typically been down on a support or services-based open source business.  However, in the case of Cloudera, this model makes sense, at least for now.  The number of people who can implement a highly scalable application that processes petabytes of data using the MapReduce programming model, and who don’t work for Google, Yahoo, Facebook and the like, can probably be counted on two hands.  There is a degree of education and hand-holding that Cloudera needs to do while enterprise developers explore writing this style of application.

Take a look at the investors and it’s easy to predict that Mike Olson and team won’t be independent for long:

In addition to Accel Partners, investors in Cloudera include Mike Abbott (senior vice president, Palm), David desJardins (early Google employee), Caterina Fake (co-founder, Flickr), David Gerster (entrepreneur), Youssri Helmy (entrepreneur), Dr. Qi Lu (president of the Online Services Group, Microsoft; former executive vice president, Yahoo!), Marten Mickos (former CEO, MySQL), In Sik Rhee (former chief tactician, Opsware; founder, Loudcloud), Jeff Weiner (president, LinkedIn; former senior vice president, Yahoo!), Dick Williams (CEO, Illustra; former CEO, Wily Technology), Gideon Yu (Facebook CFO; former senior vice president, Yahoo!; CFO, YouTube).

All the best to the Cloudera team.