ApacheCon 2009: Apache Hadoop with Aaron Kimball

Aaron Kimball is the first engineer hired by Cloudera, which brings Apache Hadoop to the enterprise. Before his engineering role at Cloudera, Kimball was a graduate student at the University of Washington, where he earned his master’s degree in Computer Science with a focus on programming languages. It was in graduate school that Kimball was introduced to Hadoop and started studying and working with the system in depth. During that time, Kimball also interned at Google, where he continued to work with Hadoop and helped develop educational resources for universities to use in teaching Hadoop to undergraduate students.

Kimball spoke with The Bitsource about some of the use cases of Hadoop and his upcoming Hadoop training at ApacheCon.


Aaron Kimball: I am the first engineer hired by Cloudera; I’ve been here for almost exactly a year now, and it’s been great watching the company evolve even in this relatively short time. Before Cloudera I was a graduate student at the University of Washington. I earned my master’s degree in Computer Science with a focus on programming languages. It was while I was in graduate school that I was turned on to Hadoop and started studying and working with the system in depth. During that time I also had a Google internship where I continued to work with Hadoop and helped develop educational resources for universities to use to teach Hadoop to undergrads.

Q: Could you give some general background on the Hadoop training being delivered at the event?

Aaron Kimball: At ApacheCon, we’ll be providing two days of training, which make up Cloudera’s “basic” and “intermediate” developer training courses. We’ll discuss why Hadoop is an important tool, how it’s built, and how its architecture affects the applications you write. The first day goes on to cover some basic algorithms expressed as MapReduce applications and includes an exercise where you’ll practice implementing a MapReduce program yourself.

The second day covers the broader Hadoop ecosystem, specifically the Sqoop, Pig, and Hive tools. Sqoop helps users import databases into Hadoop. The morning is focused on the use of this tool and others in creating data pipelines that involve Hadoop for large-scale data processing. The afternoon covers Pig and Hive: higher-level languages that allow users to efficiently describe common data processing idioms and perform ad-hoc queries over their datasets. We’ll continue to perform exercises and get familiar with these tools as well.

Q: What are some cool things being done with Hadoop?

Aaron Kimball: Hadoop is in use across many different industries. Web properties like Yahoo! and Facebook use Hadoop to power their sites. Yahoo builds its search index using Hadoop. Facebook performs deep social network analytics. It’s in use in banking, credit card processing, astrophysics research, genome processing, advertising, and several other fields.

Q: What is the largest implementation of Hadoop at the moment?

Aaron Kimball: Yahoo continues to hold the title for largest Hadoop deployment. They have a number of clusters, but their biggest has 32,000 CPU cores and 16 PB of disk storage spread across approximately 4,000 nodes. Their total storage under Hadoop clusters amounts to some 86 PB of disk space.

Q: What makes Hadoop unique for data processing?

Aaron Kimball: Hadoop is the most scalable data processing platform generally available today. Hadoop clusters can store and process orders of magnitude more data than other parallel processing systems or data warehouses. The MapReduce programming model allows developers to focus on building the analytic algorithms that accomplish their high-level goals, while the Hadoop framework concentrates on performing this computation efficiently. This separation of concerns allows for faster application development time and reliably performs computation on a scale that other distributed systems do not match.
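
As an illustration of the separation of concerns described above, here is a minimal single-process sketch of the MapReduce model in Python, using word count as the example. It is not Hadoop code and was not part of the interview. The function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are hypothetical and chosen only for this sketch. The developer writes the map and reduce functions, and `shuffle` stands in for the grouping work that a framework such as Hadoop would perform across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (written by the developer): emit (word, 1) per word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key. In a real cluster the framework does this
    between the map and reduce phases; here it is a plain dictionary."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step (written by the developer): sum the counts for a word."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Drive the three phases in order: map, shuffle, reduce."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop stores data", "Hadoop processes data"]
    print(run_job(sample))
```

Only `map_words` and `reduce_counts` carry application logic here. In this sketch, changing the analysis means changing those two functions while the driver and the grouping step stay the same.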

Q: What skills will attendees walk away with after taking your Hadoop training at ApacheCon?

Aaron Kimball: The first day focuses on gaining a familiarity with Hadoop and the ability to write basic MapReduce applications. The second day focuses on building data pipelines, and introductory use of higher-level systems such as Hive and Pig, which are built on top of Hadoop. We’ll be using a virtual machine with Hadoop preinstalled along with some exercises and datasets; attendees will understand how the various Hadoop components fit together and spend some time practicing working in this environment.

Q: Would you recommend this course to people simply interested in Hadoop out of curiosity? Does it have applications for personal or research use?

Aaron Kimball: All attendees are welcome; you don’t need to have a specific use case in mind to get something out of the training. After working through it, you’ll probably have a much better understanding of how Hadoop applies to the sorts of problems you work with.

The course is pretty technical. It doesn’t require that you understand any particular programming language (though familiarity with Java will be helpful). But you should have prior programming experience.

There are a number of research programs in computer science, physics, and other departments which make use of Hadoop clusters in their work.

It’s unlikely that there’s a “personal use” for Hadoop. The data volumes at which Hadoop becomes interesting are pretty large: dozens of gigabytes on the low end. Most people don’t have a casual need for this scale of batch data processing. But I think that the number of job markets where batch data processing is important will continue to grow. Anyone interested in building out their development skill set, or in alternatives to databases and data warehouses for large-scale data storage and processing, is welcome to come and learn.


Apache Hadoop http://hadoop.apache.org/

Cloudera http://www.cloudera.com/

Aaron Kimball’s ApacheCon Profile http://www.us.apachecon.com/c/acus2009/speakers/255

Register for ApacheCon with Code OAK-Bit
