Thursday, January 28, 2010

Cloudera & Facebook on Hadoop and Hive

This week I attended a very interesting meeting of the San Francisco Bay Area Chapter of the ACM on the topics of Hadoop and HIVE. I was not the only one interested by MapReduce related projects, since the meeting nicely hosted by LinkedIn at their office of Mountain View, had more than 250 people.

Dr. Amr Awadallah from Cloudera did a very good introduction to Hadoop since a lot of attendees were not very familiar with this java open source version of MapReduce. It is interesting to mention that Desktop product offered by Cloudera is free. Amr explained that Cloudera business model is to offer professional services, training and specific for fees features out of the core of the main product.

Cloudera web site has a lot of good training material on Hadoop and MapReduce. Amr mentioned for example that Hadoop was used at LinkedIn to create and store the recommendations on the fly "People you may know" whereas the profile information is managed by a more traditional RDBMS data store.

They were a couple of questions related to the behavior of Hadoop on top of full virtualization products such as those offered by VMWare. The answer from Amr was first to compare the virtualization of platforms and the parallelism involved in MapReduce/Hadoop. In a way the former architecture goal is to have multiple virtual machines running on the same hardware (e.g. a large mainframe or blade boxes) whereas the later is to have an initial processing and storing job done on multiple cheap commodity two rack units (RU) “pizza” boxes at the same time. So in a way these architectures are completely opposite. Of course it is not fair to try to compare the complete virtualization of complete operating systems such as Windows or Linux and the management of basic map and reduce operations even though they have common characteristics (a file system and some processing capabilities).

However some people do use VMWare images clusters to run Hadoop MapReduce tasks and the question is “is it efficient?”. The answer lies in the way network performance and I/O in general is handled by both the images and the Hadoop scripts.

They was also an interesting question about the fact the Google has several patents on MapReduce this might be an obstacle to the development of open source product on top of hadoop. Amr did not seem to really worry about this.

The second presentation was from Ashish Thusoo from Facebook. Some interesting numbers and statistics about the volume of data processed everyday by Facebook (e.g. already 200GB/day in march 2008). Ashish pointed out that it was more interesting for Facebook to have simple algorithms running on large amount of data than complex data mining algorithms running on small volumes. The benefits were more important and the company was learning much more on their users behaviors and profiles. It was back in 2008 that Facebook started to experiment with MapReduce and Hadoop as an alternative to very expensive existing data mining solutions. One of the issue with Hadoop was the complexity of development and the lack of skill among its teams. This is why Facebook started to look at ways to wrap Hadoop in a more SQL like friendly layer. The result is HIVE which is now open source, although Facebook has some proprietary components, especially on the UI side.

There were some good questions about data skew issues with Hive and Hadoop as well as comparison between HIVE and ASTER. Like Amr did with virtualization and Hadop, Ashish tried to oppose both approaches in simple terms: in a way ASTER is MapReduce applied on top of a RDBMS layer whereas HIVE is a RDBMS layer running on top of MapReduce.

Both presentations:
  • Hadoop: Distributed Data Processing (Amr Awadallah)
  • Facebook’s Petabyte Scale Data Warehouse (Ashish Thusoo)
are available as PDF files on the ACM web site.