
This week I attended a very interesting meeting of the San Francisco Bay Area Chapter of the ACM on Hadoop and Hive. I was clearly not the only one interested in MapReduce-related projects: the meeting, kindly hosted by LinkedIn at their Mountain View office, drew more than 250 people.

Dr. Amr Awadallah from Cloudera gave a very good introduction to Hadoop, since many attendees were not very familiar with this open source Java implementation of MapReduce. It is worth mentioning that the Desktop product offered by Cloudera is free. Amr explained that Cloudera's business model is to offer professional services, training, and specific paid features outside the core of the main product.
The Cloudera web site has a lot of good training material on Hadoop and MapReduce. Amr mentioned, for example, that LinkedIn uses Hadoop to create and store the "People you may know" recommendations on the fly, whereas the profile information is managed by a more traditional RDBMS data store.
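For readers who, like many attendees, are new to the model, here is a minimal sketch of the map, shuffle, and reduce steps as a word count in plain Python. It is not Hadoop code and not taken from the talk; the function names and sample lines are made up for illustration, and a real Hadoop job would distribute each phase across many machines.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, which the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Collapse all the values seen for one key into a single count.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input, for illustration only.
counts = word_count(["Hadoop runs MapReduce", "Hive runs on Hadoop"])
print(counts)
```

The point of the split is that every map call and every reduce call is independent of the others, which is what lets a framework such as Hadoop run them in parallel.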

However, some people do run Hadoop MapReduce tasks on clusters of VMware images, and the question is: “is it efficient?” The answer lies in the way network performance, and I/O in general, is handled by both the images and the Hadoop scripts.
There was also an interesting question about the fact that Google holds several patents on MapReduce, which might be an obstacle to the development of open source products on top of Hadoop. Amr did not seem to be really worried about this.

There were some good questions about data skew issues with Hive and Hadoop, as well as about the comparison between Hive and ASTER. As Amr did with virtualization and Hadoop, Ashish tried to contrast both approaches in simple terms: in a way, ASTER is MapReduce applied on top of an RDBMS layer, whereas Hive is an RDBMS layer running on top of MapReduce.
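To make the data skew question concrete, here is a small sketch of how one very frequent key can overload a single reducer when keys are assigned by hash. It is an illustration under stated assumptions, not something shown at the meeting: the key names, the reducer count, and the use of CRC32 as the hash are all made up for the example (Hadoop's default partitioner uses the key's hash code modulo the number of reducers, which has the same effect).

```python
import zlib
from collections import Counter


def partition(key, num_reducers):
    # Deterministic stand-in for a hash partitioner: hash(key) mod reducers.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def reducer_load(keys, num_reducers):
    # Count how many records each reducer would receive.
    load = Counter({reducer: 0 for reducer in range(num_reducers)})
    for key in keys:
        load[partition(key, num_reducers)] += 1
    return load


# Hypothetical workload: one "hot" key accounts for 90% of the records.
keys = ["user_hot"] * 9000 + [f"user_{i}" for i in range(1000)]
load = reducer_load(keys, 4)

for reducer, records in sorted(load.items()):
    print(f"reducer {reducer}: {records} records")
```

Because all records with the same key must go to the same reducer, one of the four reducers receives at least 9,000 of the 10,000 records, and the whole job waits for it to finish.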
Both presentations:
- Hadoop: Distributed Data Processing (Amr Awadallah)
- Facebook’s Petabyte Scale Data Warehouse (Ashish Thusoo)