
Friday, February 26, 2010

MapReduce: an opportunity for Health and BioSciences Applications?

HealthCare and BioScience software products and solutions have relied on Database Management Systems (DBMS) for their back-end storage and processing for years, like most other domains where performance, scalability, security, extensibility, auditing capabilities and maintainability are critical.

In the past few years, alternative or complementary technologies such as MapReduce and Hive have emerged, originally created to meet the needs of extremely high-volume web applications such as Google, Facebook or LinkedIn. Many people, especially engineers, are now wondering whether these technologies could be used in HealthCare and BioSciences.

More and more job openings outside the Social Network or SEO sphere now mention MapReduce and Hadoop among their required or "nice to have" skills, including at HealthCare and BioScience companies. In fact, at a recent talk of the Bay Area Chapter of the ACM on Hadoop and Hive, even though the talk was quite technical, there were a few venture capitalists in the crowd checking whether the topic was only hype or could potentially bring big ROI. Healthcare and biotechnologies were definitely on their minds.

Why then would the MapReduce paradigm be a good candidate to provide the "next quantum leap" for HealthCare and BioSciences?

In HealthCare, as more and more users, patients and professionals upload data to applications such as PHRs and EMRs, there is a need to parse, clean and reconcile extremely large amounts of data that might initially be stored in log files. Medical observations from patients with chronic diseases, such as blood pressure or blood glucose readings, are good candidates, especially when they are uploaded automatically from medical devices. The aggregation of data coming from a potentially large number of sources also makes these workloads more suitable to a Map and Reduce processing paradigm than to DBMS-based data mining tasks.
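To make the idea concrete, here is a minimal sketch of such an aggregation using Hadoop's Java MapReduce API: it averages uploaded glucose readings per patient. The "patientId,timestamp,mg/dL" log format and all names are made-up examples, not from any real product.

```java
// A minimal sketch, assuming a hypothetical device log format
// "patientId,timestamp,mg/dL" (one reading per line): averages
// uploaded glucose readings per patient.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GlucoseAverage {

  // Map: parse one log line, emit (patientId, reading).
  public static class ParseMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length != 3) return;  // skip malformed lines
      try {
        context.write(new Text(fields[0]),
                      new DoubleWritable(Double.parseDouble(fields[2])));
      } catch (NumberFormatException e) {
        // a real pipeline would count or log unparseable readings
      }
    }
  }

  // Reduce: average all readings seen for one patient.
  public static class AverageReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text patient, Iterable<DoubleWritable> readings,
        Context context) throws IOException, InterruptedException {
      double sum = 0;
      long count = 0;
      for (DoubleWritable r : readings) {
        sum += r.get();
        count++;
      }
      context.write(patient, new DoubleWritable(sum / count));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "glucose-average");
    job.setJarByClass(GlucoseAverage.class);
    job.setMapperClass(ParseMapper.class);
    job.setReducerClass(AverageReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a jar, this would be launched with something like "hadoop jar glucose.jar GlucoseAverage <input> <output>", and the framework takes care of distributing the work across the cluster.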

HealthCare decision makers might be hesitant to adopt these new technologies as long as they have concerns about security, confidentiality and certification against standards such as HL7 (see CCHIT and HITSP). However, with the overall reforms in progress in HealthCare, it will be interesting to see whether MapReduce becomes part of the technical package, for the benefit not only of patients and care givers but of all healthcare actors, including payers and various service providers.

BioSciences (drug discovery, meta-genomics, bioassay activities ...) are also a good candidate for MapReduce. In addition to the fact that BioScience applications also deal with large amounts of data (e.g. biological sequences such as DNA, RNA and proteins), much of that data is semi-structured, semantically rich, and most likely better represented as an RDF data model than as a set of database tables (e.g. see "Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce"). Even though databases have made progress in storing and processing XML, MapReduce is more suitable for very fast processing and aggregation of large numbers of key-value elements.
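As an illustration, counting fixed-length DNA substrings (k-mers) is exactly this kind of key-value aggregation. The sketch below shows just the Mapper and Reducer; the job driver would be wired up as in the previous sketch, and a one-sequence-per-line input format is assumed purely for simplicity.

```java
// Sketch only: count every k-length window over DNA sequences,
// assuming one raw sequence per input line.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KmerCount {
  static final int K = 8;  // arbitrary k-mer length for illustration

  // Map: emit (k-mer, 1) for every window of the sequence.
  public static class KmerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String seq = line.toString().trim().toUpperCase();
      for (int i = 0; i + K <= seq.length(); i++) {
        context.write(new Text(seq.substring(i, i + K)), ONE);
      }
    }
  }

  // Reduce: sum the counts for each distinct k-mer.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text kmer, Iterable<IntWritable> counts,
        Context context) throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) total += c.get();
      context.write(kmer, new IntWritable(total));
    }
  }
}
```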

Another element is price and return on investment (ROI), especially for startups: implementing MapReduce over a cloud-based infrastructure using open source frameworks such as Hadoop and Hive can be an attractive economic proposition for a CTO.

Both fields can also take advantage of applications of MapReduce in areas other than hard-core technology, such as brand management, sales and supply chain optimizations, which have been used with success in other domains.




Thursday, January 28, 2010

Cloudera & Facebook on Hadoop and Hive

This week I attended a very interesting meeting of the San Francisco Bay Area Chapter of the ACM on the topics of Hadoop and Hive. I was not the only one interested in MapReduce-related projects: the meeting, nicely hosted by LinkedIn at their Mountain View office, drew more than 250 people.

Dr. Amr Awadallah from Cloudera gave a very good introduction to Hadoop, since many attendees were not very familiar with this open source Java implementation of MapReduce. It is interesting to mention that the Desktop product offered by Cloudera is free. Amr explained that Cloudera's business model is to offer professional services, training and specific for-fee features beyond the core of the main product.


Cloudera's web site has a lot of good training material on Hadoop and MapReduce. Amr mentioned, for example, that Hadoop is used at LinkedIn to compute and store the "People you may know" recommendations, whereas the profile information is managed by a more traditional RDBMS data store.
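For the curious, counting friends-of-friends is the textbook way to bootstrap such a feature with MapReduce. The toy sketch below is in no way LinkedIn's actual implementation: the tab-separated adjacency-list input format is assumed, and filtering out pairs who are already connected is omitted for brevity.

```java
// Toy "people you may know" sketch: two users who share many
// friends are candidate recommendations. Assumed input format:
// "user<TAB>friend1,friend2,...". The driver is wired up as in
// the earlier sketches.
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FriendOfFriend {

  // Map: every pair of one user's friends has that user in common,
  // so emit the pair (in both orders) with a count of 1.
  public static class PairMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      if (parts.length != 2) return;
      String[] friends = parts[1].split(",");
      for (int i = 0; i < friends.length; i++) {
        for (int j = i + 1; j < friends.length; j++) {
          context.write(new Text(friends[i] + ":" + friends[j]), ONE);
          context.write(new Text(friends[j] + ":" + friends[i]), ONE);
        }
      }
    }
  }

  // Reduce: the total per pair is their number of shared friends,
  // a simple score for ranking recommendations.
  public static class CountReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> ones,
        Context context) throws IOException, InterruptedException {
      int common = 0;
      for (IntWritable o : ones) common += o.get();
      context.write(pair, new IntWritable(common));
    }
  }
}
```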


There were a couple of questions about the behavior of Hadoop on top of full virtualization products such as those offered by VMware. Amr's answer was first to compare the virtualization of platforms with the parallelism involved in MapReduce/Hadoop. The goal of the former architecture is to have multiple virtual machines running on the same hardware (e.g. a large mainframe or blade boxes), whereas the goal of the latter is to have processing and storage jobs running on many cheap commodity two rack unit (RU) "pizza" boxes at the same time. So in a way these architectures are complete opposites. Of course it is not entirely fair to compare the full virtualization of complete operating systems such as Windows or Linux with the management of basic map and reduce operations, even though they have common characteristics (a file system and some processing capabilities).


However, some people do run Hadoop MapReduce tasks on clusters of VMware images, and the question is: "is it efficient?". The answer lies in the way network performance, and I/O in general, is handled by both the images and the Hadoop jobs.


There was also an interesting question about whether the fact that Google holds several patents on MapReduce might be an obstacle to the development of open source products on top of Hadoop. Amr did not seem really worried about this.

The second presentation was from Ashish Thusoo of Facebook. He shared some interesting numbers and statistics about the volume of data processed every day by Facebook (e.g. already 200GB/day in March 2008). Ashish pointed out that it was more interesting for Facebook to have simple algorithms running on large amounts of data than complex data mining algorithms running on small volumes. The benefits were greater, and the company was learning much more about its users' behaviors and profiles. It was back in 2008 that Facebook started to experiment with MapReduce and Hadoop as an alternative to very expensive existing data mining solutions. One of the issues with Hadoop was the complexity of development and the lack of skills among its teams. This is why Facebook started to look at ways to wrap Hadoop in a friendlier, more SQL-like layer. The result is Hive, which is now open source, although Facebook keeps some proprietary components, especially on the UI side.
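To give a flavor of the difference, the query below replaces what would otherwise be a custom Java MapReduce job. This sketch uses Hive's JDBC client against a hypothetical page_views table (the table, columns and host are all made up); Hive compiles the SQL-like statement into MapReduce jobs behind the scenes.

```java
// Sketch: issue a HiveQL query through Hive's JDBC client.
// Table "page_views" and its columns are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con = DriverManager.getConnection(
        "jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = con.createStatement();
    // One line of SQL instead of a hand-written Mapper and Reducer.
    ResultSet rs = stmt.executeQuery(
        "SELECT userid, COUNT(1) AS views " +
        "FROM page_views GROUP BY userid");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}
```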

There were some good questions about data skew issues with Hive and Hadoop, as well as comparisons between Hive and Aster. Like Amr did with virtualization and Hadoop, Ashish contrasted the two approaches in simple terms: in a way, Aster is MapReduce applied on top of an RDBMS layer, whereas Hive is an RDBMS layer running on top of MapReduce.

Both presentations:
  • Hadoop: Distributed Data Processing (Amr Awadallah)
  • Facebook’s Petabyte Scale Data Warehouse (Ashish Thusoo)
are available as PDF files on the ACM web site.