Friday, February 26, 2010

MapReduce: an opportunity for Health and BioSciences Applications?

HealthCare and BioScience software products and solutions have for years relied on Database Management Systems (DBMS) for their back-end storage and processing, like most other domains where performance, scalability, security, extensibility, auditing capabilities and maintenance are critical.

In the past few years, alternative or complementary technologies such as MapReduce and Hive have emerged from the needs of extremely high-volume web applications at companies such as Google, Facebook and LinkedIn. Many people, especially engineers, are now wondering whether these technologies could be used in HealthCare and BioSciences.

More and more job openings outside the Social Networks or SEO sphere now mention MapReduce and Hadoop in their required or "nice to have" skills, including openings at HealthCare and BioScience companies. In fact, at a recent talk on Hadoop and Hive from the Bay Area Chapter of the ACM, even though the talk was quite technical, there were a few venture capitalists in the crowd checking whether the topic was only hype or could potentially bring a big ROI. Healthcare and biotechnologies were definitely on their minds.

Why then would the MapReduce paradigm be a good candidate to provide the "next quantum leap" for HealthCare and BioSciences?

In HealthCare, as more and more users, patients and professionals upload data to applications such as PHRs and EMRs, there is a need to parse, clean and reconcile extremely large amounts of data that might initially be stored in log files. Medical observations such as blood pressure or blood glucose readings from patients with chronic diseases might be good candidates for this, especially when they are uploaded automatically from medical devices. Also, because the data is aggregated from a potentially large number of sources, the work is better suited to a Map and Reduce processing paradigm than to DBMS-based data mining tasks.
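To make the parse, clean and aggregate idea concrete, here is a minimal single-process sketch of the Map and Reduce pattern in plain Python. It is not Hadoop code: the log format (patient_id,measurement,value), the sample readings and all function names are assumptions made up for illustration, and a real job would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict

# Hypothetical log format: patient_id,measurement,value (one reading per line).
# The readings below are invented sample data, not real observations.
LOG_LINES = [
    "p1,glucose,100",
    "p2,glucose,140",
    "p1,glucose,120",
    "p2,glucose,bad-value",  # malformed value, dropped during cleaning
    "p1,systolic_bp,130",
    "",                      # blank line, dropped during cleaning
]


def map_reading(line):
    """Map step: parse and clean one line, emit ((patient, measurement), value)."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 3:
        return []  # wrong number of fields
    patient, measurement, raw_value = parts
    try:
        value = float(raw_value)
    except ValueError:
        return []  # value is not a number
    return [((patient, measurement), value)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_readings(key, values):
    """Reduce step: aggregate the readings of one patient and measurement."""
    return key, {"count": len(values), "mean": sum(values) / len(values)}


def run_job(lines):
    """Run map, shuffle and reduce in sequence inside one process."""
    pairs = [pair for line in lines for pair in map_reading(line)]
    return dict(reduce_readings(k, v) for k, v in shuffle(pairs).items())


summary = run_job(LOG_LINES)
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be spread over many workers, which is what makes the paradigm attractive for readings arriving from a large number of devices.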

HealthCare decision makers might be hesitant to use these new technologies as long as they have concerns related to security, confidentiality and certification to standards such as HL7 (see CCHIT and HITSP). However, with the overall reforms in progress in HealthCare, it will be interesting to see whether MapReduce will be part of the technical package for the benefit of not only patients and caregivers, but all healthcare actors, including payers and various service providers.

BioSciences (drug discovery, metagenomics, bioassay activities ...) is also a good candidate for MapReduce. BioScience applications also deal with large amounts of data (e.g. biological sequences such as DNA, RNA, proteins), and in addition a lot of that data is semi-structured and semantically rich, so it is most likely better represented with an RDF data model than with a set of database tables (e.g. see "Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce"). Even though databases have made progress in storing and processing XML, MapReduce is better suited to very fast processing and aggregation of large numbers of key-value elements.
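As a rough illustration of how RDF-style data lines up with key-value processing, the sketch below treats each triple as a (subject, (predicate, object)) pair and groups the pairs by subject. It is plain Python rather than Hadoop, it is not the method of the paper cited above, and the triples, identifiers and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical (subject, predicate, object) triples; identifiers are made up.
TRIPLES = [
    ("gene:G1", "encodes", "protein:P1"),
    ("gene:G1", "locatedOn", "chromosome:7"),
    ("protein:P1", "interactsWith", "protein:P2"),
    ("gene:G2", "encodes", "protein:P2"),
]


def map_triple(triple):
    """Map step: key each triple by its subject."""
    subject, predicate, obj = triple
    yield subject, (predicate, obj)


def reduce_subject(subject, pairs):
    """Reduce step: merge all (predicate, object) pairs of one subject."""
    record = defaultdict(list)
    for predicate, obj in pairs:
        record[predicate].append(obj)
    return subject, dict(record)


def group_by_subject(triples):
    """Run map, group by key, and reduce inside one process."""
    groups = defaultdict(list)
    for triple in triples:
        for key, value in map_triple(triple):
            groups[key].append(value)
    return dict(reduce_subject(s, p) for s, p in groups.items())


records = group_by_subject(TRIPLES)
```

No table schema has to be declared up front: a subject with a new kind of predicate simply produces a record with one more key.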

Another element is price and return on investment (ROI). Especially for startups, implementing MapReduce on a cloud-based infrastructure with an open source framework such as Hadoop and Hive can be an attractive economic proposition for a CTO.

Both fields can also take advantage of applications of MapReduce outside hard-core technology, in areas such as brand management, sales and supply chain optimization, where it has been used with success in other domains.