
Leveraging Hadoop to Massively Scale R Machine Learning Computations on Very Large Data Sets

Technology for massively scaling R machine learning and data engineering workloads is still in its infancy. R users are limited by the memory capacity of the machines on which R traditionally runs. As a result, many highly accurate R machine learning libraries and programs have been shelved and rewritten in other programming languages to make them more scalable and better adapted to today's big data infrastructures. However, a few technology initiatives have now started to make real progress in this direction.

In this blog post, I will describe how you can use IBM's BigInsights platform and its BigR value-added service to push R processing down to a Hadoop cluster. The aim is to build a distributed computing environment, based on Apache Hadoop and HDFS, that is capable of running many simultaneous machine learning algorithms on massive data sets.

Below is the reference architecture diagram of an implementation that I will use to explain how such a computing environment could be built. Please note that multiple design options are possible, and the number of management and data nodes depends on the size of the processing environment, the workload characteristics, and the availability requirements. It is best to contact IBM for help with the actual capacity planning and sizing of the clusters.

[Figure: Reference architecture of the R on Hadoop environment built with IBM BigInsights and BigR]

As the above diagram shows, the R on Hadoop processing environment built using IBM's BigInsights consists of many different components. I have highlighted IBM's value-added services in blue, while the open source components are in black. Now let's look at what a typical setup looks like.

IBM BigInsights:

This is a bundle of various big data projects and services, containing a mix of open source components such as Hive, HBase, HDFS, Hadoop MapReduce, Flume, and Ambari, along with IBM's value-added services such as BigR, BigSheets, and Big SQL, to name a few.

R users:


The users of R typically install base R and RStudio on their workstations. They also download the BigR package (loaded in R as the bigr library), which is used to push processing and computations down into the Hadoop cluster. The BigR package contains built-in implementations of many standard algorithms and data transformation functions. When the user loads the bigr library in their R environment, it automatically loads the rJava and RODBC packages. rJava gives the R process a JVM, which is used to submit jobs to the Hadoop cluster, and JAQL is used to run the R requests on the BigInsights cluster.
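To make this concrete, here is a minimal sketch of how a client-side session might begin, based on IBM's BigR tutorials. The host name and credentials are placeholders, and argument names can vary between BigR releases, so treat this as illustrative rather than canonical.

  # Load the BigR client package; rJava and RODBC are pulled in automatically
  library(bigr)

  # Connect to the BigInsights cluster (host and credentials are placeholders)
  bigr.connect(host = "bigdata.example.com", user = "bigr", password = "bigr")

  # Confirm that the session is attached to the cluster
  is.bigr.connected()

Once connected, subsequent BigR calls in the session are routed through this connection to the cluster.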

Management Node:

The management node of the BigInsights cluster runs the BigInsights BigR client and the BigR connector value-added services. This node receives the R jobs submitted by the clients and dispatches them to the data nodes for execution. The HBase client, HBase master, Hive metastore, and Hive client services running on the management node enable the R programs to work on data sets stored in native HDFS file systems as well as in Hive and HBase data stores.

Data Processing Nodes:

The BigInsights BigR client, native R, data node, HBase client, region server, and Hive client processes running on these servers enable the infrastructure to run R programs on data stored in HDFS file systems, as well as to access Hive and HBase data stores.
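As a sketch of what this looks like from the R prompt, a bigr.frame is declared over a file sitting in HDFS and then manipulated with familiar R syntax. The file path, column names, and column types below are hypothetical.

  # Declare a bigr.frame over a comma-delimited file in HDFS
  # (path and schema are hypothetical)
  air <- bigr.frame(dataSource = "DEL",
                    dataPath   = "/user/bigr/airline_demo.csv",
                    header     = TRUE,
                    coltypes   = c("character", "integer", "integer"))

  # These look like ordinary R calls, but they execute on the cluster
  dim(air)
  summary(air)

  # Row filtering and column projection are pushed down as well
  longhaul <- air[air$Distance > 1000, c("UniqueCarrier", "Distance")]
  head(longhaul)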

The diagram below provides a snapshot of some of the R functions and processes that achieve total pushdown of processing to the Hadoop environment through BigR.

[Figure: Snapshot of BigR functions and processes that push R processing down to the Hadoop cluster]
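As one concrete example of such pushdown, BigR's apply-style functions (such as groupApply) partition a data set by key and run an arbitrary R function on each partition in parallel across the data nodes. The sketch below fits one regression model per airline carrier on the hypothetical bigr.frame from the earlier example; the exact groupApply arguments and return type may differ between BigR releases.

  # Run an ordinary R function on each partition of the data,
  # in parallel across the cluster (illustrative sketch)
  models <- groupApply(data = air,
                       groupingColumns = list(air$UniqueCarrier),
                       rfunction = function(df) {
                         # df arrives as a plain data.frame on a data node
                         lm(ArrDelay ~ Distance, data = df)
                       })
  # 'models' is a handle to the per-carrier results stored on the cluster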

Advantages of using BigR:

  1. BigR provides built-in algorithms that push R processing down to distributed Hadoop environments. In addition, you can develop custom R packages and programs to handle unique requirements.
  2. BigR makes your R programs truly scalable: the same programs can run on data sets of any size, and no reprogramming is needed to handle larger data sets.
  3. You can perform data engineering, data standardization, visualization, model creation, and prediction through a single R interface.
  4. The BigR technology is backed by many years of IBM research.
  5. IBM uses a cost-based optimizer to generate optimal access plans for the MapReduce jobs, based on data set size, data characteristics, and Hadoop configuration.
  6. If your shop already has data scientists using R, you will be delighted to see how easily existing R programs can be made cluster aware and scalable by leveraging BigR, as the sketch after this list illustrates.
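To illustrate that last point, here is a hedged sketch of the "same code, two scales" idea: because bigr overloads standard R operators, the same lines can run against a small local data.frame or a cluster-resident bigr.frame. The file paths and column names are hypothetical.

  # 'flights' as a local data.frame: runs in workstation memory
  flights <- read.csv("flights_sample.csv")
  delayed <- flights[flights$ArrDelay > 60, c("UniqueCarrier", "ArrDelay")]
  head(delayed)

  # 'flights' as a bigr.frame over the full data set in HDFS: the same
  # two lines now execute as jobs on the Hadoop cluster
  flights <- bigr.frame(dataSource = "DEL",
                        dataPath = "/user/bigr/flights_full.csv",
                        header = TRUE)
  delayed <- flights[flights$ArrDelay > 60, c("UniqueCarrier", "ArrDelay")]
  head(delayed)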