Predictive Analytics – Multi Variate Linear Regression with R and Netezza

Linear Regression (Multi Variate)

In this blog, we will look at an example of Multi Variate Linear Regression running on R and Netezza.

What is Linear regression
It is a method of investigating functional relationship between variables. It tries to estimate the value of dependent variables from the values of independent variables using a linear equation. Regression analysis is typically used when dependent and independent variables are continuous and have some co relation.
Example of a Simple Linear Equation

Linear_regression_pic1

Y = ãX + ß

The above plot shows a simple linear equation where we only have one variable X, which we are using to find the value of Y.

ã is called the slope which is Y/X

ß is the intercept which is the value of Y when X=0

In a multi variate linear regression, we have multiple independent variables x1, x2…xn which we use to calculate the value of variable Y. It can be expressed in the form

Y = x1ã1 +x2ã2 +x3ã3…….x1ãn + ß

As you can see here, different predictors (x1, x2 etc) can have different levels of impact (indicated by the values of the slopes) on the dependent variable Y.

In this case ã1, ã2, ã3, ãn are the slopes

ß is the intercept

Fitting a line in Linear regression

Given a scatter plot of Y vs X, fit a straight line through the points so that the sum of square of vertical distances between the points and the line is minimized.

A linear regression algorithm will try to fit a line that will give the least residuals. Residuals is the sum of square of vertical distances between the points.

If you look at the graph above, the closer the point (red dots) to the blue line, better is the fit and hence better is the accuracy of the linear regression algorithm.

Goodness of fit – R-squared is a measure which tells us how close the data is to the fitted line. It goes from values 0 to 1. The higher the value, the better is the fit.

Using Linear Regression for Machine Learning

You feed the algorithm the data to train the system. This is call training. In training you will have the values of both the predictors and target variables. The algorithm will then create a linear equation to model the behavior of the training data. Then you will feed the model the testing data which will only have the predictors. The model with then predict the target variable values based on the linear equation it developed. The algorithm will also provide metrics that can be used to calculate the accuracy of the equation, in this case R-squared values.

Advantages of linear regression

  1. Fast computations
  2. Good accuracy for continuous variables
  3. Very good for linear relationships

Disadvantages

  1. Highly sensitive to outliers
  2. Cannot model nonlinear relationships
  3. Only numeric and continuous variables can be used

 

Problem statement:

The input data set contains data about details of various car models. Based on the information provided, the goal is to come up with a model to predict Miles-per-gallon of a given model.

The open source data can be downloaded from

https://archive.ics.uci.edu/ml/datasets/Auto+MPG

Techniques used:

  1. Linear Regression – Multi Variate
  2. Data Imputation

Data Engineering and Analysis:

Setting the R environment

A working directory is set on the R studio

 setwd(“C:/Users/abhik/Documents/OpenR on Netezza/Netezza R case study”)

> getwd()

[1] “C:/Users/abhik/Documents/OpenR on Netezza/Netezza R case study”

Load the Netezza specific R  libraries in the R studio

 ibrary(nzr)

Loading required package: RODBC

Warning messages:

1: package ‘RODBC’ was built under R version 3.2.2

2: package ‘nzr’ was built under R version 3.1.0

> library(nza)

A connection is established to the Netezza database called ‘dbatest’ from the R studio

nzConnectDSN(‘DBATEST’, force = TRUE, verbose = TRUE)

On the spus: R 3.2.0

On the host: R 3.2.0

To disconnect use

nzDisconnect()

Test Data

The below table was created in Netezza  database ‘dbatest’ to hold the input data

———————————————————————-

create table auto_miles_per_gallon

(

Id integer

mgp double,

cylinders double,

displacement double,

horsepower double,

weight double,

acceleration double,

modelyear double,

name varchar(200)

——————————————————————–

 A quick look at the test data

 DBATEST.ADMIN(ADMIN)=> select * from auto_miles_per_gallon where horsepower is NULL;

ID  | MGP  | CYLINDERS | DISPLACEMENT | HORSEPOWER | WEIGHT | ACCELERATION | MODELYEAR |         NAME

—–+——+———–+————–+————+——–+————–+———–+———————-

127 |   21 |             6 |                         200                  |            |                 2875 |           17 |                    74 | ford maverick

337 | 23.6 |         4 |                           140                   |            |                  2905 |         14.3 |                 80 | ford mustang cobra

33 |   25 |              4 |                           98                   |            |                 2046 |           19 |                     71 | ford pinto

331 | 40.9 |         4   |                           85                   |            |                1835 |         17.3 |                   80 | renault lecar deluxe

355 | 34.5 |         4 |                          100                    |            |                  2320 |         15.8 |                  81 | renault 18i

375 |   23 |           4 |                           151                  |            |                  3035 |         20.5 |                   82 | amc concord dl

————————————————————————————————————————-

As we can see above, there are null values in the test data for the HORSEPOWER column, which are the equivalent of ‘NA’ s in R and hence must be imputed with some values.

We create a Netezza and a local R dataframe to map the test table and start looking at the data from the R perspective. The below commands are run from the R studio

> nz_auto_miles_per_gallon = nz.data.frame(“auto_miles_per_gallon”)

> reg_df <- as.data.frame(nz_auto_miles_per_gallon)

A look at the nz_auto_miles_per_gallon data frame

> nz_auto_miles_per_gallon

SELECT “ID”,”MGP”,”CYLINDERS”,”DISPLACEMENT”,”HORSEPOWER”,”WEIGHT”,”ACCELERATION”,”MODELYEAR”,”NAME” FROM AUTO_MILES_PER_GALLON_ID

> summary(nz_auto_miles_per_gallon)

Length         Class          Mode

9 nz.data.frame            S4

A look at the reg_df  dataframe which is an R dataframe which shows the quadrille information of the data.

> summary(reg_df)

ID             MGP          CYLINDERS      DISPLACEMENT     HORSEPOWER

Min.   :  1.0   Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0

1st Qu.:100.2   1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   1st Qu.: 75.0

Median :199.5   Median :23.00   Median :4.000   Median :148.5   Median : 93.5

Mean   :199.5   Mean   :23.51   Mean   :5.455   Mean   :193.4   Mean   :104.5

3rd Qu.:298.8   3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0

Max.   :398.0   Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0

NA’s   :6

WEIGHT      ACCELERATION     MODELYEAR         NAME

Min.   :1613   Min.   : 8.00   Min.   :70.00   Length:398

1st Qu.:2224   1st Qu.:13.82   1st Qu.:73.00   Class :character

Median :2804   Median :15.50   Median :76.00   Mode  :character

Mean   :2970   Mean   :15.57   Mean   :76.01

3rd Qu.:3608   3rd Qu.:17.18   3rd Qu.:79.00

Max.   :5140   Max.   :24.80   Max.   :82.00

 

> str(reg_df)

‘data.frame’:                   398 obs. of  9 variables:

$ ID          : int  1 2 3 4 5 6 7 8 9 10 …

$ MGP         : num  18 15 18 16 17 15 14 14 14 15 …

$ CYLINDERS   : num  8 8 8 8 8 8 8 8 8 8 …

$ DISPLACEMENT: num  307 350 318 304 302 429 454 440 455 390 …

$ HORSEPOWER  : num  130 165 150 150 140 198 220 215 225 190 …

$ WEIGHT      : num  3504 3693 3436 3433 3449 …

$ ACCELERATION: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 …

$ MODELYEAR   : num  70 70 70 70 70 70 70 70 70 70 …

$ NAME        : chr  “chevrolet chevelle malibu” “buick skylark 320” “plymouth satellite” “amc rebel sst” …

> unlist(lapply(reg_df, function(x) any(is.na(x))))

ID          MGP    CYLINDERS DISPLACEMENT   HORSEPOWER       WEIGHT

FALSE        FALSE        FALSE        FALSE         TRUE        FALSE

ACCELERATION    MODELYEAR         NAME

FALSE        FALSE        FALSE

As we can see above, the Horsepower column has “NA” or “NULL” values and we will use Netezza Analytic’s inbuilt data imputation function to impute mean values to the missing data. The  data imputation process is pushed down to the Netezza SPUs so we obtain  parallel and in database processing  here. So this process can be scaled out to massive data sets without having to pull the dataset out of Netezza database to perform imputations outside the database engine.

 

> t=nzQuery(“EXECUTE NZA..IMPUTE_DATA(‘intable=auto_miles_per_gallon, method=mean, outtable=auto_miles_per_gallon_2, inColumn=horsepower’)”)

> head(t)

IMPUTE_DATA

1           1

The above method will impute the missing values in the horsepower column with mean values so regression analysis can be performed. A new table called auto_miles_per_gallon_2 is created and the data from the original table along with the imputed data is loaded into it by Netezza.

nz_auto_miles_per_gallon_2 = nz.data.frame(“auto_miles_per_gallon_2”)

reg_df_2 <- as.data.frame(nz_auto_miles_per_gallon_2)

unlist(lapply(reg_df_2, function(x) any(is.na(x))))

ID  MGP    CYLINDERS DISPLACEMENT   HORSEPOWER       WEIGHT ACCELERATION

FALSE  FALSE        FALSE        FALSE        FALSE        FALSE        FALSE

————————————————————————————————————–

We can see above that now there are no more NULL values in the horsepower column, so we can continue to the next step of building the Linear Regression Model.

We will take a quick graphical look at the data. This also shows how easily we can analyze and transform data in the Netezza database and perform instant data visualizations from the R studio console.

library(ggplot2)

ggplot(reg_df_2, aes(factor(CYLINDERS), MGP)) +geom_boxplot( aes(fill=factor(CYLINDERS)))

linear_regression_pic2

Understanding co relations:

Here we are using Netezza’s NZAnalytics co relation function to find the  pearson’s co relation co efficient  between cylinders and weight. This example shows show we can utilize Netezza’s built in Analytic functions

> t=nzQuery (“SELECT nza..CORR_AGG(cylinders,weight) from auto_miles_per_gallon_2”)

> t

CORR_AGG

1 0.8960168

—————————————————————————————————–

Using R psyc  package to find Pearson’s co relation co efficients

library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:ggplot2’:

Warning message:

package ‘psych’ was built under R version 3.2.2
pairs.panels(reg_df_2)

linear_regression_pic3

Modeling and Prediction:

Generalized Linear Model in Netezza (an alternative to the lm_model CRAN package)

Please note: In order for the generalized model to work, the input table must have a unique identifier column. Make sure to add one, if your table does not have it.

The below command is used to generate Linear Regression Model

glmfit <-nzGlm(MGP ~ CYLINDERS+DISPLACEMENT+HORSEPOWER+WEIGHT+ACCELERATION+MODELYEAR, nz_auto_miles_per_gallon_id_2 , id=”ID”, family=”gaussian”, link=”identity”, method=’irls’)

Taking a look at the generated model

————————————————————————————————

print(glmfit)
Model Name
AUTO_MILES_PER_GALLON_ID_2_MODEL92860
Call:nzGlm(form = MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT +
ACCELERATION + MODELYEAR, data = nz_auto_miles_per_gallon_id_2,
id = “ID”, family = “gaussian”, link = “identity”, method = “irls”)
Coefficients:
INTERCEPT ACCELERATION CYLINDERS DISPLACEMENT HORSEPOWER MODELYEAR
0.005629164 -0.050536985 0.014586362 0.002332460 -0.017306055 0.597021079
WEIGHT
-0.006660499
Residuals Summary:
Pearson: RSS: 4786.2168380755 df: 391 p-value: 1
Deviance: RSS: 4786.2168380755 df: 391 p-value: 1

——————————————————————————————————————-

summary(glmfit)

Call:nzGlm(form = MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT +

ACCELERATION + MODELYEAR, data = nz_auto_miles_per_gallon_id_2,

id = “ID”, family = “gaussian”, link = “identity”, method = “irls”)

GLM coefficients for model: “AUTO_MILES_PER_GALLON_ID_2_MODEL92860”

| Parameter           | Beta                      | Std Error              | Test                      | p-value |

| INTERCEPT         | 0.005629            | 0.000669            | 8.417371            | 0           |

| ACCELERATION  | -0.050537           | 0.090939            | -0.555723           | 0.5784 |

| CYLINDERS          | 0.014586            | 0.000797            | 18.297005          | 0           |

| DISPLACEMENT | 0.002332            | 0.005524            | 0.42227               | 0.672828            |

| HORSEPOWER    | -0.017306           | 0.011938            | -1.449698           | 0.147143            |

| MODELYEAR       | 0.597021            | 0.022447            | 26.597039          | 0           |

| WEIGHT               | -0.00666             | 0.000657            | -10.138912         | 0           |

Residuals Summary:

| Residual Type     | RSS                       | df | p-value          |

| Pearson                |4786.2168380755|391|1|

| Deviance              |4786.2168380755|391|1|

—————————————————————————————————————

resid < – residuals(glmfit)

head(resid)

1          2          3          4          5          6

1.5647598  0.3037419  1.3817729 -0.5550171  0.3073497  4.9307754

—————————————————————————————————–

> fit <-fitted(glmfit)

> head(fit)

1        2        3        4        5        6

16.43524 14.69626 16.61823 16.55502 16.69265 10.06922

 ———————————————————————————————–

Note: The metadata of the models generated by calling NZ Analytics packages are stored in the Netezza database. There are various utilities for extensive model management like copying the model to a file, running a prediction from a saved model etc. The details of these are beyond the scope of this document. However, it should be noted that being able to store the models in the Netezza database gives a centralized repository, which is an added bonus.

Prediction:

Now that we developed the regression model, we will predict the values of MPG using the same sample data.

————————————————————————————————————————

> pred = predict(glmfit, nz_auto_miles_per_gallon_id_2, “ID”)

> head(pred)

ID     PRED

145 31.70386

385 34.19114

87 16.77949

327 30.45490

144 27.06798

384 34.22652

——————————————————————————————————————

Here are how the actual values look

145        31           4             76           52           1649     16.5       74           toyota corona

385        32           4             91           67           1965     15.7       82           honda civic (auto)

87           14           8             304        150        3672     11.5       73           amc matador

327        43.4       4             90           48           2335     23.7       80           vw dasher (diesel)

144        26           4             97           78           2300     14.5       74           opel manta

384        38           4             91           67           1965     15           82           honda civic

————————————————————————————————————————

Conclusions:

  1. NZ Analytics built in regression algorithm provides comparable model accuracy compared to other popular R packages.
  2. We were able to push down data imputation (a part of data engineering) and model creation and prediction process down to the Netezza database making this a truly scalable process.

 

Author – Abhik Roy

 

 

 

 

16 thoughts on “Predictive Analytics – Multi Variate Linear Regression with R and Netezza

  1. I’ve been exploring for a little bit for any high-quality articles or blog posts on this sort of area . Exploring in Yahoo I at last stumbled upon this web site. Reading this info So i am happy to convey that I have an incredibly good uncanny feeling I discovered just what I needed. I most certainly will make sure to do not forget this website and give it a look on a constant basis.

  2. When someone writes an article he/she maintains the thought of a user within his/her brain that the way a user can understand it.
    So that’s why this article is perfect. Thanks!

  3. Keep up the great piece of work, I read few posts on this site and I conceive that your web site is very interesting and has bands of fantastic information.

  4. It?¦s actually a nice and helpful piece of info. I am satisfied that you just shared this useful information with us. Please stay us up to date like this. Thank you for sharing.

  5. Howdy! This post could not be written any better! Reading through this post
    reminds me of my good old room mate! He always kept talking about this.
    I will forward this write-up to him. Pretty sure he will have a good read.
    Thank you for sharing!

  6. Hello I am so excited I found your blog page, I really found
    you by mistake, while I was searching on Digg for something else, Anyways I am
    here now and would just like to say kudos for a marvelous post and a all round thrilling blog (I also love the theme/design), I don’t have time to
    look over it all at the minute but I have bookmarked it
    and also added your RSS feeds, so when I have time I will be back to read more,
    Please do keep up the great job.

  7. This is the right site for anybody who wishes to understand this topic.
    You realize so much its almost hard to argue with
    you (not that I personally will need toHaHa).
    You certainly put a brand new spin on a subject that has been written about for decades.
    Wonderful stuff, just excellent!

Leave a Reply

Your email address will not be published. Required fields are marked *