**Linear Regression (Multi Variate)**

In this blog, we will look at an example of Multi Variate Linear Regression running on R and Netezza.

**What is Linear regression**

It is a method of investigating functional relationship between variables. It tries to estimate the value of dependent variables from the values of independent variables using a linear equation. Regression analysis is typically used when dependent and independent variables are continuous and have some co relation.

Example of a Simple Linear Equation

Y = ãX + ß

The above plot shows a simple linear equation where we only have one variable X, which we are using to find the value of Y.

ã is called the slope which is Y/X

ß is the intercept which is the value of Y when X=0

In a multi variate linear regression, we have multiple independent variables x1, x2…xn which we use to calculate the value of variable Y. It can be expressed in the form

Y = x1ã1 +x2ã2 +x3ã3…….x1ãn + ß

As you can see here, different predictors (x1, x2 etc) can have different levels of impact (indicated by the values of the slopes) on the dependent variable Y.

In this case ã1, ã2, ã3, ãn are the slopes

ß is the intercept

**Fitting a line in Linear regression**

Given a scatter plot of Y vs X, fit a straight line through the points so that the sum of square of vertical distances between the points and the line is minimized.

A linear regression algorithm will try to fit a line that will give the least residuals. Residuals is the sum of square of vertical distances between the points.

If you look at the graph above, the closer the point (red dots) to the blue line, better is the fit and hence better is the accuracy of the linear regression algorithm.

Goodness of fit – R-squared is a measure which tells us how close the data is to the fitted line. It goes from values 0 to 1. The higher the value, the better is the fit.

**Using Linear Regression for Machine Learning**

You feed the algorithm the data to train the system. This is call training. In training you will have the values of both the predictors and target variables. The algorithm will then create a linear equation to model the behavior of the training data. Then you will feed the model the testing data which will only have the predictors. The model with then predict the target variable values based on the linear equation it developed. The algorithm will also provide metrics that can be used to calculate the accuracy of the equation, in this case R-squared values.

**Advantages of linear regression**

- Fast computations
- Good accuracy for continuous variables
- Very good for linear relationships

**Disadvantages**

- Highly sensitive to outliers
- Cannot model nonlinear relationships
- Only numeric and continuous variables can be used

** **

**Problem statement:**

The input data set contains data about details of various car models. Based on the information provided, the goal is to come up with a model to predict Miles-per-gallon of a given model.

The open source data can be downloaded from

https://archive.ics.uci.edu/ml/datasets/Auto+MPG

**Techniques used:**

- Linear Regression – Multi Variate
- Data Imputation

**Data Engineering and Analysis:**

Setting the R environment

A working directory is set on the R studio

** ****setwd(“C:/Users/abhik/Documents/OpenR on Netezza/Netezza R case study”)**

> **getwd()**

[1] “C:/Users/abhik/Documents/OpenR on Netezza/Netezza R case study”

**Load the Netezza specific R libraries in the R studio**

** ****ibrary(nzr)**

Loading required package: RODBC

Warning messages:

1: package ‘RODBC’ was built under R version 3.2.2

2: package ‘nzr’ was built under R version 3.1.0

**> library(nza)**

A connection is established to the Netezza database called ‘dbatest’ from the R studio

**nzConnectDSN(‘DBATEST’, force = TRUE, verbose = TRUE)**

On the spus: R 3.2.0

On the host: R 3.2.0

To disconnect use

**nzDisconnect()**

**Test Data**

The below table was created in Netezza database ‘dbatest’ to hold the input data

———————————————————————-

create table auto_miles_per_gallon

(

Id integer

mgp double,

cylinders double,

displacement double,

horsepower double,

weight double,

acceleration double,

modelyear double,

name varchar(200)

——————————————————————–

** A quick look at the test data**

** ****DBATEST.ADMIN(ADMIN)=> select * from auto_miles_per_gallon where horsepower is NULL;**

ID | MGP | CYLINDERS | DISPLACEMENT | HORSEPOWER | WEIGHT | ACCELERATION | MODELYEAR | NAME

—–+——+———–+————–+————+——–+————–+———–+———————-

127 | 21 | 6 | 200 | | 2875 | 17 | 74 | ford maverick

337 | 23.6 | 4 | 140 | | 2905 | 14.3 | 80 | ford mustang cobra

33 | 25 | 4 | 98 | | 2046 | 19 | 71 | ford pinto

331 | 40.9 | 4 | 85 | | 1835 | 17.3 | 80 | renault lecar deluxe

355 | 34.5 | 4 | 100 | | 2320 | 15.8 | 81 | renault 18i

375 | 23 | 4 | 151 | | 3035 | 20.5 | 82 | amc concord dl

————————————————————————————————————————-

As we can see above, there are null values in the test data for the HORSEPOWER column, which are the equivalent of ‘NA’ s in R and hence must be imputed with some values.

We create a Netezza and a local R dataframe to map the test table and start looking at the data from the R perspective. The below commands are run from the R studio

**> nz_auto_miles_per_gallon = nz.data.frame(“auto_miles_per_gallon”)**

**> reg_df <- as.data.frame(nz_auto_miles_per_gallon)**

A look at the nz_auto_miles_per_gallon data frame

**> nz_auto_miles_per_gallon**

SELECT “ID”,”MGP”,”CYLINDERS”,”DISPLACEMENT”,”HORSEPOWER”,”WEIGHT”,”ACCELERATION”,”MODELYEAR”,”NAME” FROM AUTO_MILES_PER_GALLON_ID

**> summary(nz_auto_miles_per_gallon)**

Length Class Mode

9 nz.data.frame S4

A look at the reg_df dataframe which is an R dataframe which shows the quadrille information of the data.

**> ****summary(reg_df)**

ID MGP CYLINDERS DISPLACEMENT HORSEPOWER

Min. : 1.0 Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0

1st Qu.:100.2 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2 1st Qu.: 75.0

Median :199.5 Median :23.00 Median :4.000 Median :148.5 Median : 93.5

Mean :199.5 Mean :23.51 Mean :5.455 Mean :193.4 Mean :104.5

3rd Qu.:298.8 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0 3rd Qu.:126.0

Max. :398.0 Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0

NA’s :6

WEIGHT ACCELERATION MODELYEAR NAME

Min. :1613 Min. : 8.00 Min. :70.00 Length:398

1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00 Class :character

Median :2804 Median :15.50 Median :76.00 Mode :character

Mean :2970 Mean :15.57 Mean :76.01

3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00

Max. :5140 Max. :24.80 Max. :82.00

> **str(reg_df)**

‘data.frame’: 398 obs. of 9 variables:

$ ID : int 1 2 3 4 5 6 7 8 9 10 …

$ MGP : num 18 15 18 16 17 15 14 14 14 15 …

$ CYLINDERS : num 8 8 8 8 8 8 8 8 8 8 …

$ DISPLACEMENT: num 307 350 318 304 302 429 454 440 455 390 …

$ HORSEPOWER : num 130 165 150 150 140 198 220 215 225 190 …

$ WEIGHT : num 3504 3693 3436 3433 3449 …

$ ACCELERATION: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 …

$ MODELYEAR : num 70 70 70 70 70 70 70 70 70 70 …

$ NAME : chr “chevrolet chevelle malibu” “buick skylark 320” “plymouth satellite” “amc rebel sst” …

> **unlist(lapply(reg_df, function(x) any(is.na(x)))) **

ID MGP CYLINDERS DISPLACEMENT HORSEPOWER WEIGHT

FALSE FALSE FALSE FALSE **TRUE ** FALSE

ACCELERATION MODELYEAR NAME

FALSE FALSE FALSE

As we can see above, the Horsepower column has “NA” or “NULL” values and we will use **Netezza Analytic’s inbuilt data imputation function** to impute mean values to the missing data. The data imputation process is pushed down to the Netezza SPUs so we obtain parallel and in database processing here. So this process can be scaled out to massive data sets without having to pull the dataset out of Netezza database to perform imputations outside the database engine.

** **

**> t=nzQuery(“EXECUTE NZA..IMPUTE_DATA(‘intable=auto_miles_per_gallon, method=mean,** **outtable=auto_miles_per_gallon_2, inColumn=horsepower’)”)**

**> head(t)**

IMPUTE_DATA

1 1

The above method will impute the missing values in the horsepower column with mean values so regression analysis can be performed. A new table called auto_miles_per_gallon_2 is created and the data from the original table along with the imputed data is loaded into it by Netezza.

**nz_auto_miles_per_gallon_2 = nz.data.frame(“auto_miles_per_gallon_2”)**

**reg_df_2 <- as.data.frame(nz_auto_miles_per_gallon_2)**

**unlist(lapply(reg_df_2, function(x) any(is.na(x)))) **

ID MGP CYLINDERS DISPLACEMENT HORSEPOWER WEIGHT ACCELERATION

FALSE FALSE FALSE FALSE FALSE FALSE FALSE

————————————————————————————————————–

We can see above that now there are no more NULL values in the horsepower column, so we can continue to the next step of building the Linear Regression Model.

We will take a quick graphical look at the data. This also shows how easily we can analyze and transform data in the Netezza database and perform instant data visualizations from the R studio console.

**library****(ggplot2)**

**ggplot****(****reg_df_2****, ****aes****(****factor****(CYLINDERS), MGP)) +****geom_boxplot****( ****aes****(fill=****factor****(CYLINDERS)))**

**Understanding co relations:**

Here we are using **Netezza’s NZAnalytics co relation function** to find the pearson’s co relation co efficient between cylinders and weight. This example shows show we can utilize Netezza’s built in Analytic functions

**> t=nzQuery (“SELECT nza..CORR_AGG(cylinders,weight) from auto_miles_per_gallon_2”)**

**> t**

CORR_AGG

1 0.8960168

—————————————————————————————————–

**Using R psyc package to find Pearson’s co relation co efficients**

library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:ggplot2’:

Warning message:

package ‘psych’ was built under R version 3.2.2

**pairs.panels(reg_df_2)**

**Modeling and Prediction:**

Generalized Linear Model in Netezza (an alternative to the lm_model CRAN package)

Please note: In order for the generalized model to work, the input table must have a unique identifier column. Make sure to add one, if your table does not have it.

The below command is used to generate Linear Regression Model

**glmfit <-nzGlm(MGP ~ CYLINDERS+DISPLACEMENT+HORSEPOWER+WEIGHT+ACCELERATION+MODELYEAR, nz_auto_miles_per_gallon_id_2 , id=”ID”, family=”gaussian”, link=”identity”, method=’irls’)**

Taking a look at the generated model

————————————————————————————————

**print(glmfit)**

Model Name

AUTO_MILES_PER_GALLON_ID_2_MODEL92860

Call:nzGlm(form = MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT +

ACCELERATION + MODELYEAR, data = nz_auto_miles_per_gallon_id_2,

id = “ID”, family = “gaussian”, link = “identity”, method = “irls”)

Coefficients:

INTERCEPT ACCELERATION CYLINDERS DISPLACEMENT HORSEPOWER MODELYEAR

0.005629164 -0.050536985 0.014586362 0.002332460 -0.017306055 0.597021079

WEIGHT

-0.006660499

Residuals Summary:

Pearson: RSS: 4786.2168380755 df: 391 p-value: 1

Deviance: RSS: 4786.2168380755 df: 391 p-value: 1

——————————————————————————————————————-

**summary(glmfit)**

Call:nzGlm(form = MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT +

ACCELERATION + MODELYEAR, data = nz_auto_miles_per_gallon_id_2,

id = “ID”, family = “gaussian”, link = “identity”, method = “irls”)

GLM coefficients for model: “AUTO_MILES_PER_GALLON_ID_2_MODEL92860”

| Parameter | Beta | Std Error | Test | p-value |

| INTERCEPT | 0.005629 | 0.000669 | 8.417371 | 0 |

| ACCELERATION | -0.050537 | 0.090939 | -0.555723 | 0.5784 |

| CYLINDERS | 0.014586 | 0.000797 | 18.297005 | 0 |

| DISPLACEMENT | 0.002332 | 0.005524 | 0.42227 | 0.672828 |

| HORSEPOWER | -0.017306 | 0.011938 | -1.449698 | 0.147143 |

| MODELYEAR | 0.597021 | 0.022447 | 26.597039 | 0 |

| WEIGHT | -0.00666 | 0.000657 | -10.138912 | 0 |

Residuals Summary:

| Residual Type | RSS | df | p-value |

| Pearson |4786.2168380755|391|1|

| Deviance |4786.2168380755|391|1|

—————————————————————————————————————

**resid < – residuals(glmfit)**

**head(resid)**

1 2 3 4 5 6

1.5647598 0.3037419 1.3817729 -0.5550171 0.3073497 4.9307754

—————————————————————————————————–

**> fit <-fitted(glmfit)**

**> head(fit)**

1 2 3 4 5 6

16.43524 14.69626 16.61823 16.55502 16.69265 10.06922

** **———————————————————————————————–

Note: The metadata of the models generated by calling NZ Analytics packages are stored in the Netezza database. There are various utilities for extensive model management like copying the model to a file, running a prediction from a saved model etc. The details of these are beyond the scope of this document. However, it should be noted that being able to store the models in the Netezza database gives a centralized repository, which is an added bonus.

**Prediction:**

Now that we developed the regression model, we will predict the values of MPG using the same sample data.

————————————————————————————————————————

**> pred = predict(glmfit, nz_auto_miles_per_gallon_id_2, “ID”)**

**> head(pred) **

ID PRED

145 31.70386

385 34.19114

87 16.77949

327 30.45490

144 27.06798

384 34.22652

——————————————————————————————————————

**Here are how the actual values look**

145 31 4 76 52 1649 16.5 74 toyota corona

385 32 4 91 67 1965 15.7 82 honda civic (auto)

87 14 8 304 150 3672 11.5 73 amc matador

327 43.4 4 90 48 2335 23.7 80 vw dasher (diesel)

144 26 4 97 78 2300 14.5 74 opel manta

384 38 4 91 67 1965 15 82 honda civic

————————————————————————————————————————

**Conclusions:**

- NZ Analytics built in regression algorithm provides comparable model accuracy compared to other popular R packages.
- We were able to push down data imputation (a part of data engineering) and model creation and prediction process down to the Netezza database making this a truly scalable process.

Author – Abhik Roy

I’ve been exploring for a little bit for any high-quality articles or blog posts on this sort of area . Exploring in Yahoo I at last stumbled upon this web site. Reading this info So i am happy to convey that I have an incredibly good uncanny feeling I discovered just what I needed. I most certainly will make sure to do not forget this website and give it a look on a constant basis.

When someone writes an article he/she maintains the thought of a user within his/her brain that the way a user can understand it.

So that’s why this article is perfect. Thanks!

Fantastic site. Plenty of helpful information here. I am sending it to several pals ans additionally sharing in delicious. And certainly, thank you for your sweat!

Keep up the great piece of work, I read few posts on this site and I conceive that your web site is very interesting and has bands of fantastic information.

I dugg some of you post as I cerebrated they were very beneficial very helpful

Awsome blog! I am loving it!! Will be back later to read some more. I am taking your feeds also.

It?¦s actually a nice and helpful piece of info. I am satisfied that you just shared this useful information with us. Please stay us up to date like this. Thank you for sharing.

Its excellent as your other articles : D, appreciate it for posting. “Say not, ‘I have found the truth,’ but rather, ‘I have found a truth.'” by Kahlil Gibran.

Howdy! This post could not be written any better! Reading through this post

reminds me of my good old room mate! He always kept talking about this.

I will forward this write-up to him. Pretty sure he will have a good read.

Thank you for sharing!

Hello I am so excited I found your blog page, I really found

you by mistake, while I was searching on Digg for something else, Anyways I am

here now and would just like to say kudos for a marvelous post and a all round thrilling blog (I also love the theme/design), I don’t have time to

look over it all at the minute but I have bookmarked it

and also added your RSS feeds, so when I have time I will be back to read more,

Please do keep up the great job.

This is the right site for anybody who wishes to understand this topic.

You realize so much its almost hard to argue with

you (not that I personally will need toHaHa).

You certainly put a brand new spin on a subject that has been written about for decades.

Wonderful stuff, just excellent!

Thanks for th good info!

Hurrah, that’s the things i was exploring for, what a stuff!

existing here at this website, thanks admin on this website.

Wow cuz this is great work! Congrats and keep it up

Hi there colleagues, its great article regarding educationand fully explained, keep

it up all the time.

Hello! I just would like to give a huge thumbs up for the great info you have here on this post. I will be coming back to your blog for more soon.