Predictive Analytics – Decision Trees using OpenR and IBM Netezza

Decision Trees

Decision tree learning uses a decision tree as a predictive model which maps observations about an item to conclusions about the item’s target value. It is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a finite set of values are called classification trees. In these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.

Decision trees are among the simplest machine learning techniques to apply and interpret. Predictor variables are used to build a tree that progressively narrows down the predicted value of the target variable.

A decision tree has the following components:

  • Root node – starts the decision-making process
  • Branch nodes – refine the decision-making process
  • Leaf nodes – provide the final decisions

One of the challenges in the algorithm is finding the sequence of variables used to split the data when building the model.

However, most ML implementations of decision trees do a good job of choosing this sequence automatically, typically by picking, at each node, the split that most improves a purity measure such as information gain or the Gini index.
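To make the idea of scoring a split concrete, here is a small illustrative sketch, in plain Python with made-up toy data (the function names and data are ours, not from any library), of how the Gini impurity of a candidate split on a numeric variable can be computed:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Weighted Gini impurity after splitting a numeric column at `threshold`."""
    left = [lab for x, lab in zip(values, labels) if x < threshold]
    right = [lab for x, lab in zip(values, labels) if x >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy data: income class by age; a split at age 27 separates the classes well.
ages = [20, 22, 25, 30, 45, 52, 60, 33]
income = ["small", "small", "small", "large", "large", "large", "large", "small"]

print(round(gini(income), 3))                   # → 0.5 (impurity before splitting)
print(round(split_gini(ages, income, 27), 3))   # → 0.2 (lower impurity after the split)
```

A tree-building algorithm evaluates many candidate thresholds like this and keeps the one that lowers impurity the most.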

Advantages of Decision Trees

  • Easy to explain
  • Works with missing values
  • Can capture local variations in the data
  • Model creation and prediction require few compute resources

Disadvantages of Decision Trees

  • Limited accuracy
  • Cannot handle a large number of predictors well
  • Bias can build up quickly in the models

Problem Statement

In this example, we will look at the ‘adult’ data set and create a decision tree to predict the variable INCOME from the variables AGE, SEX, and HOURS_PER_WEEK. The data is stored in a table named adult in the Netezza database DBATEST.

The adult data set can be downloaded from

https://archive.ics.uci.edu/ml/datasets/Adult

Data Engineering and Analysis

Create the required Netezza and R data frames.

————————————————————————————

# Connect to the Netezza DSN and expose the adult table to R
nzConnectDSN("DBATEST", force = FALSE, verbose = TRUE)

nz_adult <- nz.data.frame("adult")

# Materialize a local copy for exploratory analysis
reg_df <- as.data.frame(nz_adult)

————————————————————————————-

A glance at the data

————————————————————————————–

 

summary(reg_df)
       ID             AGE         WORKCLASS             FNLWGT
 Min.   :    1   Min.   :17.00   Length:32561       Min.   :  12285
 1st Qu.: 8141   1st Qu.:28.00   Class :character   1st Qu.: 117827
 Median :16281   Median :37.00   Mode  :character   Median : 178356
 Mean   :16281   Mean   :38.58                      Mean   : 189778
 3rd Qu.:24421   3rd Qu.:48.00                      3rd Qu.: 237051
 Max.   :32561   Max.   :90.00                      Max.   :1484705
  EDUCATION         EDUCATION_NUM   MARITAL_STATUS      OCCUPATION
 Length:32561       Min.   : 1.00   Length:32561       Length:32561
 Class :character   1st Qu.: 9.00   Class :character   Class :character
 Mode  :character   Median :10.00   Mode  :character   Mode  :character
                    Mean   :10.08
                    3rd Qu.:12.00
                    Max.   :16.00
 RELATIONSHIP           RACE               SEX             CAPITAL_GAIN
 Length:32561       Length:32561       Length:32561       Min.   :    0
 Class :character   Class :character   Class :character   1st Qu.:    0
 Mode  :character   Mode  :character   Mode  :character   Median :    0
                                                          Mean   : 1078
                                                          3rd Qu.:    0
                                                          Max.   :99999
 CAPITAL_LOSS    HOURS_PER_WEEK     INCOME
 Min.   :   0.0   Min.   : 1.00   Length:32561
 1st Qu.:   0.0   1st Qu.:40.00   Class :character
 Median :   0.0   Median :40.00   Mode  :character
 Mean   :  87.3   Mean   :40.44
 3rd Qu.:   0.0   3rd Qu.:45.00
 Max.   :4356.0   Max.   :99.00

———————————————————————————-

str(reg_df)
'data.frame':  32561 obs. of  15 variables:
 $ ID            : int  139 379 619 859 1099 1339 1579 1819 2059 2299 ...
 $ AGE           : int  20 46 52 34 29 71 32 52 19 41 ...
 $ WORKCLASS     : chr  "Private" "Self-emp-not-inc" "Self-emp-inc" "Private" ...
 $ FNLWGT        : int  34310 80914 51048 188798 260729 269708 244268 185407 354104 129865 ...
 $ EDUCATION     : chr  "Some-college" "Masters" "Bachelors" "Bachelors" ...
 $ EDUCATION_NUM : int  10 14 13 13 9 13 13 9 9 9 ...
 $ MARITAL_STATUS: chr  "Never-married" "Divorced" "Married-civ-spouse" "Never-married" ...
 $ OCCUPATION    : chr  "Sales" "Exec-managerial" "Sales" "Prof-specialty" ...
 $ RELATIONSHIP  : chr  "Own-child" "Not-in-family" "Husband" "Own-child" ...
 $ RACE          : chr  "White" "White" "White" "White" ...
 $ SEX           : chr  "Male" "Male" "Male" "Female" ...
 $ CAPITAL_GAIN  : int  0 0 0 0 0 2329 0 0 0 0 ...
 $ CAPITAL_LOSS  : int  0 0 0 0 1977 0 0 0 0 0 ...
 $ HOURS_PER_WEEK: int  20 30 55 40 25 16 50 40 30 60 ...
 $ INCOME        : chr  "small" "small" "small" "small" ...

———————————————————————————————————–

# Box plots of the two numeric predictors, grouped by the target class
par(mfrow = c(2, 2))

boxplot(AGE ~ INCOME, data = reg_df, col = "green")
title("Age")
boxplot(HOURS_PER_WEEK ~ INCOME, data = reg_df, col = "blue")
title("HOURS_PER_WEEK")

[Figure: Decision_trees_Netezza_pic1 – box plots of AGE and HOURS_PER_WEEK by INCOME]

————————————————————————————————————————-

library(ggplot2)

# I(3) sets a constant point size instead of mapping 3 as an aesthetic
qplot(AGE, HOURS_PER_WEEK, data = reg_df, colour = INCOME, size = I(3))

[Figure: Decision_trees_Netezza_pic2 – scatter plot of AGE vs. HOURS_PER_WEEK coloured by INCOME]

———————————————————————————————————————–

Model Creation

# Build the decision tree using Netezza's in-database analytics
adultTree <- nzDecTree(INCOME ~ AGE + SEX + HOURS_PER_WEEK, nz_adult, id = "ID")
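nzDecTree pushes tree construction into the database. Conceptually, a classification tree of this kind is grown by repeatedly choosing the split that most reduces an impurity measure. The following is an illustrative pure-Python toy sketch of that idea (greedy, Gini-based, depth-limited) with made-up data loosely mirroring AGE and HOURS_PER_WEEK; it is not Netezza's actual algorithm:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Find the (feature index, threshold) minimizing weighted Gini impurity."""
    best, best_score = None, gini(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[j] < t]
            right = [lab for row, lab in zip(X, y) if row[j] >= t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (j, t), score
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split until pure, unimprovable, or at max depth."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    j, t = split
    lo = [(row, lab) for row, lab in zip(X, y) if row[j] < t]
    hi = [(row, lab) for row, lab in zip(X, y) if row[j] >= t]
    return (j, t,
            build_tree([r for r, _ in lo], [l for _, l in lo], depth + 1, max_depth),
            build_tree([r for r, _ in hi], [l for _, l in hi], depth + 1, max_depth))

def predict(tree, row):
    """Walk branch nodes (tuples) until a leaf (class label) is reached."""
    while isinstance(tree, tuple):
        j, t, lo, hi = tree
        tree = lo if row[j] < t else hi
    return tree

# Toy rows: (AGE, HOURS_PER_WEEK) -> INCOME class
X = [(20, 20), (22, 40), (25, 35), (30, 60), (45, 50), (52, 55), (60, 40), (33, 20)]
y = ["small", "small", "small", "large", "large", "large", "large", "small"]

tree = build_tree(X, y)
print(predict(tree, (50, 45)))  # → large
print(predict(tree, (21, 38)))  # → small
```

The real implementation handles categorical splits (such as SEX in the model above), missing values, and in-database parallelism, but the greedy split-and-recurse structure is the same.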

Interpretation of the Decision Tree

plot(adultTree)

 

[Figure: Decision_trees_Netezza_pic3 – plot of the fitted decision tree]


print(adultTree)

node), split, n, deviance, yval, (yprob)
      * denotes terminal node

 1) root 32561 NA small ( 0.240809… 0.759190… )
   2) AGE < 27 8031 NA small ( 0.032125… 0.967874… )
     4) AGE < 23 4772 NA small ( 0.006286… 0.993713… ) *
     5) AGE > 23 3259 NA small ( 0.069960… 0.930039… )
      10) HOURS_PER_WEEK < 41 2436 NA small ( 0.045977… 0.954022… ) *
      11) HOURS_PER_WEEK > 41 823 NA small ( 0.140947… 0.859052… ) *
   3) AGE > 27 24530 NA small ( 0.309131… 0.690868… )
     6) SEX=Female 7366 NA small ( 0.150963… 0.849036… )
      12) HOURS_PER_WEEK < 44 6081 NA small ( 0.128432… 0.871567… ) *
      13) HOURS_PER_WEEK > 44 1285 NA small ( 0.257587… 0.742412… ) *
     7) SEX<>Female 17164 NA small ( 0.377010… 0.622989… )
      14) HOURS_PER_WEEK < 41 10334 NA small ( 0.299883… 0.700116… )
        28) AGE < 36 3129 NA small ( 0.187919… 0.812080… ) *
        29) AGE > 36 7205 NA small ( 0.348507… 0.651492… )
          58) HOURS_PER_WEEK < 34 1124 NA small ( 0.158362… 0.841637… ) *
          59) HOURS_PER_WEEK > 34 6081 NA small ( 0.383654… 0.616345… ) *
      15) HOURS_PER_WEEK > 41 6830 NA small ( 0.493704… 0.506295… )
        30) AGE < 35 1925 NA small ( 0.346493… 0.653506… )
          60) AGE < 29 417 NA small ( 0.235011… 0.764988… ) *
          61) AGE > 29 1508 NA small ( 0.377320… 0.622679… ) *
        31) AGE > 35 4905 NA large ( 0.551478… 0.448521… ) *

——————————————————————————————————————

Model Prediction

adultPred <- predict(adultTree, nz_adult, id = "ID")

head(adultPred)

  ID CLASS
1  1 small
2  2 small
3  3 small
4  4 small
5  5 small
6  6 small

Confusion Matrix

We will use Netezza’s built-in confusion-matrix stored procedures to evaluate the accuracy of the above model.

t <- nzQuery("EXECUTE NZA..CONFUSION_MATRIX('intable=adult, resulttable=adultpredtab, id=id, target=income, matrixTable=adult_cm')")

The above command, run from RStudio, calls the CONFUSION_MATRIX stored procedure, which tabulates actual against predicted classes. The resulting confusion matrix is stored in the table named adult_cm.

We then generate the confusion-matrix statistics:

z <- nzQuery("EXECUTE NZA..CMATRIX_STATS('matrixTable=adult_cm')")

head(z)

CMATRIX_STATS

1 class -> large\n\tTrue Positive Rate (sensitivity/recall): 0.344982\n\tFalse Positive Rate: 0.088997\n\tPositive Predictive Value (precision): 0.551478\n\tF-Measure: 0.4244472333311\n\nclass -> small\n\tTrue Positive Rate (sensitivity/recall): 0.911003\n\tFalse Positive Rate: 0.655018\n\tPositive Predictive Value (precision): 0.81429\n\tF-Measure: 0.85993582872011\n\n—————————————\nCorrectly Classified Instances: 25225\nIncorrectly Classified Instances: 7336\nAccuracy: 77.47 %\nWeighted Accuracy: 62.79925 %\n

Below are the confusion matrix and its statistics for the decision tree model.

DBATEST.ADMIN(ADMIN)=> select * from adult_cm;

 REAL  | PREDICTION |  CNT
-------+------------+-------
 large | large      |  2705
 large | small      |  5136
 small | small      | 22520
 small | large      |  2200

class -> large
    True Positive Rate (sensitivity/recall): 0.344982
    False Positive Rate: 0.088997
    Positive Predictive Value (precision): 0.551478
    F-Measure: 0.4244472333311

class -> small
    True Positive Rate (sensitivity/recall): 0.911003
    False Positive Rate: 0.655018
    Positive Predictive Value (precision): 0.81429
    F-Measure: 0.85993582872011
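As a sanity check, the statistics above follow directly from the four counts in adult_cm. A plain arithmetic sketch (the variable names are ours), taking "large" as the positive class:

```python
# Counts copied from the adult_cm confusion matrix (real vs. predicted)
tp_large = 2705   # real large, predicted large
fn_large = 5136   # real large, predicted small
tn_large = 22520  # real small, predicted small
fp_large = 2200   # real small, predicted large

total = tp_large + fn_large + tn_large + fp_large  # 32561 rows

recall_large = tp_large / (tp_large + fn_large)       # 0.344982
fpr_large = fp_large / (fp_large + tn_large)          # 0.088997
precision_large = tp_large / (tp_large + fp_large)    # 0.551478
accuracy = (tp_large + tn_large) / total              # correctly classified / total

print(round(recall_large, 6), round(fpr_large, 6), round(precision_large, 6))
print(f"Accuracy: {accuracy:.2%}")  # → Accuracy: 77.47%
```

This confirms the CMATRIX_STATS output: 25225 correctly classified instances out of 32561, i.e. 77.47% accuracy.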

Thus we see how model generation, prediction, and accuracy evaluation can be pushed down entirely to the Netezza database by taking advantage of Netezza Analytics and the R nza/nzr packages.

 

 
