Predictive Analytic s – K means Clustering using OpenR and IBM Netezza

K-Means Clustering

The K-means Clustering is a grouping or clustering algorithm used to group data based on similar attributes. In this case study, we are using a data set called auto-data.csv. The goal of our implementation is to group the cars into 3 and 5 clusters based on their attributes using Netezza’s built in Analytics.

Techniques Used

  • K-Means Clustering
  • Centering and Scaling

Data Engineering and Analysis

We first load the nzr and nza libraries, set the working directory and then load the auto-data-km.csv file from our desktop to the Netezza database as a table called auto_data_km. We use the as.nz.data.frame for this purpose.

——————————————————————————————

# Load required libraries

library(nzr)

library(nza)

#Set your working directory

setwd(“C:/Users/user1/Documents/Netezza R case study”)

# Connect to the Netezza database

nzConnectDSN(‘DBATEST’, force = TRUE, verbose = TRUE)

# Load the auto-data-km.csv file into the Netezza database in a table called auto_data_km

auto_data <- read.csv(“auto-data-km.csv”)

d <- as.nz.data.frame(auto_data, ‘auto_data_km’,clear.existing=TRUE, fast=TRUE) # use fast mode

————————————————————————————————————–

Netezza creates  the below table to hold the auto data

////////////////////////////////////////////////////////////////

CREATE TABLE    AUTO_DATA_KM

(

ID                            INTEGER                                                                       ,

MAKE                          CHARACTER VARYING(13)                                                         ,

FUELTYPE                      CHARACTER VARYING(6)                                                          ,

ASPIRE                        CHARACTER VARYING(5)                                                          ,

DOORS                         CHARACTER VARYING(4)                                                          ,

BODY                          CHARACTER VARYING(11)                                                         ,

DRIVE                         CHARACTER VARYING(3)                                                          ,

CYLINDERS                     CHARACTER VARYING(6)                                                          ,

HP                            INTEGER                                                                       ,

RPM                           INTEGER                                                                       ,

MPG_CITY                      INTEGER                                                                       ,

MPG_HWY                       INTEGER                                                                       ,

PRICE                         INTEGER

) DISTRIBUTE ON Hash ( ID  )    ORGANIZE ON None

/////////////////////////////////////////////////////////////////////////

A quick look at the data as stored in the Netezza database

—————————————————————————————————————

DBATEST.ADMIN(ADMIN)=> select * from auto_data_km limit 3;

ID  |    MAKE    | FUELTYPE | ASPIRE | DOORS |   BODY    | DRIVE | CYLINDERS | HP  | RPM  | MPG_CITY | MPG_HWY | PRICE

—–+————+———-+——–+——-+———–+——-+———–+—–+——+———-+———+——-

25     | mazda         | gas            | std        | four        | sedan      | fwd           | four        |  68   | 5000   |       31       |      38            |  6695

114   | peugot         | gas            | std         | four        | sedan     | rwd           | four        |  97   | 5000    |       1 9     |      24            | 11900

118    | Mitsubishi | gas           | turbo    | two          | hatchback | fwd      | four         | 145  | 5000    |       19        |      24         | 12629

———————————————————————————————————

str(auto_data)

‘data.frame’:                   197 obs. of  13 variables:

$ ID       : int  1 2 3 4 5 6 7 8 9 10 …

$ MAKE     : Factor w/ 21 levels “alfa-romero”,..: 18 4 9 19 12 6 13 5 15 9 …

$ FUELTYPE : Factor w/ 2 levels “diesel”,”gas”: 2 2 2 2 2 2 2 2 2 2 …

$ ASPIRE   : Factor w/ 2 levels “std”,”turbo”: 1 1 1 1 1 1 1 1 1 1 …

$ DOORS    : Factor w/ 2 levels “four”,”two”: 2 2 2 2 2 2 2 2 2 2 …

$ BODY     : Factor w/ 5 levels “convertible”,..: 3 3 3 3 3 3 4 3 3 3 …

$ DRIVE    : Factor w/ 3 levels “4wd”,”fwd”,”rwd”: 2 2 2 2 2 2 2 2 2 2 …

$ CYLINDERS: Factor w/ 7 levels “eight”,”five”,..: 3 5 3 3 3 3 3 3 3 3 …

$ HP       : int  69 48 68 62 68 60 69 68 68 68 …

$ RPM      : int  4900 5100 5000 4800 5500 5500 5200 5500 5500 5000 …

$ MPG.CITY : int  31 47 30 35 37 38 31 37 37 31 …

$ MPG.HWY  : int  36 53 31 39 41 42 37 41 41 38 …

$ PRICE    : int  5118 5151 5195 5348 5389 5399 5499 5572 5572 6095 …

————————————————————————————————————-

summary(auto_data)

ID              MAKE      FUELTYPE     ASPIRE     DOORS              BODY

Min.   :  1   toyota    :32   diesel: 19   std  :162   four:112   convertible: 6

1st Qu.: 50   nissan    :18   gas   :178   turbo: 35   two : 85   hardtop    : 8

Median : 99   mazda     :16                                       hatchback  :67

Mean   : 99   honda     :13                                       sedan      :92

3rd Qu.:148   mitsubishi:13                                       wagon      :24

Max.   :197   subaru    :12

(Other)   :93

DRIVE      CYLINDERS         HP             RPM          MPG.CITY        MPG.HWY

4wd:  8   eight :  4   Min.   : 48.0   Min.   :4150   Min.   :13.00   Min.   :16.00

fwd:114   five  : 10   1st Qu.: 70.0   1st Qu.:4800   1st Qu.:19.00   1st Qu.:25.00

rwd: 75   four  :153   Median : 95.0   Median :5200   Median :24.00   Median :30.00

six   : 24   Mean   :103.6   Mean   :5118   Mean   :25.15   Mean   :30.63

three :  1   3rd Qu.:116.0   3rd Qu.:5500   3rd Qu.:30.00   3rd Qu.:34.00

twelve:  1   Max.   :262.0   Max.   :6600   Max.   :49.00   Max.   :54.00

two   :  4

PRICE

Min.   : 5118

1st Qu.: 7775

Median :10345

Mean   :13280

3rd Qu.:16503

Max.   :45400

———————————————————————————————————–

head(auto_data)

ID       MAKE FUELTYPE ASPIRE DOORS      BODY DRIVE CYLINDERS HP  RPM MPG.CITY

1  1     subaru      gas    std   two hatchback   fwd      four 69 4900       31

2  2  chevrolet      gas    std   two hatchback   fwd     three 48 5100       47

3  3      mazda      gas    std   two hatchback   fwd      four 68 5000       30

4  4     toyota      gas    std   two hatchback   fwd      four 62 4800       35

5  5 mitsubishi      gas    std   two hatchback   fwd      four 68 5500       37

6  6      honda      gas    std   two hatchback   fwd      four 60 5500       38

MPG.HWY PRICE

1      36  5118

2      53  5151

3      31  5195

4      39  5348

5      41  5389

6      42  5399

—————————————————————————————————————

Since a clustering algorithm needs the attributes to be in the same scale, we shall scale the data and load it into a table called auto_data_km_sc.

—————————————————————————————————–

# Scale columns 9 to 13

scaled_num <- scale( auto_data[9:13])

#put the attributes back into the main data frame

auto_data[,9:13] <- scaled_num

#Reload the data into a table called auto_data_km_sc

d <- as.nz.data.frame(auto_data, ‘auto_data_km_sc’,clear.existing=TRUE, fast=TRUE)

——————————————————————————————————————-

DBATEST.ADMIN(ADMIN)=> select * from auto_data_km_sc limit 2;

ID  |  MAKE  | FUELTYPE | ASPIRE | DOORS |   BODY    | DRIVE | CYLINDERS |        HP         |        RPM        |     MPG_CITY      |      MPG_HWY      |       PRICE

—–+——–+———-+——–+——-+———–+——-+———–+——————-+——————-+——————-+——————-+——————-

33 | subaru | gas      | std    | two   | hatchback | fwd   | four      | -0.81308998501737 |  -1.4926542563592 |  0.13167657450282 | 0.054204848653784 | -0.77732645108611

105 | toyota | gas      | std    | two   | hardtop   | rwd   | four      |  0.32933583403756 | -0.66111551191871 | -0.17898552342599 |  -0.0920739894941 | -0.25974505099367

————————————————————————————————————————-

summary(auto_data)

ID              MAKE      FUELTYPE     ASPIRE     DOORS              BODY

Min.   :  1   toyota    :32   diesel: 19   std  :162   four:112   convertible: 6

1st Qu.: 50   nissan    :18   gas   :178   turbo: 35   two : 85   hardtop    : 8

Median : 99   mazda     :16                                       hatchback  :67

Mean   : 99   honda     :13                                       sedan      :92

3rd Qu.:148   mitsubishi:13                                       wagon      :24

Max.   :197   subaru    :12

(Other)   :93

DRIVE      CYLINDERS         HP               RPM             MPG.CITY

4wd:  8   eight :  4   Min.   :-1.4773   Min.   :-2.0124   Min.   :-1.8876

fwd:114   five  : 10   1st Qu.:-0.8928   1st Qu.:-0.6611   1st Qu.:-0.9556

rwd: 75   four  :153   Median :-0.2286   Median : 0.1704   Median :-0.1790

six   : 24   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000

three :  1   3rd Qu.: 0.3293   3rd Qu.: 0.7941   3rd Qu.: 0.7530

twelve:  1   Max.   : 4.2083   Max.   : 3.0808   Max.   : 3.7043

two   :  4

MPG.HWY             PRICE

Min.   :-2.13998   Min.   :-1.0189

1st Qu.:-0.82347   1st Qu.:-0.6872

Median :-0.09207   Median :-0.3664

Mean   : 0.00000   Mean   : 0.0000

3rd Qu.: 0.49304   3rd Qu.: 0.4024

Max.   : 3.41862   Max.   : 4.0099

——————————————————————————————————-

head(auto_data)

ID       MAKE FUELTYPE ASPIRE DOORS      BODY DRIVE CYLINDERS         HP         RPM

1  1     subaru      gas    std   two hatchback   fwd      four -0.9193622 -0.45323083

2  2  chevrolet      gas    std   two hatchback   fwd     three -1.4772910 -0.03746145

3  3      mazda      gas    std   two hatchback   fwd      four -0.9459302 -0.24534614

4  4     toyota      gas    std   two hatchback   fwd      four -1.1053385 -0.66111551

5  5 mitsubishi      gas    std   two hatchback   fwd      four -0.9459302  0.79407729

6  6      honda      gas    std   two hatchback   fwd      four -1.1584745  0.79407729

MPG.CITY    MPG.HWY      PRICE

1 0.9083318 0.78559904 -1.0188894

2 3.3936286 3.27233929 -1.0147697

3 0.7530008 0.05420485 -1.0092768

4 1.5296560 1.22443555 -0.9901765

5 1.8403181 1.51699323 -0.9850581

6 1.9956492 1.66327207 -0.9838097

————————————————————————————————————–

As we see above, the data for column numbers  9 to 13 has been scaled to bring them to the same scale level.

The below box plots in R studio shows how the HP, RPM, MPG.CITY, MPG.HWY and PRICE columns have been scaled.

———————————————————————————————————————-

par(mfrow=c(1,5))

boxplot( auto_data$HP,col=”red”)

title(“HP”)

boxplot( auto_data$RPM,col=”blue”)

title(“RPM”)

boxplot( auto_data$MPG.CITY,col=”green”)

title(“MPG.CITY”)

boxplot( auto_data$MPG.HWY, col=”maroon”)

title(“MPG.HWY”)

boxplot( auto_data$PRICE, col=”cyan”)
title(“PRICE”)

K_means_clustering_Netezza_pic1

————————————————————————————————————————-

Model Creation

We use the R wrapper nzKMeans which is a wrapper for the KMEANS stored procedure.

In this example, we use the K means clustering algorithm to split the data in to 3 and 5 clusters.

t3 = nzKMeans(d, k=3, distance=”euclidean”,  id=”ID”, getLabels=T, randseed=1234)

t5 = nzKMeans(d, k=5, distance=”euclidean”,  id=”ID”, getLabels=T, randseed=1234)

We use the print() overload function to look at the characteristics of the clusters formed

——————————————————————————————————————–

print (t3)

KMeans clustering with 3 clusters of sizes 55, 35, 107

 

Cluster means:

ASPIRE      BODY CYLINDERS DOORS DRIVE FUELTYPE   MAKE         HP   MPG_CITY

1    std     sedan      four  four   rwd      gas  volvo  0.3443105 -0.6026157

2    std hatchback       six   two   rwd      gas nissan  1.5476703 -1.1775423

3    std     sedan      four  four   fwd      gas toyota -0.6832294  0.6949331

MPG_HWY      PRICE         RPM

1 -0.6559124  0.4805895 -0.22077795

2 -1.1285640  1.1430643  0.38127770

3  0.7063077 -0.6209315 -0.01123301

 

Clustering vector:

1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21

3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3

22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42

3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3

43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63

3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3   3

64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84

3   3   3   3   3   1   3   3   3   3   3   3   3   3   3   3   3   1   3   3   3

85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105

3   3   2   3   1   3   3   3   3   1   3   3   3   3   3   3   3   3   2   1   1

106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126

3   3   1   1   3   1   2   1   1   1   3   1   2   2   1   3   2   3   1   1   1

127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147

2   2   2   3   1   3   2   2   2   1   1   1   1   2   1   2   1   2   1   1   2

148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168

2   1   2   1   1   1   3   1   1   2   1   1   1   1   1   2   1   3   2   1   1

169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189

1   1   1   2   1   1   1   2   1   1   1   1   1   1   1   2   1   2   2   2   1

190 191 192 193 194 195 196 197

2   2   2   2   2   2   2   2

 

Within cluster sum of squares by cluster:

[1] 246.1781 223.7094 472.6430

 

Available components:

[1] “cluster”  “centers”  “withinss” “size”     “distance” “model”

————————————————————————————————————

print(t5)

KMeans clustering with 5 clusters of sizes 57, 14, 33, 62, 31

 

Cluster means:

ASPIRE      BODY CYLINDERS DOORS DRIVE FUELTYPE          MAKE         HP   MPG_CITY

1    std     sedan      four  four   rwd      gas         volvo  0.6584067 -0.7703336

2    std     sedan       six   two   rwd      gas mercedes-benz  2.2934732 -1.4882044

3    std hatchback      four   two   fwd      gas         honda -0.1021936 -0.1789855

4    std     sedan      four  four   fwd      gas        toyota -0.5559799  0.3771998

5    std hatchback      four   two   fwd      gas        nissan -1.0256343  1.5246453

MPG_HWY      PRICE         RPM

1 -0.8055041  0.5259462 -0.11952120

2 -1.5026199  2.7261478  0.06648089

3 -0.1408336 -0.3218807  1.26654249

4  0.3892306 -0.5147805 -0.60076189

5  1.5311492 -0.8260176  0.04301004

 

Clustering vector:

1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21

5   5   4   5   5   5   5   5   5   5   5   5   5   5   5   5   5   4   3   5   5

22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42

5   4   4   4   4   5   4   3   4   4   4   4   5   4   3   4   3   3   5   5   4

43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63

4   4   4   4   4   3   5   4   5   5   5   3   4   4   3   3   4   5   5   4   4

64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84

4   4   4   5   4   4   4   4   4   4   3   4   4   4   4   4   3   4   4   3   3

85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105

3   5   3   4   4   3   4   3   4   4   4   4   4   3   3   4   4   4   3   3   4

106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126

4   4   4   4   3   1   3   3   1   1   3   1   1   1   1   3   1   4   1   1   1

127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147

1   1   3   4   1   3   1   1   1   3   3   1   1   3   1   1   1   1   3   1   1

148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168

1   1   1   1   1   1   4   1   1   1   1   1   1   4   1   3   1   4   1   1   1

169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189

1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   2   1   2   2   2   2

190 191 192 193 194 195 196 197

2   2   2   2   2   2   2   2

 

Within cluster sum of squares by cluster:

[1] 248.40660  60.22313 114.20808 187.13469  99.91936

 

Available components:

[1] “cluster”  “centers”  “withinss” “size”     “distance” “model”

————————————————————————————————————-

We can look at the cluster values in multiple ways.

When GetLabels was TRUE

> head(t3$cluster)

1 2 3 4 5 6

3 3 3 3 3 3

If GetLabels in False

 > t3F = nzKMeans(d, k=3, distance=”euclidean”,  id=”ID”, getLabels=F, randseed=1234) > head(t3F$cluster)

ID       CLUSTER_ID      DISTANCE

1       75          3                         1.973554

2      141         1                         1.785687

3      149         1                         1.463274

4       39          3                         2.389416

5       180         1                        2.264715

6       155         1                        2.088582

A final plot of the cluster has been shown below

——————————————————————————————

par(mfrow=c(1,1))

plot(auto_data$HP, auto_data$PRICE, col=t3$cluster, pch=20, cex=2)
points(t3$centers, col=”purple”, pch=17, cex=3)

K_means_clustering_Netezza_pic2

9 thoughts on “Predictive Analytic s – K means Clustering using OpenR and IBM Netezza

  1. That is very attention-grabbing, You’re an excessively skilled blogger. I have joined your feed and stay up for in quest of extra of your fantastic post. Additionally, I’ve shared your web site in my social networks!

  2. Hiya, I am really glad I’ve found this info. Today bloggers publish just about gossips and net and this is really irritating. A good site with exciting content, this is what I need. Thanks for keeping this site, I will be visiting it. Do you do newsletters? Cant find it.

  3. Whats up very cool website!! Man .. Beautiful .. Superb .. I’ll bookmark your web site and take the feeds additionallyKI am satisfied to find a lot of useful info here within the publish, we’d like work out more techniques in this regard, thanks for sharing. . . . . .

  4. I do agree with all the ideas you have presented in your post. They are really convincing and will definitely work. Still, the posts are very short for beginners. Could you please extend them a bit from next time? Thanks for the post.

Leave a Reply

Your email address will not be published. Required fields are marked *