Predictive Analytics – Naive Bayes using OpenR and IBM Netezza

Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.

More details on Naive Bayes are available at https://en.wikipedia.org/wiki/Naive_Bayes_classifier
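For reference, the independence assumption can be written out as a formula. The symbols here are notation introduced for illustration: C_k is a class, x_1, ..., x_n are the feature values of one instance, and ŷ is the predicted class.

```latex
P(C_k \mid x_1, \dots, x_n) \;\propto\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k),
\qquad
\hat{y} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

The product over features is where the "naive" assumption enters: given the class, each feature is treated as independent of the others.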

Problem Description

We will look at the Iris Plants data set from the UCI Repository, perhaps the best-known data set in the pattern recognition literature. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other two. There are 4 numeric attributes, and the response variable is the class (species) of each instance: Iris-setosa, Iris-versicolor, or Iris-virginica.

In this exercise, we will use the Naive Bayes method for this classification problem. The nzNaiveBayes() function is used for the modeling.
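To make the method concrete without a Netezza appliance, here is a minimal local sketch of a Gaussian naive Bayes classifier in base R, run on R's built-in iris data. It assumes each numeric attribute is normally distributed within a class, which is not necessarily how nzNaiveBayes() handles numeric columns, and the helper names gnb_fit() and gnb_predict() are made up for this illustration.

```r
# Gaussian naive Bayes in base R on the built-in iris data.
# Assumption: each numeric feature is normally distributed within a class.
# gnb_fit() and gnb_predict() are hypothetical helper names.

gnb_fit <- function(x, y) {
  y <- factor(y)
  classes <- levels(y)
  per_class <- function(stat) {
    # Result has one row per class and one column per feature
    t(sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, stat)))
  }
  list(classes = classes,
       prior = as.vector(table(y)) / length(y),  # class frequencies
       mean = per_class(mean),
       sd = per_class(sd))
}

gnb_predict <- function(model, x) {
  x <- as.matrix(x)
  n_class <- length(model$classes)
  # scores[i, k] = log prior of class k + sum of log densities for row i
  scores <- matrix(0, nrow = nrow(x), ncol = n_class)
  for (k in seq_len(n_class)) {
    scores[, k] <- log(model$prior[k])
    for (j in seq_len(ncol(x))) {
      scores[, k] <- scores[, k] +
        dnorm(x[, j], model$mean[k, j], model$sd[k, j], log = TRUE)
    }
  }
  # Pick the class with the highest score for each row
  model$classes[max.col(scores, ties.method = "first")]
}

features <- iris[, 1:4]
model <- gnb_fit(features, iris$Species)
pred <- gnb_predict(model, features)
train_accuracy <- mean(pred == iris$Species)
print(train_accuracy)
```

Working with log probabilities avoids numerical underflow when many small densities are multiplied. Note that this scores the model on the data it was trained on, so the printed figure is a training accuracy, not a test accuracy.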

Data Engineering

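The Iris data is downloaded from the UCI Repository and uploaded to Netezza using R. It is then split into two datasets: training (iris_train_data, 91 instances) and testing (iris_test_data, 58 instances). These total 149 rather than 150 rows; the column names X5_1, X3_5, X1_4 and X0_2 in the output below suggest that the first data row was read as a header.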
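The post does not say how the split was made. One common approach is a random split by row index, sketched below in base R on the built-in iris data. The seed, the *_local names, and the use of random sampling are assumptions for illustration, and because the built-in data has 150 rows the test set here has 59 rows rather than 58.

```r
# Random train/test split in base R (illustrative; not the post's actual split).
set.seed(42)  # hypothetical seed, only to make the sketch reproducible

iris_local <- iris
iris_local$ROW_NUMBER <- seq_len(nrow(iris_local))  # ID column, as in the post

train_idx <- sample(nrow(iris_local), 91)   # 91 training rows, as in the post
iris_train_local <- iris_local[train_idx, ]
iris_test_local <- iris_local[-train_idx, ]  # the remaining rows

print(c(train = nrow(iris_train_local), test = nrow(iris_test_local)))
```

Keeping an explicit ROW_NUMBER column matters later: the model and prediction calls identify rows by this ID, and it is what allows predictions to be matched back to the actual labels.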

—————————————————————————————————

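library(nzr)  # assumes a Netezza connection has already been opened with nzConnect()

# point to the training and testing tables in the database

iris_train = nz.data.frame("iris_train_data")

iris_test = nz.data.frame("iris_test_data")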

head(iris_train)

SELECT UPPER(objtype) as "TYPE" FROM _v_obj_relation WHERE objname = 'IRIS_TRAIN_DATA'

SELECT "X5_1","X3_5","X1_4","X0_2","Species","ROW_NUMBER" FROM IRIS_TRAIN_DATA ORDER BY rowid  LIMIT 6

SELECT objid FROM _v_table WHERE tablename = 'IRIS_TRAIN_DATA'

SELECT attname, atttype FROM _v_relation_column_def WHERE objid = 1448908 AND attnum > 0 ORDER BY attnum

  X5_1 X3_5 X1_4 X0_2         Species ROW_NUMBER
1  7.2  3.2  6.0  1.8  Iris-virginica        125
2  7.2  3.0  5.8  1.6  Iris-virginica        129
3  5.5  4.2  1.4  0.2     Iris-setosa         33
4  5.5  3.5  1.3  0.2     Iris-setosa         36
5  5.5  2.3  4.0  1.3 Iris-versicolor         53
6  5.5  2.4  3.8  1.1 Iris-versicolor         80

—————————————————————————————————

Modeling and Prediction

# begin to fit the model

fit_iris = nzNaiveBayes(Species ~ ., iris_train, id="ROW_NUMBER")

# predict the Species for the testing data with the fitted Naive Bayes model

iris_test_pred = predict(fit_iris, iris_test, id = "ROW_NUMBER")

—————————————————————————————————

# download the prediction results and sort by id column

prediction = as.data.frame(iris_test_pred)

prediction = prediction[order(prediction[,1]),]

——————————————————————————————————-

# download the original data and sort by ID column

orig_data = as.data.frame(iris_test[, c("ROW_NUMBER","Species")])

orig_data = orig_data[order(orig_data[,1]),]

—————————————————————————————————

# create contingency table for the actual and predicted values

con_tab = table(Actual = orig_data[,2], predicted = prediction[,2])

con_tab

                 predicted
Actual            Iris-setosa Iris-versicolor Iris-virginica
  Iris-setosa              22               0              0
  Iris-versicolor           0              20              0
  Iris-virginica            0               2             14

————————————————————————————————————————-

From the contingency table of actual versus predicted values, only 2 of the 58 test instances are misclassified: both are Iris-virginica plants predicted as Iris-versicolor. This corresponds to an accuracy of 56/58, or about 96.6%.
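The table can also be summarized per class. The sketch below re-enters the reported counts by hand as a matrix, so it runs without a Netezza connection, and computes overall accuracy plus per-class recall and precision. The variable names are illustrative.

```r
# Rebuild the reported contingency table from its printed counts.
classes <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")
con_tab <- matrix(c(22,  0,  0,
                     0, 20,  0,
                     0,  2, 14),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Actual = classes, predicted = classes))

# Correct predictions sit on the diagonal.
accuracy  <- sum(diag(con_tab)) / sum(con_tab)
recall    <- diag(con_tab) / rowSums(con_tab)  # share of each actual class found
precision <- diag(con_tab) / colSums(con_tab)  # share of each predicted class that is right

print(round(accuracy, 4))
print(round(recall, 3))
print(round(precision, 3))
```

With these counts, recall is 14/16 = 0.875 for Iris-virginica and 1 for the other two classes, while precision for Iris-versicolor is 20/22, about 0.909. With only 58 test rows, these figures should be read as rough estimates.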

 
