Predictive Analytics – Market Basket Analysis using OpenR and Netezza

 

Associative Rules Mining (Market Basket Analysis)

Associative rule mining is a method for discovering hidden relationships between variables in large data sets. It identifies strong rules in a database using measures of interestingness such as support, confidence and lift.

Example

{milk, bread} => {butter}

This rule says that customers who buy milk and bread together are also likely to buy butter. Supermarkets can use such relationships to make inventory and stocking decisions.

Associative Rules Mining is also popularly known as Market Basket Analysis.
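
To make the example rule concrete, here is a minimal base R sketch that counts how often it holds in a handful of made-up baskets. The baskets list and the contains_all() helper are hypothetical and exist only for illustration; they are not part of the Netezza workflow.

```r
# Toy baskets (made-up data, for illustration only)
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "bread", "butter", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs")
)

# TRUE for each basket that contains every item in `items`
contains_all <- function(basket_list, items) {
  vapply(basket_list, function(b) all(items %in% b), logical(1))
}

lhs <- c("milk", "bread")
rhs <- "butter"

has_lhs  <- contains_all(baskets, lhs)
has_both <- contains_all(baskets, c(lhs, rhs))

# Share of all baskets that contain milk, bread and butter together
rule_support <- mean(has_both)
# Share of the milk-and-bread baskets that also contain butter
rule_confidence <- sum(has_both) / sum(has_lhs)

print(c(support = rule_support, confidence = rule_confidence))
```

In this toy data, two of the five baskets contain all three items, and two of the three milk-and-bread baskets also contain butter.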

Problem Statement

In this exercise we work through an example of associative rules mining on the open source retail data set. The nzArule() function is a wrapper around the Netezza Analytics ARULE stored procedure. It assumes that the input table is in the form (TID, ITEMID), that is, one row per transaction ID and item ID pair.

Data Engineering and Analysis

The input data for the algorithm is a table called retail, stored in the Netezza database dbatest.

DBATEST.ADMIN(ADMIN)=> SELECT * FROM RETAIL LIMIT 100;

 TID | ITEM
-----+------
 238 |   38
 238 |   39
 238 |   41
 238 |   48
 238 |  105
 238 | 1172
 238 | 1173
 478 |  381
 478 |  846
 478 | 2003
 478 | 2004
 478 | 2005
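
The (TID, ITEM) layout stores one row per purchased item, so a basket is simply all rows that share a TID. As a rough local illustration, the 12 rows shown above can be regrouped into baskets with base R. The retail_sample data frame is typed in by hand from the query output and is an assumption of this sketch; nzArule() works on the table in the database instead.

```r
# The 12 rows from the query output, one row per (transaction, item) pair
retail_sample <- data.frame(
  TID  = c(rep(238, 7), rep(478, 5)),
  ITEM = c(38, 39, 41, 48, 105, 1172, 1173, 381, 846, 2003, 2004, 2005)
)

# Group the item ids by transaction id: one vector of items per basket
baskets <- split(retail_sample$ITEM, retail_sample$TID)

# Number of items in each basket
basket_sizes <- lengths(baskets)

print(baskets)
print(basket_sizes)
```

Transaction 238 ends up as a basket of seven items and transaction 478 as a basket of five.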

 

Model Creation

# load the necessary packages (R is case-sensitive: library, not Library)
library(nzr)
library(nza)

# connect to the dbatest database
nzConnectDSN('DBATEST', force = TRUE, verbose = TRUE)

# define a Netezza data frame nzretail that points to the retail table
nzretail = nz.data.frame("retail")

# generate the associative rules model
mbk_rules = nzArule(nzretail, "TID", "ITEM")

# close the database connection
nzDisconnect()

Model Analysis

We will use the overloaded functions print(), summary(), inspect() and sort() to evaluate the model we created.

The output below shows that the association rules algorithm generated 14 rules.

> print(mbk_rules)
set of 14 rules

> summary(mbk_rules)
set of 14 rules

rule length distribution (lhs + rhs):sizes
2 3
8 6

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   2.000   2.429   3.000   3.000

summary of quality measures:
    support          confidence          lift
 Min.   :0.06127   Min.   :0.5094   Min.   :0.9698
 1st Qu.:0.07280   1st Qu.:0.5788   1st Qu.:1.1579
 Median :0.09062   Median :0.6421   Median :1.2187
 Mean   :0.12253   Mean   :0.6447   Mean   :1.2246
 3rd Qu.:0.11358   3rd Qu.:0.6868   3rd Qu.:1.3344
 Max.   :0.33055   Max.   :0.8168   Max.   :1.4210

mining info:
                            data ntransactions support confidence             model
 SELECT "TID","ITEM" FROM RETAIL        908576       5        0.5 RETAIL_MODEL21617

Above, we see the quartile information for the support, confidence and lift values.
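
As a cross-check, the support column of the summary can be reproduced locally from the 14 support values that inspect() reports for the same model. This is only a sketch in base R with the values typed in by hand; rule_support is a name chosen for this example.

```r
# Support of each of the 14 rules, copied from the inspect() output
rule_support <- c(
  0.33055058, 0.12946621, 0.06127356, 0.06921349, 0.11734080,
  0.09590300, 0.08355074, 0.10228897, 0.09010685, 0.33055058,
  0.09112770, 0.06921349, 0.08355074, 0.06127356
)

# Minimum, quartiles and maximum (R's default quantile type 7)
support_quartiles <- quantile(rule_support, probs = c(0, 0.25, 0.5, 0.75, 1))
support_mean <- mean(rule_support)

print(round(support_quartiles, 5))
print(round(support_mean, 5))
```

Rounded to five decimals, the quartiles and the mean agree with the support column of the summary.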

> inspect(mbk_rules)
   lhs        rhs  support    confidence lift
1  {48}    => {39} 0.33055058 0.6916340  1.2032726
2  {41}    => {39} 0.12946621 0.7637337  1.3287082
3  {32,48} => {39} 0.06127356 0.6723923  1.1697968
4  {38,48} => {39} 0.06921349 0.7681269  1.3363513
5  {38}    => {39} 0.11734080 0.6633111  1.1539977
6  {32}    => {39} 0.09590300 0.5574603  0.9698434
7  {41,48} => {39} 0.08355074 0.8168108  1.4210493
8  {41}    => {48} 0.10228897 0.6034125  1.2625621
9  {38}    => {48} 0.09010685 0.5093614  1.0657723
10 {39}    => {48} 0.33055058 0.5750765  1.2032726
11 {32}    => {48} 0.09112770 0.5297026  1.1083338
12 {38,39} => {48} 0.06921349 0.5898502  1.2341847
13 {39,41} => {48} 0.08355074 0.6453478  1.3503063
14 {32,39} => {48} 0.06127356 0.6389119  1.3368399

 

We see some association rules above. For example, when {48} occurs, {39} also occurs with a confidence of 0.69 and a lift of 1.2. The two items appear together in about 33% of all transactions (support 0.33).
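
To see which rules stand out, the table can be ranked by lift. The sketch below does this in base R on a hand-typed copy of the inspect() output; rules_df is a hypothetical helper data frame, not something nzArule() returns. On the rules object itself, sort() serves the same purpose.

```r
# Rules copied by hand from the inspect() output
rules_df <- data.frame(
  lhs = c("{48}", "{41}", "{32,48}", "{38,48}", "{38}", "{32}", "{41,48}",
          "{41}", "{38}", "{39}", "{32}", "{38,39}", "{39,41}", "{32,39}"),
  rhs = c(rep("{39}", 7), rep("{48}", 7)),
  confidence = c(0.6916340, 0.7637337, 0.6723923, 0.7681269, 0.6633111,
                 0.5574603, 0.8168108, 0.6034125, 0.5093614, 0.5750765,
                 0.5297026, 0.5898502, 0.6453478, 0.6389119),
  lift = c(1.2032726, 1.3287082, 1.1697968, 1.3363513, 1.1539977,
           0.9698434, 1.4210493, 1.2625621, 1.0657723, 1.2032726,
           1.1083338, 1.2341847, 1.3503063, 1.3368399),
  stringsAsFactors = FALSE
)

# Strongest associations first
by_lift <- rules_df[order(rules_df$lift, decreasing = TRUE), ]
print(head(by_lift, 3))

# Rules with lift below 1: the items co-occur less often than
# they would if they were independent
weak <- rules_df[rules_df$lift < 1, ]
print(weak)
```

The rule {41,48} => {39} ranks first, and {32} => {39} is the only rule with a lift below 1.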

Below is a quick description of the support, confidence and lift measures.

Support

The support of an item-set X with respect to a set of transactions T is the proportion of transactions in T that contain X.

Confidence

The confidence of a rule X => Y with respect to a set of transactions T is the proportion of the transactions containing X that also contain Y.

Lift

The lift of a rule is defined as the ratio of the observed support to that expected if X and Y were independent.
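
These three definitions fit together arithmetically, which can be checked with the numbers reported for rules 1 and 10 in the inspect() output. The following base R sketch recovers the individual supports of {48} and {39} from the two confidences and then recomputes the lift; the variable names are chosen for this example only.

```r
# Values reported by inspect() for rule 1 ({48} => {39}) and rule 10 ({39} => {48})
supp_both  <- 0.33055058  # support of {48} and {39} together
conf_48_39 <- 0.6916340   # confidence of {48} => {39}
conf_39_48 <- 0.5750765   # confidence of {39} => {48}

# confidence(X => Y) = support(X and Y) / support(X), so:
supp_48 <- supp_both / conf_48_39
supp_39 <- supp_both / conf_39_48

# lift = observed joint support / support expected under independence
lift <- supp_both / (supp_48 * supp_39)

print(round(c(supp_48 = supp_48, supp_39 = supp_39, lift = lift), 4))
```

The recomputed lift of about 1.2033 agrees with the value that inspect() reports for both rules, which is expected because lift is symmetric in X and Y.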

Again, we see how model creation was pushed down to the Netezza appliance, which allows the analysis to scale to very large tables.

Association Rule Graphics

Because the Netezza association rule model we created is an object of class rules, we can pass it directly to the plotting functions of the arulesViz R package.

> library(arulesViz)
> plot(mbk_rules)
> plot(mbk_rules, method="grouped")

[Figure: netezza_associative_rules_mining_pic2]

[Figure: netezza_associative_rules_mining_pic1]

Author : Abhik Roy
