Understanding Complex Relationships Using Apache Spark

Apache Spark can be used to process graph or relationship data. To illustrate a particular use case, consider a grocery store's sales data. We are interested in identifying the grocery item that is most popular. We define popularity as the number of times an item was purchased along with other items. Consider the illustration below.

[Figure: spark_graph_relations_pic1 — grocery items linked by co-purchase]

Here, we see that cheese was bought along with milk, fish, and carrot, but flour was bought only with cheese.

Hence, based on our ground rule for popularity, we can say cheese is more popular than flour, as it was bought along with milk, fish, and carrot.
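To make this ground rule concrete, below is a minimal sketch in plain Python (no Spark yet) that counts co-purchases for the items in the illustration. The baskets are hypothetical and exist only for this example.

/////////////////////////////////////////

# Hypothetical baskets from the illustration: items bought together in one sale
baskets = [
    ["cheese", "milk"],
    ["cheese", "fish"],
    ["cheese", "carrot"],
    ["flour", "cheese"],
]

# Popularity = number of times an item was purchased along with other items
popularity = {}
for basket in baskets:
    for item in basket:
        # every other item in the basket counts as one co-purchase
        popularity[item] = popularity.get(item, 0) + len(basket) - 1

print(popularity)  # cheese co-occurs 4 times, flour only once

/////////////////////////////////////////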

Test Data

To illustrate the point, below is the test data for which we shall write a Python program.

Grocery-Graph.txt

//////////////////////////////

1 13 17 4 6 3

2 11 9 5 8 7

1 4 8 13 12 19 14

3 7 8 9 11 14 15 19

4 6 8 4 16 19

5 5 11 18 19 20

6 7 1 3 6 9

7 3 4 4 5 7 7 7 8 9

8 3 7 7 8 11 11 11

9 1 1 1 5 5 7 8

10 3 4 6 6 6 8 9

//////////////////////////////////////////////////

 

In the above data file, the first column of each row is the id of a grocery item. The remaining columns in the row are the ids of the other grocery items that were bought along with it, whether in a single sale or across different sales. We define the most popular item as the one with the greatest number of other grocery items bought along with it.
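As a quick sanity check on the format, the sketch below parses one row (row 7 from the test data) in plain Python. Note that repeated ids are counted once per appearance, since each appearance is a separate co-purchase; this is exactly the parsing the Spark program performs later.

/////////////////////////////////////////

row = "7 3 4 4 5 7 7 7 8 9"  # row 7 from Grocery-Graph.txt

elements = row.split()
item_id = int(elements[0])        # the grocery item itself: id 7
co_purchases = len(elements) - 1  # everything after the first column

print(item_id, co_purchases)      # 7 9: item 7 has 9 co-purchases, repeats included

/////////////////////////////////////////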

Grocery-Names.txt

/////////////////////////////////////////

1 milk

2 bread

3 cheese

4 hotdog

5 cake

6 coffee

7 rice

8 salt

9 chicken

10 pork

11 fish

12 flour

13 potato

14 eggplant

15 carrot

16 lettuce

17 beetroot

18 chives

19 kidneybeans

20 pepper

//////////////////////////////////////

This is just a lookup table that gives the name of the grocery item for each grocery id.
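Conceptually, this is an id-to-name dictionary. A minimal plain-Python sketch of the same idea (assuming the file sits in the current working directory) could look like the following; the Spark program below instead builds the mapping as an RDD so the lookup can run on the cluster.

/////////////////////////////////////////

# Plain-Python equivalent of the lookup table: grocery id -> grocery name
names = {}
with open("Grocery-Names.txt") as f:
    for line in f:
        fields = line.split()
        names[int(fields[0])] = fields[1]

print(names[1])  # milk

/////////////////////////////////////////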

The program code along with comments

/////////////////////////////////////////////////////////////////

# Import the Spark context and configuration classes
from pyspark import SparkConf, SparkContext

# Set the name of the program to PopularGrocery and create the Spark context,
# running in local mode with default configurations
conf = SparkConf().setMaster("local").setAppName("PopularGrocery")
sc = SparkContext(conf=conf)

# Define a function called countCoOccurences that takes each line and returns
# a pair containing the id of the grocery item (the first column) and the
# number of other items that were purchased along with it.
# e.g.  1 13 17 4 6 3
#       2 11 9 5 8 7
#       1 4 8 13 12 19 14
# will produce
#       (1, 5)
#       (2, 5)
#       (1, 6)
def countCoOccurences(line):
    elements = line.split()
    return (int(elements[0]), len(elements) - 1)

# Define a function called parseNames that creates a pair of the grocery id
# and the corresponding grocery item name
# e.g.  1 milk  becomes  (1, 'milk')
# Note that the data is split on whitespace
def parseNames(line):
    fields = line.split()
    return (int(fields[0]), fields[1])

# Read the Grocery-Names.txt file, apply the parseNames function,
# and create an RDD called namesRdd
names = sc.textFile("file:///SparkCourse/Grocery-Names.txt")
namesRdd = names.map(parseNames)

# Read the Grocery-Graph.txt file into the lines RDD and apply the
# countCoOccurences function, creating an RDD called pairings.
# Perform a reduceByKey operation to add up the co-occurrences of
# all the other grocery items for a particular grocery id.
# e.g.  (1, 5) and (1, 6) will produce (1, 11).
# Then flip the key/value pairs, so (1, 11) becomes (11, 1).
lines = sc.textFile("file:///SparkCourse/Grocery-Graph.txt")
pairings = lines.map(countCoOccurences)
totalGroceryById = pairings.reduceByKey(lambda x, y: x + y)
flipped = totalGroceryById.map(lambda pair: (pair[1], pair[0]))

# Find the grocery id that has the maximum number of co-occurrences
mostPopular = flipped.max()

# Perform a lookup and retrieve the grocery name for the
# corresponding grocery id
mostPopularName = namesRdd.lookup(mostPopular[1])[0]

# Fun part! Print out the results and see what you get
print(mostPopularName + " is the most popular grocery, with " +
      str(mostPopular[0]) + " co-appearances.")

//////////////////////////////////////////////////////////////////////////////
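To try it yourself, save the code to a script and submit it to Spark. The script name popular-grocery.py is only a placeholder; this assumes a local Spark installation with spark-submit on the PATH.

/////////////////////////////////////////

spark-submit popular-grocery.py

/////////////////////////////////////////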

Running the code produces the output below:

milk is the most popular grocery, with 11 co-appearances.
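As a sanity check, milk has id 1, and the two rows in Grocery-Graph.txt that begin with 1 list 5 and 6 other items respectively, so milk's total of 5 + 6 = 11 co-appearances agrees with the output.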

////////////////////////////////////////////

References: Inspired by the excellent training materials produced by Frank Kane

http://frank-kane.com/

///////////////////////////////////////////////////////////////////////////


Author: Abhik Roy

 
