Taming Apache Spark with Python – Programming examples Part 4

This blog is a continuation of Taming Apache Spark with Python – Programming examples Part 3.

Building on Parts 1, 2 and 3, we shall explore some more programming principles using an example. The Apache Spark architecture diagram is provided throughout the series to make it easier to follow the progressive programming examples.

[Figure: Apache Spark architecture diagram]

Program Flow

Apache Spark provides APIs for several programming languages: we can use Python, Scala or Java. Here we are using a Python IDE to develop a Spark program in the Python programming language.

Spark has an onion-layer architecture. When the Python program is run (developed using the Enthought Canopy IDE in this example), it runs on the Spark core engine, which is written in Scala. Scala in turn runs on the Java Virtual Machine, so the program ultimately executes on the JVM.

Programming Principles Explored

  1. flatMap
  2. Regular expressions
  3. Key/value swap

Test Data

The test data in this case is the paragraph below, saved in a file called testfile.txt.

//////////////////////////////////////////////////////////////////////////////////////

This is Python and Spark blogs. Let us see how we could scale python on Spark.

This is going to get more interesting as we see how Python could be used in Machine Learning.

Also python could be used for creating cognitive applications using Natural Language processing.

We will cover more in upcoming blogs. Keep checking updates on www.theanalyticsuniverse.com

///////////////////////////////////////////////////////////////////////////////////////////////////////////

Problem Statement

We will write a Python Spark program that will give us the count of each distinct word in the test data.

The Python Program

///////////////////////////////////////////////////////////////////////////////////

import re
from pyspark import SparkConf, SparkContext

def normalizeWords(text):
    # Split a line into lowercase words on any run of non-word characters
    return re.compile(r'\W+', re.UNICODE).split(text.lower())

conf = SparkConf().setMaster("local").setAppName("CountTheWords")
sc = SparkContext(conf=conf)

input = sc.textFile("file:///sparkcourse/testfile.txt")
words = input.flatMap(normalizeWords)

wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
wordCountsSorted = wordCounts.map(lambda (x, y): (y, x)).sortByKey()
results = wordCountsSorted.collect()

# Note: this listing uses Python 2 syntax (print statement, tuple-unpacking lambda)
for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if word:
        print word + ":\t\t" + count

 ///////////////////////////////////////////////////////////////////////////////////////////////////////////

Program Steps:

////////////////

The first step is to import SparkContext and SparkConf into the Python program. A SparkContext is the main entry point for Spark functionality. It represents a connection to the Spark cluster (or your local machine if running in local mode) and is used to create RDDs, accumulators and broadcast variables on the cluster.
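As an illustrative, hypothetical sketch (the app name and sample data below are made up, not part of the original program), a SparkContext can be used to create an RDD, an accumulator and a broadcast variable like this:

//////////////////////////////////////////

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("ContextDemo")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])          # an RDD built from a local collection
total = sc.accumulator(0)                   # an accumulator
lookup = sc.broadcast({"a": 1, "b": 2})     # a broadcast variable

rdd.foreach(lambda x: total.add(x))
print(total.value)                          # 10
print(lookup.value["a"])                    # 1

//////////////////////////////////////////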

////////////////

We then create a SparkContext called sc, providing the application name CountTheWords. setMaster("local") tells Spark to run the program in local mode. SparkConf() is the Spark configuration object, and apart from the master and the application name we are taking all default configuration values here.
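If you did need to override defaults, SparkConf exposes chained setters and a generic set method. The snippet below is only an illustrative sketch; the memory value and the two-thread local master are assumptions, not settings from the original program:

//////////////////////////////////////////

from pyspark import SparkConf

conf = (SparkConf()
        .setMaster("local[2]")                   # run locally with two worker threads
        .setAppName("CountTheWords")
        .set("spark.executor.memory", "1g"))     # example of overriding a configuration key

//////////////////////////////////////////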

////////////////

We use Python regular expressions here by calling re.compile. Notice that we import the regular expression library at the beginning of the script (import re). The expression

            return re.compile(r'\W+', re.UNICODE).split(text.lower())

instructs Python to split each line into words on any run of non-word characters, taking care of Unicode characters in the process. The text.lower() call converts all the words to lower case before the split.
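As a standalone illustration (using the first sentence of the test data), the same expression behaves like this; note the trailing empty string produced by the final full stop, which is why the program filters with if word: before printing:

//////////////////////////////////////////

import re

pattern = re.compile(r'\W+', re.UNICODE)
text = "This is Python and Spark blogs."
print(pattern.split(text.lower()))
# ['this', 'is', 'python', 'and', 'spark', 'blogs', '']

//////////////////////////////////////////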

words = input.flatMap(normalizeWords) – Here we use a flatMap. A flatMap can produce zero or more output records for each input record, so unlike a normal map, where the number of output rows equals the number of input rows, with flatMap the number of output rows can differ from (and here exceeds) the number of input rows. The output of the flatMap call is the RDD called words (see the sketch after the example below).

            E.g., an input row "How are you" produces the output below:

How

Are

You
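To make the contrast with a plain map concrete, here is a small illustrative sketch (the sample lines are made up, and sc is the SparkContext created earlier):

//////////////////////////////////////////

lines = sc.parallelize(["How are you", "I am fine"])

print(lines.map(lambda line: line.split()).collect())
# [['How', 'are', 'you'], ['I', 'am', 'fine']]   -> still 2 rows, each a list

print(lines.flatMap(lambda line: line.split()).collect())
# ['How', 'are', 'you', 'I', 'am', 'fine']       -> 6 rows, one word per row

//////////////////////////////////////////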

////////////////

wordCounts = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

We then take the words RDD and map each word to a (word, 1) pair. Since the keys are the words, reduceByKey then adds up all the 1s per key to produce the count (a runnable sketch follows the example below).

E.g., consider a words RDD containing

Hello

How

Are

Hello

 

Mapping each word to a (word, 1) pair produces

Hello    1

How     1

Are       1

Hello    1

And finally, after reduceByKey, the wordCounts RDD has

Hello    2

How     1

Are       1
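The same idea as a small runnable sketch (the sample words are made up, and sc is the SparkContext from the program); note that reduceByKey gives no guarantee about the order of the resulting pairs:

//////////////////////////////////////////

words = sc.parallelize(["hello", "how", "are", "hello"])

pairs = words.map(lambda w: (w, 1))                  # (word, 1) pairs
wordCounts = pairs.reduceByKey(lambda x, y: x + y)   # sum the 1s per word

print(wordCounts.collect())
# e.g. [('hello', 2), ('how', 1), ('are', 1)]  -- order may vary

//////////////////////////////////////////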

////////////////

wordCountsSorted = wordCounts.map(lambda (x,y): (y,x)).sortByKey()

results = wordCountsSorted.collect()

Since the keys are the words and the counts are the values, we flip each key/value pair and then sort by key, so the final ordering is by count. Our goal is to display the words sorted by their counts.
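Note that the tuple-unpacking form lambda (x, y): (y, x) is Python 2 only syntax. A Python 3 friendly sketch of the same swap-and-sort step (continuing from the wordCounts RDD in the previous sketch) indexes the tuple instead:

//////////////////////////////////////////

wordCountsSorted = wordCounts.map(lambda kv: (kv[1], kv[0])).sortByKey()

print(wordCountsSorted.collect())
# e.g. [(1, 'how'), (1, 'are'), (2, 'hello')]  -- ascending by count

//////////////////////////////////////////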

////////////////

for result in results:
    count = str(result[0])
    word = result[1].encode('ascii', 'ignore')
    if word:
        print word + ":\t\t" + count

The above lines simply print the output to standard output. Each result is a (count, word) tuple; the encode('ascii', 'ignore') call drops any non-ASCII characters, and the if word: check skips the empty strings produced by the regular expression split.
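For readers on Python 3, a sketch of an equivalent loop (there print is a function and the ASCII encoding step is unnecessary):

//////////////////////////////////////////

for count, word in results:        # each result is a (count, word) tuple
    if word:                       # skip the empty strings from the regex split
        print(word + ":\t\t" + str(count))

//////////////////////////////////////////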

The output of this program for the given test data is shown below:

/////////////////////////////////////////////////////////

and:            1

www:            1

as:             1

learning:               1

scale:          1

for:            1

interesting:            1

machine:                1

to:             1

going:          1

creating:               1

upcoming:               1

get:            1

processing:             1

applications:           1

let:            1

updates:                1

using:          1

also:           1

natural:                1

cognitive:              1

language:               1

checking:               1

cover:          1

us:             1

keep:           1

will:           1

theanalyticsuniverse:           1

com:            1

is:             2

in:             2

how:            2

more:           2

be:             2

used:           2

spark:          2

on:             2

see:            2

this:           2

blogs:          2

we:             3

could:          3

python:         4

////////////////////////////////////////////////////////////

 
