Category Archives: HADOOP

Apache Spark Streaming Deep Dive – Part2

In the earlier blog, Apache Spark Streaming Deep Dive Part 1, we saw the reference architecture of Spark Streaming and got an understanding of the building blocks needed to build streaming applications.

In this blog, we will look at a simple code example that leverages the Streaming Architecture.

Refer to the below link, for part1 of this series.

http://www.theanalyticsuniverse.com/apache-spark-streaming-deep-dive-part1

Problem Statement:

We shall look at a simple code that will have a micro batch window of 1 second. This stream of data would be an RDD of integers. Our streaming application will ingest this stream of integers every second (since we have defined a micro batch window of 1 second) and do some computation. In this example, we will be performing a modulo operation on the stream of numbers and then counting the number of occurrences of each of the modulo values.

The code explained:

Refer the below diagram to get a visual explanation of the code. Also, refer to the below code and the comments to see what function each line of the code does.

The Code

// Apache Code Streaming Example

// Import the necessary libraries

import org.apache.spark.streaming.StreamingContext

import org.apache.spark.streaming.StreamingContext._

import org.apache.spark.streaming.Seconds

// Use the Spark Context to define a Spark Streaming Object

var ssc = new StreamingContext(sc,Seconds(1))

// Import more libraries

import scala.collection.mutable.Queue

import org.apache.spark.rdd.RDD

// Define a new queue object called rddQueue, which is an RDD of integers

val rddQueue = new Queue[RDD[Int]]()

// Create the input queue Dstream and use it do some processing

val inputStream = ssc.queueStream(rddQueue)

// Apply Modulo function on the stream of RDDs

val mappedStream = inputStream.map(x => (x % 10, 1))

// Find the sum of all occurrences of each of the keys

val reducedStream = mappedStream.reduceByKey( (x,y) => x+y)

// Print the results

reducedStream.print()

//////////

////////////////////

// Start the streaming instance

ssc.start()

// simulate the data, 20 streams with a micro batch of 1 second

for (i <- 1 to 20) {

rddQueue.synchronized {

rddQueue += ssc.sparkContext.makeRDD(1 to 2000, 10)

}

Thread.sleep(1000)

}

// Stop the stream

ssc.stop()



spark_streaming_pic3

Watch the below video to see Apache Streaming in action

Pictoblog – Technical blogs could sometimes use some pretty photographs.. Well, the problem is, it is sometimes difficult to connect photography with Information technology. Nevertheless I wanted to share a photo with every blog, just as a means to share an instance of life on this earth….

This pic was taken in New Orleans

DSC_0221

Author: Abhik Roy