Monthly Archives: July 2016

Apache Spark Streaming Deep Dive – Part1

In this and the next couple of blog posts, we will be looking at Apache Spark Streaming. We will begin with the reference architecture of Streaming and get an understanding of the building blocks needed to build streaming applications. Then we will dive deeper and look at some code that accomplishes stream analytics.

What is Apache Spark Streaming?

In the simplest terms, Streaming lets you perform analytics on real-time data. By real-time data, I mean data that is arriving in streams. Historically, analytics has been performed on data at rest.

Refer to the picture below to get a visual feel for what real-time analytics is all about.




The Apache Spark Streaming Architecture

Refer to the diagram below to understand the architecture of Apache Spark Streaming. The Streaming architecture is built on top of the Spark architecture, so, as seen below, we continue to use the master node, where the driver program is initiated. The Spark Context is used to create a Streaming Context. Live data streams enter the processing infrastructure through one of the worker / executor nodes. The executor node where the data streams enter runs a special task called a long task (a receiver, in Spark's terminology). It is called a long task because it never ends: it continuously processes incoming streams of data. The incoming data is then converted into RDDs and distributed to the other executor nodes for processing.
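To make the driver-side steps a little more concrete, here is a minimal PySpark sketch of creating a Streaming Context from a Spark Context. It is only an illustration under some assumptions of mine: the function name, the socket source on localhost:9999, and the word-count logic are not part of the architecture described above, and it uses the classic DStream API, which needs PySpark installed to actually run.

```python
def run_streaming_word_count(host="localhost", port=9999, batch_seconds=60):
    """Count words arriving on a TCP socket, one micro batch at a time.

    Hypothetical example: host, port and the word count are assumptions
    chosen for illustration only.
    """
    # Imported here so the definition loads even where PySpark is absent.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    # local[2]: one thread for the long-running receiver task,
    # one thread for processing the batches.
    sc = SparkContext("local[2]", "StreamingSketch")

    # The Spark Context is used to create the Streaming Context.
    ssc = StreamingContext(sc, batch_seconds)

    # The receiver for this socket source runs as the long task.
    lines = ssc.socketTextStream(host, port)

    # Each micro batch is an RDD; these transformations run on every batch.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()             # start receiving and processing
    ssc.awaitTermination()  # block until the job is stopped
```

Calling run_streaming_word_count() would block until the job is stopped, so it needs something writing text to the chosen port, for example a tool such as netcat.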

Spark Streaming processes the RDDs in micro-batch intervals. For example, a micro-batch interval of 60 seconds means that it will collect all the data arriving over 60 seconds and then process that data set as a batch.
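The micro-batch idea can be sketched in plain Python, without Spark. This is a toy illustration, not how Spark implements it: the function name and the sample events are made up, and intervals with no data are simply skipped here.

```python
from collections import defaultdict


def group_into_micro_batches(events, batch_interval):
    """Group (timestamp_in_seconds, value) pairs into fixed-length batches.

    Returns a dict that maps a batch index to the values that arrived
    during that interval.
    """
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events 0-59s fall in batch 0, 60-119s in batch 1, and so on.
        batch_index = int(timestamp // batch_interval)
        batches[batch_index].append(value)
    return {index: batches[index] for index in sorted(batches)}


# Hypothetical events: (arrival time in seconds, payload).
events = [(5, "a"), (42, "b"), (61, "c"), (119, "d"), (130, "e")]
batches = group_into_micro_batches(events, batch_interval=60)

for index, values in batches.items():
    print(f"batch {index}: {values}")
```

With a 60-second interval, the first two events land in one batch, the next two in the second, and the last one in a third.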

The stream architecture is built on an enqueue and dequeue process: incoming data is queued as it arrives and dequeued for processing.
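As a rough picture of that queue and dequeue process, here is a small first-in, first-out sketch in plain Python. The names block_queue, receive and process_next are hypothetical, and summing a block stands in for whatever processing a real job would do; it says nothing about Spark's internal data structures.

```python
from collections import deque

# Blocks of data waiting to be processed, oldest first.
block_queue = deque()


def receive(block):
    """Receiver side: enqueue a newly arrived block of data."""
    block_queue.append(block)


def process_next():
    """Processing side: dequeue the oldest block and return its sum.

    Returns None when there is nothing waiting in the queue.
    """
    if not block_queue:
        return None
    block = block_queue.popleft()
    return sum(block)


receive([1, 2, 3])
receive([10, 20])

first = process_next()   # oldest block is processed first
second = process_next()
third = process_next()   # queue is empty by now
print(first, second, third)
```

The point of the design is that arrival and processing are decoupled: the receiver keeps adding blocks while the processing side works through them in arrival order.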


For an introduction to the Spark architecture, refer to my earlier blog post:


Taming Apache Spark with Python – Programming examples Part1


Use cases of Apache Spark – The list of use cases continues to grow: network fraud detection, credit fraud detection, sentiment analysis, detecting epidemics, and more.

In upcoming posts, we shall dive into writing some code and look at this technology hands-on.

Pictoblog – Technical blogs could sometimes use some pretty photographs. The problem is that it is sometimes difficult to connect photography with information technology. Nevertheless, I wanted to share a photo with every blog post, simply as a way to share an instance of life on this earth.

This pic was taken in Leh / Ladakh from my car. They were running on this rocky terrain with no fear of falling or getting hurt.



Abhik Roy