Understanding Data Types and Data Structures in Scala and Apache Spark

In this blog post, I provide a brief overview of the data types and data structures available in the Scala programming language, followed by the data representations specific to Apache Spark ML. Machine learning algorithms built with Apache Spark and its ML package use complex data structures such as vectors and arrays, so understanding them is vital if we want to build machine learning algorithms in Apache Spark.

Data Types in Scala

Like most programming languages, Scala provides a set of basic data types: Byte, Short, Int, Long, Float, Double, Char, String and Boolean.
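As a quick illustration of the types listed above, here is a minimal sketch that declares one value of each. The object name and the sample values are made up for this example.

```scala
object DataTypesDemo {
  val b: Byte = 127             // 8-bit signed integer
  val s: Short = 32767          // 16-bit signed integer
  val i: Int = 42               // 32-bit signed integer
  val l: Long = 9000000000L     // 64-bit signed integer (note the L suffix)
  val f: Float = 3.14f          // 32-bit floating point (note the f suffix)
  val d: Double = 3.14159       // 64-bit floating point
  val c: Char = 'A'             // a single character, in single quotes
  val str: String = "Spark"     // a sequence of characters, in double quotes
  val flag: Boolean = true      // true or false

  // Without a type annotation, Scala infers the type from the literal:
  val inferredInt = 42          // inferred as Int
  val inferredDouble = 4.2      // inferred as Double
}
```

Type annotations are optional in most cases; they are written out here only to make each type visible.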

Values

Values, declared with the val keyword, are like constants: once assigned, they cannot be changed, that is, they are immutable. Notice in the example below how Scala complains when we try to assign a new value to an already declared value.

pic1_spark_data_types
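In text form, the same idea can be sketched as follows. The names are hypothetical, and the offending line is left commented out because the code would not compile with it.

```scala
object ValuesDemo {
  val maxIterations = 10        // declared with val, so it is immutable

  // Uncommenting the next line makes compilation fail,
  // because the compiler rejects a reassignment to a val:
  // maxIterations = 20

  // To get a different number, declare a new value instead
  val doubledIterations = maxIterations * 2
}
```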

Variables

A variable, declared with the var keyword, is mutable, so we can assign new values to it during its lifetime. Notice how I assign new values to the variable.

pic2_spark_data_types
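A minimal sketch of the same behaviour, with made-up names: the value of a var can change, but its type cannot.

```scala
object VariablesDemo {
  def run(): Int = {
    var counter = 1       // declared with var, so it is mutable
    counter = 5           // reassignment is allowed
    counter += 1          // so is updating it in place
    // counter = "six"    // would not compile: counter stays an Int
    counter
  }
}
```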

Data Structures / Collections

Data structures, or collections, in Scala include Arrays, Lists, Tuples, and Maps.

Arrays

They hold a sequence of values of the same data type

They are mutable: elements can be updated in place, although the length is fixed

Indexing in an Array starts at 0

Here’s an example:

pic3_spark_data_types
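The three properties above can be sketched in a few lines. The array contents and names are illustrative only.

```scala
object ArraysDemo {
  // All elements share one type, here Double
  val scores: Array[Double] = Array(4.2, 3.1, 6.0)

  val first = scores(0)       // indexing starts at 0, so this reads 4.2
  scores(1) = 9.9             // elements can be overwritten in place
  val size = scores.length    // the length stays 3; an Array cannot grow or shrink
}
```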

Lists

They hold a sequence of values of the same data type

They are immutable

Indexing in a List starts at 0

pic4_spark_data_types
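Here is a small sketch of how a List behaves, using made-up feature names. Because a List is immutable, operations that appear to change it return a new list and leave the original untouched.

```scala
object ListsDemo {
  val features: List[String] = List("age", "income", "tenure")

  val first = features(0)                  // indexing starts at 0, so this is "age"
  // features(0) = "height"                // would not compile: a List cannot be updated in place

  val extended = "height" :: features      // prepending builds a new list
  val upper = features.map(_.toUpperCase)  // transformations also return new lists
}
```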

Tuples

Container of 1 or more values

Index starts from 1

Values can be of different data types

Notice how I create a tuple holding an Int, a String and a Boolean. Tuples are immutable, so I was also unable to change the value of an element in the tuple.

pic5_spark_data_types
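A minimal sketch of the tuple behaviour described above. The values and names are invented for illustration.

```scala
object TuplesDemo {
  val record = (42, "Spark", true)   // the type is (Int, String, Boolean)

  val id = record._1                 // accessors start at _1, not _0
  val name = record._2
  val active = record._3
  // record._1 = 7                   // would not compile: tuple elements cannot be reassigned

  // Destructuring unpacks all elements at once
  val (n, label, isOn) = record

  // copy returns a new tuple with one element replaced
  val updated = record.copy(_2 = "Scala")
}
```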

Map

Maps store key-value pairs, where each key is unique and maps to one value (the image below shows an example of a map).

pic7_spark_data_types
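In text form, a map can be sketched as follows. The keys and values are made up, and the sketch uses Scala's default Map, which is immutable.

```scala
object MapsDemo {
  val capitals: Map[String, String] = Map("France" -> "Paris", "Japan" -> "Tokyo")

  val paris = capitals("France")                         // lookup by key; throws if the key is absent
  val maybeRome = capitals.get("Italy")                  // safe lookup: returns None here
  val fallback = capitals.getOrElse("Italy", "unknown")  // lookup with a default

  // Adding a pair returns a new map; the original is unchanged
  val extended = capitals + ("Italy" -> "Rome")
}
```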

Data Representations specific to Apache Spark ML

Below are some special data representations that are used extensively in Apache Spark ML

Dense Vector

The density of a vector is determined by how many of its values are empty (zero). The fewer the empty values, the denser the vector. A vector can be represented in sparse or dense form, as shown below. Spark ML algorithms take their features as a vector, which can be in either form; the dense form simply lists every value, while the sparse form stores only the non-zero ones.

Dense Vector Representation

(4.2,3.1,6.0)

Sparse Vector Representation

Original : (2.6,0.0,0.0,3.1,0.0)

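Representation: (5, (0,3), (2.6,3.1)), that is, the size of the vector, the indices of the non-zero values, and the values at those indices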
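The two representations can be built with the Vectors factory from Spark's DataFrame-based ML package. This is a sketch that assumes Spark's MLlib is on the classpath (as it is inside spark-shell); the object and value names are illustrative.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

object VectorsDemo {
  // Dense form: every value is listed, zeros included
  val dense: Vector = Vectors.dense(2.6, 0.0, 0.0, 3.1, 0.0)

  // Sparse form: size, indices of the non-zero values, and those values
  val sparse: Vector = Vectors.sparse(5, Array(0, 3), Array(2.6, 3.1))

  // Both describe the same five numbers
  val sameValues = dense.toArray.sameElements(sparse.toArray)

  // A vector can be converted from one form to the other
  val denseAsSparse = dense.toSparse
}
```

The sparse form saves memory when most entries are zero, which is common for one-hot encoded or text features.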

Labeled Point Representation

Most of the ML packages expect input data to be in the form of a labeled point when we are building machine learning applications.

A labeled point contains a “label” (the target variable) and a list of “features” (the predictors). The list of features is represented as a vector, which is a dense vector in the example below.

Example

LabeledPoint(1.0, Vectors.dense(2.6, 0.0, 0.0, 3.1, 0.0))
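As a runnable version of the line above, here is a sketch that uses the LabeledPoint class from the DataFrame-based package org.apache.spark.ml.feature. It assumes Spark's MLlib is on the classpath, and the object and value names are made up.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors

object LabeledPointDemo {
  // The label 1.0 is the target; the vector holds the predictors
  val point = LabeledPoint(1.0, Vectors.dense(2.6, 0.0, 0.0, 3.1, 0.0))

  val label = point.label          // 1.0
  val features = point.features    // the feature vector

  // The features may equally be supplied in sparse form
  val sparsePoint = LabeledPoint(1.0, Vectors.sparse(5, Array(0, 3), Array(2.6, 3.1)))
}
```

The older RDD-based API has its own LabeledPoint in org.apache.spark.mllib.regression, which pairs with the Vectors factory in org.apache.spark.mllib.linalg rather than the one imported here.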



———————————————–

Author: Abhik Roy