Spark / R for Data Scientists

Course Details


Spark / R for Data Scientists is an intermediate-level course that acquaints data scientists with practical data science techniques on the Spark platform and prepares them to work with the other systems found in current, real-world data science environments.

The course helps data scientists transition effectively to an R/Spark/Hadoop environment by familiarizing them with the relevant tools and machine learning libraries. After completing the program, participants will be able to perform in an R or Spark environment the same statistical and machine learning analyses they previously conducted in SAS or similar environments.


Course Overview (0.5 hr lecture)

•   Data and problem set
•   Accessing the cluster, the data, and the tools
•   The Continuous Workshop approach
•   “Let’s build a model together”
•   Focus on analysis, exploration, data munging, algorithms
•   Tooling and fundamentals to get the job done

Spark Overview (1 hr lecture, 2 hr lab)

•   Data Science: The State of the Art
•   Hadoop, YARN, and Spark
•   Architectural Overview
•   MLlib Overview
•   Accessing HDFS data
•   Lab Focus
•   Working with HDFS data
•   Distributed vs. Local Run Modes
•   Spark vs. Other tools (when is Spark the right tool for the job?)
•   Spark vs. SAS
•   Spark Languages (Java, R, Python, and Scala)
•   Hello, Spark

Spark Overview (0.75 hr lecture, 1 hr lab)

•   Spark Core
•   Spark SQL
•   Spark and Hive
•   Lab
•   MLlib
•   Spark Streaming
•   Spark API

DataFrames (0.75 hr lecture, 1 hr lab)

•   DataFrames and Resilient Distributed Datasets (RDDs)
•   Partitions
•   Adding variables to a DataFrame
•   DataFrame Types
•   DataFrame Operations
•   Dependent vs. Independent variables
•   Map/Reduce with DataFrames

Spark SQL (0.5 hr lecture, 1-2 hr lab)

•   Spark SQL Overview
•   Data stores: HDFS, Cassandra, HBase, Hive, and S3
•   Table Definitions
•   Queries

Spark MLlib (0.5 hr lecture, 3 hr+ lab)

•   MLlib overview
•   MLlib Algorithms Overview
•   Classification Algorithms
•   Regression Algorithms
•   Lab Focus
•   Brief Comparison to SAS
•   Here’s your split, how to tune regression
•   Decision Trees and forests
•   Lab Focus
•   Brief Comparison to SAS
•   Stepwise approach to Decision Trees
•   Working with Exit Criteria
•   Recommendation with ALS
•   Clustering Algorithms
•   Lab Focus
•   Key Clustering Algorithms
•   Choosing Clustering Algorithms
•   Working with key algorithms
•   Machine Learning Pipelines
•   Linear Algebra (SVD, PCA)
•   Statistics in MLlib

Spark Streaming (0.25 hr lecture, 0 - 1 hr lab)

•   Streaming overview

Streaming with Kafka (0.25-0.5 hr lecture, 0 - 1 hr lab)

•   Kafka overview
•   Kafka and Spark Streaming

Data Flow with NiFi (0.25 hr lecture, 0 - 1 hr lab)

•   Apache NiFi overview
•   NiFi data flows with Spark/R

Cluster Mode (0.25 hr lecture, 0 - 0.5 hr lab)

•   Standalone Cluster
•   Masters and Workers

Spark - the Big Picture (0.5-1 hr lecture, 0 - 2 hr lab)

•   Spark in Real-Time and near-Real-Time Decision Support Systems
•   Spark in the Enterprise
•   Best Practices
