Spark / R for Data Scientists
Overview
Spark / R for Data Scientists is an intermediate-level course that acquaints data scientists with practical data science techniques on the Spark platform and prepares them to work with the other systems found in current, real-world data science environments.
The course helps data scientists transition effectively to an R/Spark/Hadoop environment by familiarizing them with the relevant tools and machine learning libraries. After completing the program, participants will be able to carry out in R or Spark the same kinds of statistical and machine learning analyses they previously conducted in SAS or similar environments.
Curriculum
Course Overview (0.5 hr lecture)
• Data and problem set
• Accessing the cluster, the data, and the tools
• The Continuous Workshop approach
• “Let’s build a model together”
• Focus on analysis, exploration, data munging, algorithms
• Tooling and fundamentals to get the job done
Spark Overview (1 hr lecture, 2 hr lab)
• Data Science: The State of the Art
• Hadoop, YARN, and Spark
• Architectural Overview
• MLlib Overview
• Accessing HDFS data
• Lab Focus
• Working with HDFS data
• Distributed vs. Local Run Modes
• Spark vs. Other tools (when is Spark the right tool for the job?)
• Spark vs. SAS
• Spark Languages (Java, R, Python, and Scala)
• Hello, Spark
Spark Overview (0.75 hr lecture, 1 hr lab)
• Spark Core
• Spark SQL
• Spark and Hive
• Lab
• MLlib
• Spark Streaming
• Spark API
DataFrames (0.75 hr lecture, 1 hr lab)
• DataFrames and Resilient Distributed Datasets (RDDs)
• Partitions
• Adding variables to a DataFrame
• DataFrame Types
• DataFrame Operations
• Dependent vs. Independent variables
• Map/Reduce with DataFrames
Spark SQL (0.5 hr lecture, 1-2 hr lab)
• Spark SQL Overview
• Data stores: HDFS, Cassandra, HBase, Hive, and S3
• Table Definitions
• Queries
Spark MLlib (0.5 hr lecture, 3 hr+ lab)
• MLlib overview
• MLlib Algorithms Overview
• Classification Algorithms
• Regression Algorithms
• Lab Focus
• Brief Comparison to SAS
• Here’s your split: how to tune regression
• Decision Trees and forests
• Lab Focus
• Brief Comparison to SAS
• Stepwise approach to Decision Trees
• Working with Exit Criteria
• Recommendation with ALS
• Clustering Algorithms
• Lab Focus
• Key Clustering Algorithms
• Choosing Clustering Algorithms
• Working with key algorithms
• Machine Learning Pipelines
• Linear Algebra (SVD, PCA)
• Statistics in MLlib
Spark Streaming (0.25 hr lecture, 0 - 1 hr lab)
• Streaming overview
Streaming with Kafka (0.25-0.5 hr lecture, 0 - 1 hr lab)
• Kafka overview
• Kafka and Spark Streaming
Data Flow with NiFi (0.25 hr lecture, 0 - 1 hr lab)
• Apache NiFi overview
• NiFi data flows with Spark/R
Cluster Mode (0.25 hr lecture, 0 - 0.5 hr lab)
• Standalone Cluster
• Masters and Workers
Spark - the Big Picture (0.5-1 hr lecture, 0 - 2 hr lab)
• Spark in Real-Time and near-Real-Time Decision Support Systems
• Spark in the Enterprise
• Best Practices