PySpark Training

Course Curriculum
Introduction to Big Data Hadoop and Spark
Learning Objective: In this module, you will understand Big Data, the limitations of existing solutions to the Big Data problem, how Hadoop solves it, Hadoop ecosystem components, Hadoop Architecture, HDFS, Rack Awareness, and Replication. You will learn about the Hadoop Cluster Architecture and the important configuration files in a Hadoop Cluster. You will also get an introduction to Spark, why it is used, and an understanding of the difference between batch processing and real-time processing.
Topics:
What is Big Data
Big Data Customer Scenarios
Limitations and Solutions of Existing Data Analytics Architecture with Uber Use Case
How Hadoop Solves the Big Data Problem
What is Hadoop
Hadoop’s Key Characteristics
Hadoop Ecosystem and HDFS
Hadoop Core Components
Rack Awareness and Block Replication
YARN and its Advantage
Hadoop Cluster and its Architecture
Hadoop: Different Cluster Modes
Big Data Analytics with Batch & Real-Time Processing
Why Spark is Needed
What is Spark
How Spark Differs from its Competitors
Spark at eBay
Spark’s Place in Hadoop Ecosystem

Introduction to Python for Apache Spark
Learning Objective: In this module, you will learn the basics of Python programming, the different types of sequence structures, their related operations, and their usage. You will also learn the diverse ways of opening, reading from, and writing to files.
Topics:
Overview of Python
Different Applications where Python is Used
Values, Types, Variables
Operands and Expressions
Conditional Statements
Loops
Command Line Arguments
Writing to the Screen
Python File I/O Functions
Numbers
Strings and related operations
Tuples and related operations
Lists and related operations
Dictionaries and related operations
Sets and related operations
Hands-On:
Creating “Hello World” code
Demonstrating Conditional Statements
Demonstrating Loops
Tuple - properties, related operations, compared with list
List - properties, related operations
Dictionary - properties, related operations
Set - properties, related operations
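
The sequence-type hands-on items above can be sketched in plain Python (the variable names are illustrative):

```python
# Tuple: immutable, fixed-order collection
point = (3, 4)

# List: mutable, supports append/slice/sort
scores = [70, 85, 60]
scores.append(90)
scores.sort()

# Dictionary: key-value lookup with insertion and update
ages = {"alice": 30, "bob": 25}
ages["carol"] = 35

# Set: unique elements, supports union/intersection
a = {1, 2, 3}
b = {2, 3, 4}
common = a & b  # intersection of the two sets
```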

Functions, OOPs, Modules, Errors and Exceptions in Python
Learning Objective: In this module, you will learn how to create generic Python scripts, how to address errors and exceptions in code, and finally how to extract and filter content using regex.
Topics:
Functions
Function Parameters
Global Variables
Variable Scope and Returning Values
Lambda Functions
Object-Oriented Concepts
Standard Libraries
Modules Used in Python
The Import Statements
Module Search Path
Package Installation Ways
Errors and Exception Handling
Handling Multiple Exceptions
Hands-On:
Functions - Syntax, Arguments, Keyword Arguments, Return Values
Lambda - Features, Syntax, Options, Compared with the Functions
Sorting - Sequences, Dictionaries, Limitations of Sorting
Errors and Exceptions - Types of Issues, Remediation
Packages and Module - Modules, Import Options, sys Path
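
The function, lambda, sorting, and exception-handling items above can be sketched in plain Python (all names are illustrative):

```python
# A generic function with a default keyword argument
def greet(name, greeting="Hello"):
    return f"{greeting}, {name}!"

# Lambda: anonymous single-expression function
square = lambda x: x * x

# Sorting a sequence of dicts using a lambda as the key
people = [{"name": "bob", "age": 25}, {"name": "alice", "age": 30}]
youngest_first = sorted(people, key=lambda p: p["age"])

# Handling multiple exception types in one handler
def safe_divide(a, b):
    try:
        return a / b
    except (ZeroDivisionError, TypeError):
        return None
```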

Deep Dive into Apache Spark Framework
Learning Objective: In this module, you will understand Apache Spark in depth, learn about its various components, and create and run Spark applications. At the end, you will learn how to perform data ingestion using Sqoop.
Topics:
Spark Components & its Architecture
Spark Deployment Modes
Introduction to PySpark Shell
Submitting PySpark Job
Spark Web UI
Writing your first PySpark Job Using Jupyter Notebook
Data Ingestion using Sqoop
Hands-On:
Building and Running Spark Application
Spark Application Web UI
Understanding different Spark Properties

Playing with Spark RDDs
Learning Objective: In this module, you will learn about Spark RDDs and the RDD-related manipulations used to implement business logic (Transformations, Actions, and Functions performed on RDDs).
Topics:
Challenges in Existing Computing Methods
Probable Solution & How RDD Solves the Problem
What is an RDD: Its Operations, Transformations & Actions
Data Loading and Saving Through RDDs
Key-Value Pair RDDs
Other Pair RDDs, Two Pair RDDs
RDD Lineage
RDD Persistence
WordCount Program Using RDD Concepts
RDD Partitioning & How it Helps Achieve Parallelization
Passing Functions to Spark
Hands-On:
Loading data in RDDs
Saving data through RDDs
RDD Transformations
RDD Actions and Functions
RDD Partitions
WordCount through RDDs

DataFrames and Spark SQL
Learning Objective: In this module, you will learn about Spark SQL, which is used to process structured data with SQL queries. You will learn about DataFrames and Datasets in Spark SQL, along with the different kinds of SQL operations performed on DataFrames. You will also learn about Spark and Hive integration.
Topics:
Need for Spark SQL
What is Spark SQL
Spark SQL Architecture
SQL Context in Spark SQL
Schema RDDs
User Defined Functions
Data Frames & Datasets
Interoperating with RDDs
JSON and Parquet File Formats
Loading Data through Different Sources
Spark-Hive Integration
Hands-On:
Spark SQL – Creating data frames
Loading and transforming data through different sources
Stock Market Analysis
Spark-Hive Integration

Machine Learning using Spark MLlib
Learning Objective: In this module, you will learn why machine learning is needed, the different Machine Learning techniques and algorithms, and their implementation using Spark MLlib.
Topics:
Why Machine Learning
What is Machine Learning
Where Machine Learning is used
Face Detection: USE CASE
Different Types of Machine Learning Techniques
Introduction to MLlib
Features of MLlib and MLlib Tools
Various ML algorithms supported by MLlib

Deep Dive into Spark MLlib
Learning Objective: In this module, you will implement various algorithms supported by MLlib, such as Linear Regression, Decision Tree, and Random Forest.
Topics:
Supervised Learning: Linear Regression, Logistic Regression, Decision Tree, Random Forest
Unsupervised Learning: K-Means Clustering & How It Works with MLlib
Hands-On:
K-Means Clustering
Linear Regression
Logistic Regression
Decision Tree
Random Forest

Understanding Apache Kafka and Apache Flume
Learning Objective: In this module, you will understand Kafka and the Kafka Architecture. Afterwards, you will go through the details of a Kafka Cluster and learn how to configure different types of Kafka Clusters. You will then see how messages are produced and consumed using the Kafka APIs in Java. You will also get an introduction to Apache Flume, its basic architecture, and how it is integrated with Apache Kafka for event processing. You will learn how to ingest streaming data using Flume.
Topics:
Need for Kafka
What is Kafka
Core Concepts of Kafka
Kafka Architecture
Where is Kafka Used
Understanding the Components of Kafka Cluster
Configuring Kafka Cluster
Kafka Producer and Consumer Java API
Need of Apache Flume
What is Apache Flume
Basic Flume Architecture
Flume Sources
Flume Sinks
Flume Channels
Flume Configuration
Integrating Apache Flume and Apache Kafka
Hands-On:
Configuring Single Node Single Broker Cluster
Configuring Single Node Multi-Broker Cluster
Producing and consuming messages through Kafka Java API
Flume Commands
Setting up Flume Agent
Streaming Twitter Data into HDFS
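
The agent-setup hands-on above boils down to a Flume properties file that wires a source to a sink through a channel. A minimal sketch (the agent name `agent1`, the netcat port, and the HDFS path are all illustrative):

```
# Name the components of agent1
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Netcat source listening for events on localhost:44444
agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444

# In-memory channel buffering events between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000

# HDFS sink writing events out to a directory
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events

# Bind the source and sink to the channel
agent1.sources.src1.channels = ch1
agent1.sinks.sink1.channel = ch1
```

An agent configured this way is started with the `flume-ng agent` command, pointing it at this file and the agent name.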

Apache Spark Streaming - Processing Multiple Batches
Learning Objective: In this module, you will work on Spark Streaming, which is used to build scalable, fault-tolerant streaming applications. You will learn about DStreams and the various transformations performed on streaming data. You will get to know commonly used streaming operators such as Sliding Window Operators and Stateful Operators.
Topics:
Drawbacks in Existing Computing Methods
Why Streaming is Necessary
What is Spark Streaming
Spark Streaming Features
Spark Streaming Workflow
How Uber Uses Streaming Data
Streaming Context & DStreams
Transformations on DStreams
Windowed Operators and Why They are Useful
Important Windowed Operators
Slice, Window and ReduceByWindow Operators
Stateful Operators
Hands-On:
WordCount Program using Spark Streaming
Twitter Sentiment Analysis Using Spark Streaming

Apache Spark Streaming - Data Sources
Learning Objective: In this module, you will learn about different streaming data sources such as Kafka and Flume. At the end of the module, you will be able to create a Spark Streaming application.
Topics:
Apache Spark Streaming: Data Sources
Streaming Data Source Overview
Apache Flume and Apache Kafka Data Sources
Example: Using a Kafka Direct Data Source
Creating Streaming Application from scratch
Hands-On:
Spark Streaming using a Kafka Direct Data Source
Creating Streaming Application from scratch