Apache Spark Interview Questions


What are the various levels of persistence in Apache Spark?

Apache Spark automatically persists some intermediate data from shuffle operations. However, users are advised to call persist() on any RDD they plan to reuse. Spark offers several persistence levels that store RDDs in memory, on disk, or in a combination of both, with different replication levels.
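A minimal sketch of choosing a persistence level is shown here. It assumes a local master ("local[2]") and an app name picked only for illustration; on RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistDemo {
  // Runs a tiny local job; master URL and app name are assumptions for this sketch.
  def run(): (Long, Int, Boolean) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("persist-demo"))
    try {
      val squares = sc.parallelize(1 to 100).map(n => n * n)
      // Keep partitions in memory and spill to disk if they do not fit.
      squares.persist(StorageLevel.MEMORY_AND_DISK)
      val count = squares.count() // first action computes and stores the RDD
      val max = squares.max()     // second action can reuse the stored partitions
      (count, max, squares.getStorageLevel == StorageLevel.MEMORY_AND_DISK)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Other levels such as MEMORY_ONLY, DISK_ONLY, or the replicated MEMORY_ONLY_2 can be passed to persist() in the same way.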

What is Shark?

Many data users know only SQL and are not comfortable with programming. Shark was a tool developed for people from a database background, giving them access to Scala MLlib capabilities through a Hive-like SQL interface. Shark let data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data. Shark has since been superseded by Spark SQL.

List some use cases where Spark outperforms Hadoop in processing?

  • Sensor Data Processing – Apache Spark’s in-memory computing works best here, as data is retrieved and combined from different sources.
  • Real-Time Querying – Spark is preferred over Hadoop for real-time querying of data.
  • Stream Processing – Spark is well suited to processing logs and detecting fraud in live streams for alerts.

What is a Sparse Vector?

A sparse vector has two parallel arrays, one for indices and the other for values. Only the non-zero entries are stored, which saves space.
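The two-parallel-arrays idea can be sketched in plain Scala. The SparseVec type below is a hypothetical illustration, not MLlib's own class; it only shows how indices and values line up.

```scala
// Hypothetical illustration type; this is not MLlib's sparse vector class.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "indices and values must be parallel arrays")

  // Expand to a dense array, filling every unlisted position with 0.0.
  def toDense: Array[Double] = {
    val dense = Array.fill(size)(0.0)
    indices.zip(values).foreach { case (i, v) => dense(i) = v }
    dense
  }
}

object SparseVecDemo {
  // The dense vector (0.0, 3.0, 0.0, 0.0, 7.5) stored as just two entries.
  val v: SparseVec = SparseVec(5, Array(1, 4), Array(3.0, 7.5))

  def main(args: Array[String]): Unit = println(v.toDense.mkString(", "))
}
```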

What is RDD?

RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are –

  • Immutable – RDDs cannot be altered.
  • Resilient – If a node holding a partition fails, the partition is recomputed on another node.

Explain about transformations and actions in the context of RDDs?

Transformations are functions that produce a new RDD. They are executed on demand: nothing is computed until an action is called. Some examples of transformations include map, filter and reduceByKey.

Actions trigger the RDD computations and return their results. After an action is performed, the data from the RDD moves back to the local machine. Some examples of actions include reduce, collect, first, and take.

What are the languages supported by Apache Spark for developing big data applications?

Scala, Java, Python and R. Clojure can also be used, through third-party bindings rather than an official API.

Can you use Spark to access and analyse data stored in Cassandra databases?

Yes, it is possible if you use the Spark Cassandra Connector.

Is it possible to run Apache Spark on Apache Mesos?

Yes, Apache Spark can be run on the hardware clusters managed by Mesos.

Explain about the different cluster managers in Apache Spark?

The three cluster managers supported in Apache Spark are:

  • YARN
  • Apache Mesos – Has rich resource scheduling capabilities and is well suited to running Spark along with other applications. It is advantageous when several users run interactive shells because it scales down the CPU allocation between commands.
  • Standalone deployments – Well suited for new deployments that run only Spark, and easy to set up.

How can Spark be connected to Apache Mesos?

To connect Spark with Mesos –

  • Configure the Spark driver program to connect to Mesos. The Spark binary package should be in a location accessible by Mesos. (or)
  • Install Apache Spark in the same location as that of Apache Mesos and configure the property ‘spark.mesos.executor.home’ to point to the location where it is installed.

How can you minimize data transfers when working with Spark?

Minimizing data transfers and avoiding shuffles helps you write Spark programs that run quickly and reliably. Data transfers can be minimized in the following ways:

  • Using Broadcast Variables – Broadcast variables enhance the efficiency of joins between small and large RDDs.
  • Using Accumulators – Accumulators help update the values of variables in parallel while executing.
  • Avoiding Shuffles – The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.

Why is there a need for broadcast variables when working with Apache Spark?

Broadcast variables are read-only variables kept in an in-memory cache on every machine. Using them eliminates the need to ship a copy of a variable with every task, so data can be processed faster. Broadcast variables help store a lookup table in memory, which makes retrieval more efficient than an RDD lookup().
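A lookup table shared through a broadcast variable might look like the sketch below. The local master, app name, and the sample users and country codes are all assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastDemo {
  def run(): List[(String, String)] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("broadcast-demo"))
    try {
      // Small lookup table, sent to each executor once instead of with every task.
      val countryNames = sc.broadcast(Map("IN" -> "India", "DE" -> "Germany"))
      val users = sc.parallelize(Seq(("asha", "IN"), ("jonas", "DE"), ("li", "CN")))
      users
        .map { case (user, code) =>
          // Tasks read the broadcast value locally through .value.
          (user, countryNames.value.getOrElse(code, "unknown"))
        }
        .collect()
        .toList
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```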

Is it possible to run Spark and Mesos along with Hadoop?

Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.

What is a lineage graph?

RDDs in Spark depend on one or more other RDDs. The representation of the dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persisted RDD is lost, the lost data can be recovered using the lineage graph.

How can you trigger automatic clean-ups in Spark to handle accumulated metadata?

You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.

Explain about the major libraries that constitute the Spark Ecosystem?

  • Spark MLlib – Machine learning library in Spark for commonly used learning algorithms like clustering, regression, classification, etc.
  • Spark Streaming – This library is used to process real time streaming data.
  • Spark GraphX – Spark API for graph parallel computations with basic operators like joinVertices, subgraph, aggregateMessages, etc.
  • Spark SQL – Helps execute SQL like queries on Spark data using standard visualization or BI tools.

What are the benefits of using Spark with Apache Mesos?

It renders scalable partitioning among various Spark instances and dynamic partitioning between Spark and other big data frameworks.

What is the significance of Sliding Window operation?

In networking, a sliding window controls the transmission of data packets between computers. In Spark, the Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that window are combined and operated upon to produce the RDDs of the windowed DStream.
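The window and slide semantics can be illustrated with plain Scala collections. This is only a sketch of the idea, not the Spark Streaming API (which exposes operations such as window and reduceByKeyAndWindow on DStreams); each inner Seq below stands in for the RDD of one batch interval, and the numbers are made up.

```scala
object SlidingWindowDemo {
  // Each inner Seq stands in for the RDD produced in one batch interval (an assumption).
  val batches: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5), Seq(6), Seq(7, 8))

  // Combine the batches that fall inside each window and sum their elements.
  // windowLength and slideInterval are both counted in batches.
  def windowedSums(windowLength: Int, slideInterval: Int): List[Int] =
    batches
      .sliding(windowLength, slideInterval)
      .map(window => window.flatten.sum)
      .toList

  def main(args: Array[String]): Unit =
    // Window of 3 batches, sliding forward by 2 batches each time.
    println(windowedSums(windowLength = 3, slideInterval = 2))
}
```

With a window of 3 batches and a slide of 2, the middle batch belongs to both windows, which is why window-based transformations count as stateful.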

What is a DStream?

A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations –

  • Transformations that produce a new DStream.
  • Output operations that write data to an external system.

When running Spark applications, is it necessary to install Spark on all the nodes of YARN cluster?

No. Spark need not be installed on every node when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.

What is the Catalyst framework?

Catalyst is the optimization framework in Spark SQL. It allows Spark to automatically transform SQL queries, and new optimizations can be added to it to build a faster processing system.

Name a few companies that use Apache Spark in production?

Pinterest, Conviva, Shopify, OpenTable

Which Spark library allows reliable file sharing at memory speed across different cluster frameworks?

Tachyon (since renamed Alluxio).

Why is BlinkDB used?

BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.

How can you compare Hadoop and Spark in terms of ease of use?

Hadoop MapReduce requires programming in Java, which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for languages such as Java, Python and Scala, and also includes Spark SQL (formerly Shark) for SQL users, making it comparatively easier to use than Hadoop.

What are the common mistakes developers make when running Spark applications?

Developers often make the mistake of –

  • Hitting the web service several times by using multiple clusters.
  • Running everything on the local node instead of distributing it.

Developers need to be careful with this, as Spark makes use of memory for processing.

What is the advantage of a Parquet file?

Parquet is a columnar file format that helps –

  • Limit I/O operations.
  • Consume less space.
  • Fetch only the required columns.

What are the various data sources available in SparkSQL?

  • Parquet file
  • JSON Datasets
  • Hive tables

How does Spark use Hadoop?

Spark has its own cluster management for computation and mainly uses Hadoop for storage.

What are the key features of Apache Spark that you like?

  • Spark provides advanced analytic options like graph algorithms, machine learning, streaming data, etc.
  • It has built-in APIs in multiple languages like Java, Scala, Python and R.
  • It can deliver good performance gains: Spark is reported to run an application in a Hadoop cluster up to 10 times faster on disk and up to 100 times faster in memory.

What do you understand by Pair RDD?

Spark provides special operations on RDDs of key/value pairs, and such RDDs are referred to as Pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates data by key and a join() method that combines different RDDs based on elements having the same key.
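The two Pair RDD methods mentioned above can be sketched as follows. The local master, app name, and the sample sales and colour data are assumptions for illustration only.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PairRddDemo {
  def run(): (Map[String, Int], Map[String, (Int, String)]) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("pair-rdd-demo"))
    try {
      val sales = sc.parallelize(Seq(("apple", 3), ("pear", 1), ("apple", 2)))
      val colours = sc.parallelize(Seq(("apple", "red"), ("pear", "green")))

      // reduceByKey merges all values that share a key.
      val totals = sales.reduceByKey(_ + _)
      // join pairs up elements from both RDDs that have the same key.
      val joined = totals.join(colours)

      (totals.collect().toMap, joined.collect().toMap)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```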

Which one will you choose for a project – Hadoop MapReduce or Apache Spark?

The answer depends on the project scenario. Spark makes use of memory instead of network and disk I/O, but it uses a large amount of RAM and requires dedicated machines to produce effective results. So the decision to use Hadoop or Spark depends on the requirements of the project and the budget of the organization.

Explain about the different types of transformations on DStreams?

  • Stateless Transformations – Processing of the batch does not depend on the output of the previous batch. Examples – map(), reduceByKey(), filter().
  • Stateful Transformations – Processing of the batch depends on the intermediary results of the previous batch. Examples – transformations that depend on sliding windows.

Explain about the popular use cases of Apache Spark?

Apache Spark is mainly used for

  • Iterative machine learning.
  • Interactive data analytics and processing.
  • Stream processing
  • Sensor data processing

Is Apache Spark a good fit for Reinforcement learning?

No. Apache Spark works well only for simple machine learning algorithms like clustering, regression, classification.

What is Spark Core?

It has all the basic functionalities of Spark, like – memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.

How can you remove the elements with a key present in any other RDD?

Use the subtractByKey() function.
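A short sketch of subtractByKey() follows, assuming a local master and made-up sample data. Only the keys of the second RDD matter; its values are ignored.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SubtractByKeyDemo {
  def run(): List[(Int, String)] = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("subtract-by-key-demo"))
    try {
      val all = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
      val blocked = sc.parallelize(Seq((2, "x")))
      // Keep only the pairs whose key does not appear in `blocked`.
      all.subtractByKey(blocked).collect().sortBy(_._1).toList
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```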

What is the difference between persist() and cache()?

persist() allows the user to specify the storage level, whereas cache() uses the default storage level.

Explain what is Scala?

Scala is an object-functional programming and scripting language for general software applications, designed to express solutions in a concise manner.

What is a ‘Scala set’? What are the methods through which set operations are expressed?

A Scala set is a collection of pairwise distinct elements of the same type; it does not contain any duplicate elements. There are two kinds of sets, mutable and immutable. Basic operations on sets are expressed through methods such as head, tail and isEmpty.
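A few set operations from the standard library, shown as a small sketch with made-up values:

```scala
import scala.collection.mutable

object SetDemo {
  val evens: Set[Int] = Set(2, 4, 6, 4) // the duplicate 4 is dropped
  val small: Set[Int] = Set(1, 2, 3)

  val union: Set[Int] = evens | small           // same as evens.union(small)
  val common: Set[Int] = evens & small          // same as evens.intersect(small)
  val onlyEvens: Set[Int] = evens.diff(small)   // elements of evens not in small

  // A mutable set is updated in place; adding an existing element changes nothing.
  def mutableExample(): mutable.Set[String] = {
    val seen = mutable.Set("spark")
    seen += "scala"
    seen += "spark"
    seen
  }

  def main(args: Array[String]): Unit =
    println((union, common, onlyEvens, mutableExample(), evens.isEmpty))
}
```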

What is a ‘Scala map’?

A Scala map is a collection of key/value pairs. Any value can be retrieved based on its key. Keys are unique in the map, but values need not be.
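A small sketch of looking up and updating an immutable map, using made-up entries:

```scala
object MapDemo {
  val capitals: Map[String, String] = Map("France" -> "Paris", "Japan" -> "Tokyo")

  // get returns an Option, so a missing key does not throw.
  val found: Option[String] = capitals.get("Japan")
  val missing: String = capitals.getOrElse("Peru", "unknown")

  // Adding an existing key replaces its value, so keys stay unique.
  val updated: Map[String, String] = capitals + ("France" -> "Lyon")

  def main(args: Array[String]): Unit = println((found, missing, updated))
}
```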

What is the advantage of Scala?

a) Less error-prone functional style

b) High maintainability and productivity

c) High scalability

d) High testability

e) Provides features of concurrent programming

In what ways is Scala better than other programming languages?

a) Arrays use regular generics, while in other languages generics are bolted on as an afterthought and are completely separate from arrays, yet overlap with them in behaviour.

b) Scala has immutable “val” as a first-class language feature. A Scala “val” is similar to a Java final variable: the contents may mutate, but the top-level reference is immutable.

c) Scala lets ‘if’ blocks, ‘for-yield’ loops, and code in braces return a value. This is preferable and eliminates the need for a separate ternary operator.

d) Scala has singleton objects rather than the classic static members of C++/Java/C#. It is a cleaner solution.

e) Persistent immutable collections are the default and are built into the standard library.

f) It has native tuples and concise code.

g) It has little boilerplate code.

What are the Scala variables?

Scala has two kinds of variables: values and variables. A value is constant and cannot be changed once assigned; it is immutable. A regular variable, on the other hand, is mutable, and you can change its value.

The two types of variables are declared as follows:

var myVar: Int = 0

val myVal: Int = 1

Mention the difference between an object and a class?

A class is a definition, a description. It defines a type in terms of methods and composition of other types. A class is a blueprint for objects. An object, in contrast, is a singleton: a unique instance of a class. An anonymous class is created for every object in the code, and it inherits from whatever classes you declared the object to implement.

What is tail recursion in Scala?

Recursion is when a function calls itself, either directly or indirectly (for example, function ‘A’ calls function ‘B’, which calls function ‘A’ again). It is a technique used frequently in functional programming. For a function to be tail recursive, the call back to the function must be the last operation it performs.
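A tail-recursive factorial can be sketched as below. The @tailrec annotation from the standard library asks the compiler to reject the method if the recursive call is not in tail position; the helper name loop is chosen only for this example.

```scala
import scala.annotation.tailrec

object TailRecDemo {
  def factorial(n: Int): BigInt = {
    // The recursive call is the last operation, so the compiler can turn it into a loop.
    @tailrec
    def loop(remaining: Int, acc: BigInt): BigInt =
      if (remaining <= 1) acc
      else loop(remaining - 1, acc * remaining)

    loop(n, BigInt(1))
  }

  def main(args: Array[String]): Unit = println(factorial(20))
}
```

The accumulator parameter carries the partial result, which is what lets the multiplication happen before the recursive call rather than after it.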

What is a ‘Scala trait’?

‘Traits’ are used to define object types specified by the signature of the supported methods. Scala allows traits to be partially implemented, but traits may not have constructor parameters (this restriction applies to Scala 2). A trait consists of method and field definitions, and it can be reused by mixing it into classes.
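A minimal sketch of a partially implemented trait mixed into a class. The names Greeter, Logger and Employee are hypothetical and exist only for this example.

```scala
// A partially implemented trait: one abstract member, one concrete method.
trait Greeter {
  def name: String
  def greet(): String = s"Hello, $name"
}

// A second, unrelated trait that can be mixed in alongside the first.
trait Logger {
  def log(msg: String): String = s"[log] $msg"
}

// The class supplies `name` and inherits behaviour from both traits.
class Employee(val name: String) extends Greeter with Logger

object TraitDemo {
  val ada: Employee = new Employee("Ada")

  def main(args: Array[String]): Unit = println(ada.log(ada.greet()))
}
```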

When can you use traits?

There is no specific rule for when to use traits, but there are guidelines you can consider.

a) If the behaviour will not be reused, make it a concrete class, since it is not reusable behaviour after all.

b) If you want to inherit from it in Java code, use an abstract class.

c) If efficiency is a priority, lean towards using a class.

d) If it might be reused in multiple, unrelated classes, make it a trait. Only traits can be mixed into different parts of the class hierarchy.

e) If you want to distribute it in compiled form and expect outside groups to write classes inheriting from it, use an abstract class.

What are case classes?

Case classes are regular classes that export their constructor parameters and provide a recursive decomposition mechanism via pattern matching. The constructor parameters of case classes can be accessed directly and are treated as public values.
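Recursive decomposition via pattern matching can be sketched with a tiny expression tree. The Expr, Num, Add and Mul types are hypothetical names made up for this illustration.

```scala
sealed trait Expr
final case class Num(value: Int) extends Expr
final case class Add(left: Expr, right: Expr) extends Expr
final case class Mul(left: Expr, right: Expr) extends Expr

object CaseClassDemo {
  // Each case pattern takes a node apart into its constructor parameters.
  def eval(e: Expr): Int = e match {
    case Num(v)    => v
    case Add(l, r) => eval(l) + eval(r)
    case Mul(l, r) => eval(l) * eval(r)
  }

  // (1 + 2) * 4
  val expr: Expr = Mul(Add(Num(1), Num(2)), Num(4))

  def main(args: Array[String]): Unit = println(eval(expr))
}
```

Constructor parameters are also readable as fields (Num(3).value), and two case class instances with equal parameters compare as equal.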

What is the use of tuples in Scala?

Scala tuples combine a fixed number of items together so that they can be passed around as a whole. A tuple is immutable and, unlike an array or list, can hold objects of different types.
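A brief sketch of building, reading and returning tuples, with made-up values:

```scala
object TupleDemo {
  // Three items of different types travel together as one value.
  val record: (String, Int, Boolean) = ("spark", 3, true)

  // Elements are read by position, or all at once by destructuring.
  val name: String = record._1
  val (tool, major, stable) = record

  // A tuple is a convenient way to return more than one result.
  def minMax(xs: Seq[Int]): (Int, Int) = (xs.min, xs.max)

  def main(args: Array[String]): Unit = println((name, major, minMax(Seq(4, 9, 1))))
}
```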

What is function currying in Scala?

Currying is the technique of transforming a function that takes multiple arguments into a chain of functions that each take a single argument. Scala supports many of the same techniques as languages like Haskell and Lisp; function currying is one of the least used and most misunderstood of them.
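The sketch below shows a method written in curried form and a two-argument function converted with the standard curried method. The names add, addCurried and addFive are chosen only for this example.

```scala
object CurryDemo {
  // Ordinary two-argument method.
  def add(a: Int, b: Int): Int = a + b

  // The same logic with one argument per parameter list.
  def addCurried(a: Int)(b: Int): Int = a + b

  // Supplying only the first argument yields a function of the second.
  val addFive: Int => Int = addCurried(5)

  // Function2 values have a `curried` method that performs the conversion.
  val addFn: (Int, Int) => Int = add
  val addAsCurried: Int => Int => Int = addFn.curried

  def main(args: Array[String]): Unit =
    println((addFive(10), addAsCurried(2)(3)))
}
```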

What are implicit parameters in Scala?

An implicit parameter is a way of allowing the parameters of a method to be “found”. It is similar to a default parameter, but it uses a different mechanism for finding the “default” value. An implicit parameter is a parameter to a method or constructor that is marked as implicit. This means that if a parameter value is not supplied, the compiler will search for an “implicit” value defined within scope.
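A minimal sketch of an implicit parameter being resolved from scope. Currency, price and defaultCurrency are hypothetical names used only for illustration.

```scala
object ImplicitDemo {
  final case class Currency(code: String)

  // The second parameter list is marked implicit.
  def price(amount: Int)(implicit currency: Currency): String =
    s"$amount ${currency.code}"

  // An implicit value in scope that the compiler can pick up.
  implicit val defaultCurrency: Currency = Currency("EUR")

  val implicitCall: String = price(10)                  // compiler supplies defaultCurrency
  val explicitCall: String = price(10)(Currency("USD")) // caller overrides it

  def main(args: Array[String]): Unit = println((implicitCall, explicitCall))
}
```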

What is a closure in Scala?

A closure is a function whose return value depends on the value of one or more variables declared outside the function.
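In the sketch below, multiplier closes over the outer variable factor, so its result changes when factor changes. The names are made up for this example.

```scala
object ClosureDemo {
  // Declared outside the function that uses it.
  var factor: Int = 3

  // `multiplier` captures `factor` itself, not a copy of its current value.
  val multiplier: Int => Int = (i: Int) => i * factor

  def demo(): (Int, Int) = {
    factor = 3
    val before = multiplier(10)
    factor = 5
    val after = multiplier(10)
    (before, after)
  }

  def main(args: Array[String]): Unit = println(demo())
}
```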

What is Monad in Scala?

A monad is an object that wraps another object. You pass the monad mini-programs, i.e. functions, to perform the data manipulation of the underlying object, instead of manipulating the object directly. The monad chooses how to apply the program to the underlying object.
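Option from the standard library is a common example of this wrapping behaviour. In the sketch below, the helper names parseInt, reciprocal and sumOf are assumptions made for illustration; Option decides whether each passed function runs at all.

```scala
import scala.util.Try

object MonadDemo {
  // Wrap a possibly failing parse in an Option.
  def parseInt(s: String): Option[Int] = Try(s.trim.toInt).toOption

  def reciprocal(n: Int): Option[Double] =
    if (n == 0) None else Some(1.0 / n)

  // flatMap hands the wrapped value to the next function only if it exists.
  def reciprocalOf(s: String): Option[Double] = parseInt(s).flatMap(reciprocal)

  // A for-comprehension is syntax for chained flatMap and map calls.
  def sumOf(a: String, b: String): Option[Int] =
    for {
      x <- parseInt(a)
      y <- parseInt(b)
    } yield x + y

  def main(args: Array[String]): Unit =
    println((reciprocalOf("4"), reciprocalOf("abc"), sumOf("1", "2")))
}
```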

What is Scala anonymous function?

In source code, anonymous functions are called ‘function literals’, and at run time function literals are instantiated into objects called function values. Scala provides a relatively easy syntax for defining anonymous functions.
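A few function literals, as a small sketch with made-up names:

```scala
object AnonFunDemo {
  // A function literal assigned to a value.
  val increment: Int => Int = (x: Int) => x + 1

  // Placeholder syntax: each underscore stands for one parameter.
  val add: (Int, Int) => Int = _ + _

  // A literal passed directly as an argument.
  val doubled: List[Int] = List(1, 2, 3).map(x => x * 2)

  def main(args: Array[String]): Unit = println((increment(1), add(2, 3), doubled))
}
```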

Explain ‘Scala higher order’ functions?

Scala allows the definition of higher-order functions. These are functions that take other functions as parameters, or whose result is a function. For example, an apply() function can take another function ‘f’ and a value ‘v’ and apply f to v.
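One way such an apply() function might be written is sketched here. The helpers layout and multiplyBy are assumptions added for illustration; multiplyBy shows the other case, a function whose result is a function.

```scala
object HigherOrderDemo {
  // Takes a function f and a value v, and applies f to v.
  def apply(f: Int => String, v: Int): String = f(v)

  // A helper to pass in: wraps any value in square brackets.
  def layout[A](x: A): String = "[" + x.toString + "]"

  // A higher-order function whose result is itself a function.
  def multiplyBy(factor: Int): Int => Int = (x: Int) => x * factor

  val result: String = apply(layout, 10)
  val triple: Int => Int = multiplyBy(3)

  def main(args: Array[String]): Unit = println((result, triple(7)))
}
```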

What are the key features of Apache Spark?

The following are the key features of Apache Spark:

  1. Polyglot
  2. Speed
  3. Multiple Format Support
  4. Lazy Evaluation
  5. Real Time Computation
  6. Hadoop Integration
  7. Machine Learning

Let us look at these features in detail:

  1. Polyglot: Spark provides high-level APIs in Java, Scala, Python and R. Spark code can be written in any of these four languages. It provides a shell in Scala and Python. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark from the installed directory.
  2. Speed: Spark runs up to 100 times faster than Hadoop MapReduce for large-scale data processing. Spark is able to achieve this speed through controlled partitioning. It manages data using partitions that help parallelize distributed data processing with minimal network traffic.
  3. Multiple Formats: Spark supports multiple data sources such as Parquet, JSON, Hive and Cassandra. The Data Sources API provides a pluggable mechanism for accessing structured data through Spark SQL. Data sources can be more than just simple pipes that convert data and pull it into Spark.
  4. Lazy Evaluation: Apache Spark delays its evaluation until it is absolutely necessary. This is one of the key factors contributing to its speed. Spark adds transformations to a DAG of computation, and this DAG is executed only when the driver requests some data.
  5. Real Time Computation: Spark’s computation is real-time and has low latency because of its in-memory computation. Spark is designed for massive scalability: the Spark team has documented users running production clusters with thousands of nodes, and the system supports several computational models.
  6. Hadoop Integration: Apache Spark provides smooth compatibility with Hadoop. This is a great boon for all the Big Data engineers who started their careers with Hadoop. Spark is a potential replacement for the MapReduce functions of Hadoop, and it can also run on top of an existing Hadoop cluster using YARN for resource scheduling.
  7. Machine Learning: Spark’s MLlib is the machine learning component, which is handy when it comes to big data processing. It removes the need to use multiple tools, one for processing and one for machine learning. Spark provides data engineers and data scientists with a powerful, unified engine that is both fast and easy to use.

What are the languages supported by Apache Spark and which is the most popular one?

Apache Spark supports the following four languages: Scala, Java, Python and R. Among these languages, Scala and Python have interactive shells for Spark. The Scala shell can be accessed through ./bin/spark-shell and the Python shell through ./bin/pyspark. Scala is the most widely used among them because Spark itself is written in Scala.

What are benefits of Spark over MapReduce?

Spark has the following benefits over MapReduce:

  1. Due to in-memory processing, Spark can process data around 10 to 100 times faster than Hadoop MapReduce, which uses persistent storage for all of its data processing tasks.
  2. Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop MapReduce, however, supports only batch processing.
  3. Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
  4. Spark is capable of performing computations multiple times on the same dataset, which is called iterative computation; Hadoop MapReduce has no built-in support for iterative computing.

What is YARN?

Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform to deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos, whereas Spark is a data processing tool. Spark can run on YARN the same way Hadoop MapReduce can run on YARN. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.

Do you need to install Spark on all nodes of a YARN cluster?

No, because Spark runs on top of YARN, independently of where it is installed. Spark has options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are some configurations for running on YARN. They include master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.

Is there any benefit of learning MapReduce if Spark is better than MapReduce?

Yes, MapReduce is a paradigm used by many big data tools, including Spark. It remains relevant as the data grows bigger and bigger. Tools like Pig and Hive convert their queries into MapReduce phases, so understanding MapReduce helps to optimize them better.

Explain the concept of Resilient Distributed Dataset (RDD)?

RDD stands for Resilient Distributed Dataset. An RDD is a fault-tolerant collection of elements that can be operated on in parallel. The partitioned data in an RDD is immutable and distributed in nature. There are primarily two types of RDD:

  1. Parallelized Collections: Existing collections from the driver program, distributed so that they can be operated on in parallel.
  2. Hadoop Datasets: Datasets created from files in HDFS or other storage systems, with functions performed on each file record.

RDDs are basically parts of data that are stored in memory, distributed across many nodes. RDDs are lazily evaluated in Spark, and this lazy evaluation contributes to Spark’s speed.

How do we create RDDs in Spark?

Spark provides two methods to create an RDD:

1. By parallelizing a collection in your driver program. This makes use of SparkContext’s ‘parallelize’ method:

val DataArray = Array(2,4,6,8,10)
val DataRDD = sc.parallelize(DataArray)

2. By loading an external dataset from external storage like HDFS, HBase, or a shared file system.

What is Executor Memory in a Spark application?

Every Spark application has the same fixed heap size and fixed number of cores for each of its executors. The heap size is what is referred to as the Spark executor memory, which is controlled with the spark.executor.memory property or the --executor-memory flag. Every Spark application will have one executor on each worker node. The executor memory is basically a measure of how much memory of the worker node the application will utilize.
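Setting these properties programmatically might look like the sketch below. The values "4g" and "2" and the app name are placeholders chosen for illustration, not recommendations.

```scala
import org.apache.spark.SparkConf

object ExecutorMemoryDemo {
  // Placeholder values; pick sizes that suit the worker nodes in your cluster.
  val conf: SparkConf = new SparkConf()
    .setAppName("executor-memory-demo")
    .set("spark.executor.memory", "4g") // heap size of each executor
    .set("spark.executor.cores", "2")   // cores used by each executor

  def main(args: Array[String]): Unit =
    println(conf.get("spark.executor.memory"))
}
```

The same settings can be given on the command line with the --executor-memory and --executor-cores flags of spark-submit.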

Define Partitions in Apache Spark?

As the name suggests, a partition is a smaller, logical division of data, similar to a ‘split’ in MapReduce. It is a logical chunk of a large distributed data set. Partitioning is the process of deriving logical units of data to speed up processing. Spark manages data using partitions that help parallelize distributed data processing with minimal network traffic for sending data between executors. By default, Spark tries to read data into an RDD from the nodes that are close to it. Since Spark usually accesses distributed partitioned data, it creates partitions to hold the data chunks and so optimize transformation operations. Every RDD in Spark is partitioned.

What operations does RDD support?

An RDD (Resilient Distributed Dataset) is the main logical data unit in Spark. An RDD is a distributed collection of objects. Distributed means that each RDD is divided into multiple partitions. Each of these partitions can reside in memory or be stored on the disk of different machines in a cluster. RDDs are immutable (read-only) data structures. You can’t change the original RDD, but you can always transform it into a different RDD with all the changes you want.

RDDs support two types of operations: transformations and actions.

Transformations: Transformations such as map, reduceByKey and filter create a new RDD from an existing RDD. Transformations are executed on demand, which means they are computed lazily.

Actions: Actions return the final results of RDD computations. An action triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.

What do you understand by Transformations in Spark?

Transformations are functions applied to an RDD that result in another RDD. A transformation does not execute until an action occurs. map() and filter() are examples of transformations: map() applies the function passed to it to each element of the RDD and results in another RDD, while filter() creates a new RDD by selecting the elements of the current RDD for which the function argument returns true.

val rawData=sc.textFile("path to/movies.txt")
val moviesData=rawData.map(x=>x.split("\t"))

As we can see here, rawData RDD is transformed into moviesData RDD. Transformations are lazily evaluated.

Define Actions in Spark?

An action brings data back from an RDD to the local machine. Executing an action triggers all the previously created transformations. An action uses the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.

reduce() is an action that applies the function passed to it again and again until only one value is left. take(n) is an action that brings the first n values from the RDD to the local node.

For example, calling moviesData.saveAsTextFile("MoviesData.txt") is an action that saves the moviesData RDD as text under the path MoviesData.txt.
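The reduce and take actions can be sketched on a small in-memory RDD. The local master, app name, and sample numbers are assumptions for illustration; nothing is computed until the first action runs.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ActionsDemo {
  def run(): (Int, List[Int], Long) = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("actions-demo"))
    try {
      val numbers = sc.parallelize(Seq(5, 1, 4, 2, 3))
      val doubled = numbers.map(_ * 2)       // transformation: only recorded, not run

      val total = doubled.reduce(_ + _)      // action: combines values until one is left
      val firstTwo = doubled.take(2).toList  // action: brings the first two values to the driver
      val howMany = doubled.count()          // action: counts the elements

      (total, firstTwo, howMany)
    } finally sc.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```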

Define functions of SparkCore?

Spark Core is the base engine for large-scale parallel and distributed data processing. The core is the distributed execution engine, and the Java, Scala, and Python APIs offer a platform for distributed ETL application development. Spark Core performs various important functions like memory management, monitoring jobs, fault tolerance, job scheduling and interaction with storage systems. Further, additional libraries built atop the core allow diverse workloads for streaming, SQL, and machine learning. It is responsible for:

  1. Memory management and fault recovery
  2. Scheduling, distributing and monitoring jobs on a cluster
  3. Interacting with storage systems

Name the components of Spark Ecosystem?

  1. Spark Core: Base engine for large-scale parallel and distributed data processing
  2. Spark Streaming: Used for processing real-time streaming data
  3. Spark SQL: Integrates relational processing with Spark’s functional programming API
  4. GraphX: Graphs and graph-parallel computation
  5. MLlib: Performs machine learning in Apache Spark

How is Streaming implemented in Spark? Explain with examples?

Spark Streaming is used for processing real-time streaming data, which makes it a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams. The fundamental stream unit is the DStream, which is basically a series of RDDs (Resilient Distributed Datasets) used to process the real-time data. The data from different sources like Flume and HDFS is streamed, processed, and finally pushed out to file systems, live dashboards and databases. It is similar to batch processing, as the input data is divided into batches.

Is there an API for implementing graphs in Spark?

GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD abstraction with the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.

The property graph is a directed multigraph, meaning it can have multiple edges in parallel. Every edge and vertex has user-defined properties associated with it. The parallel edges allow multiple relationships between the same vertices.

To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and mapReduceTriplets) as well as an optimized variant of the Pregel API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

What is PageRank in GraphX?

PageRank measures the importance of each vertex in a graph, assuming an edge from u to v represents an endorsement of v’s importance by u. For example, if a Twitter user is followed by many others, the user will be ranked highly.

GraphX comes with static and dynamic implementations of PageRank as methods on the PageRank Object. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphOps allows calling these algorithms directly as methods on Graph.

How is machine learning implemented in Spark?

MLlib is the scalable machine learning library provided by Spark. It aims to make machine learning easy and scalable, with common learning algorithms and use cases like clustering, regression, filtering, dimensionality reduction, and the like.

Is there a module to implement SQL in Spark? How does it work?

Spark SQL is a module in Spark that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with RDBMS, Spark SQL will be an easy transition from earlier tools, while letting you extend the boundaries of traditional relational data processing.

Further, Spark SQL provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
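Mixing SQL with ordinary Scala code might look like the sketch below. It assumes Spark 2.0 or later (for SparkSession), a local master, and a made-up "orders" data set registered as a temporary view.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlDemo {
  def run(): List[(String, Long)] = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("spark-sql-demo")
      .getOrCreate()
    try {
      import spark.implicits._

      // Build a DataFrame from a local collection and expose it to SQL.
      val orders = Seq(("books", 12), ("games", 30), ("books", 8)).toDF("category", "amount")
      orders.createOrReplaceTempView("orders")

      // The SQL result is a Dataset again, so it can be processed further in code.
      spark
        .sql("SELECT category, SUM(amount) AS total FROM orders GROUP BY category ORDER BY category")
        .as[(String, Long)]
        .collect()
        .toList
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```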

The following are the four libraries of Spark SQL.

  1. Data Source API
  2. DataFrame API
  3. Interpreter & Optimizer
  4. SQL Service

What is a Parquet file?

Parquet is a columnar file format supported by many data processing systems. Spark SQL performs both read and write operations on Parquet files and considers Parquet to be one of the best big data analytics formats so far.

The advantages of columnar storage are as follows:

  1. Columnar storage limits IO operations.
  2. It can fetch specific columns that you need to access.
  3. Columnar storage consumes less space.
  4. It gives better-summarized data and follows type-specific encoding.

How can Apache Spark be used alongside Hadoop?

Apache Spark is compatible with Hadoop, which makes for a very powerful combination of technologies. Using the two together lets Spark’s processing engine take advantage of Hadoop’s HDFS for storage and YARN for resource management.

Hadoop components can be used alongside Spark in the following ways:

  1. HDFS: Spark can run on top of HDFS to leverage the distributed replicated storage.
  2. MapReduce: Spark can be used along with MapReduce in the same Hadoop cluster or separately as a processing framework.
  3. YARN: Spark applications can also be run on YARN (Hadoop NextGen).
  4. Batch & Real Time Processing: MapReduce and Spark are used together where MapReduce is used for batch processing and Spark for real-time processing.

What is RDD Lineage?

By default, Spark does not replicate data in memory, so if any data is lost, it is rebuilt using RDD lineage. RDD lineage is the process that reconstructs lost data partitions. The best part is that an RDD always remembers how it was built from other datasets.
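
The idea can be sketched with a toy class in plain Python. This is an assumption-laden illustration, not Spark's implementation: `LineageDataset` and its methods are hypothetical names, and unlike Spark the `map` here computes eagerly so that there is something to lose.

```python
class LineageDataset:
    """Toy dataset that remembers its parent and the function applied to it."""

    def __init__(self, partitions, parent=None, func=None):
        self.partitions = partitions  # list of lists; None marks a lost partition
        self.parent = parent
        self.func = func

    def map(self, func):
        """Build a child dataset and record how it was derived."""
        computed = [[func(x) for x in part] for part in self.partitions]
        return LineageDataset(computed, parent=self, func=func)

    def lose(self, index):
        """Simulate a node failure that drops one partition."""
        self.partitions[index] = None

    def get(self, index):
        """Return a partition, recomputing it from the parent if it was lost."""
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost; nothing to recompute from")
            parent_part = self.parent.get(index)
            self.partitions[index] = [self.func(x) for x in parent_part]
        return self.partitions[index]


source = LineageDataset([[1, 2], [3, 4]])
doubled = source.map(lambda x: x * 2)
plus_one = doubled.map(lambda x: x + 1)

# Lose the same partition in both derived datasets, then ask for it again.
doubled.lose(1)
plus_one.lose(1)
recovered = plus_one.get(1)
```

Only the lost partition is recomputed, by walking back through the recorded parents until surviving data is found.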

What is Spark Driver?

The Spark Driver is the program that runs on the master node of the cluster and declares transformations and actions on RDDs. In simple terms, a driver in Spark creates a SparkContext connected to a given Spark Master. The driver also delivers the RDD graphs to the Master, where the standalone cluster manager runs.

What file systems does Spark support?

The following three file systems are supported by Spark:

  1. Hadoop Distributed File System (HDFS).
  2. Local File system.
  3. Amazon S3.

List the functions of Spark SQL?

Spark SQL is capable of:

  1. Loading data from a variety of structured sources.
  2. Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau.
  3. Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more.

What is Spark Executor?

When SparkContext connects to a cluster manager, it acquires executors on nodes in the cluster. Executors are Spark processes that run computations and store data on the worker nodes. SparkContext sends the final tasks to the executors to run.

Name types of Cluster Managers in Spark?

The Spark framework supports three major types of Cluster Managers:

  1. Standalone: A basic manager to set up a cluster.
  2. Apache Mesos: Generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications.
  3. YARN: Responsible for resource management in Hadoop.

What do you understand by worker node?

Worker node refers to any node that can run the application code in a cluster. The driver program must listen for and accept incoming connections from its executors and must be network addressable from the worker nodes.

A worker node is basically the slave node. The master node assigns work and the worker node actually performs the assigned tasks. Worker nodes process the data stored on the node and report their resources to the master. Based on resource availability, the master schedules tasks.

Illustrate some demerits of using Spark?

The following are some of the demerits of using Apache Spark:

  1. Since Spark utilizes more storage space compared to Hadoop and MapReduce, there may arise certain problems.
  2. Developers need to be careful while running their applications in Spark.
  3. Instead of running everything on a single node, the work must be distributed over multiple clusters.
  4. Spark’s “in-memory” capability can become a bottleneck when it comes to cost-efficient processing of big data.
  5. Spark consumes a large amount of memory when compared to Hadoop.

Can you use Spark to access and analyze data stored in Cassandra databases?

Yes, it is possible if you use the Spark Cassandra Connector. To connect Spark to a Cassandra cluster, the Cassandra Connector needs to be added to the Spark project. In this setup, a Spark executor talks to a local Cassandra node and queries only for local data. This makes queries faster by reducing the network traffic between Spark executors (which process the data) and Cassandra nodes (where the data lives).

What are broadcast variables?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Explain accumulators in Apache Spark?

Accumulators are variables that are only added to through an associative and commutative operation. They are used to implement counters or sums. Tracking accumulators in the UI can be useful for understanding the progress of running stages. Spark natively supports numeric accumulators. We can create named or unnamed accumulators.
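
Why the operation has to be associative and commutative can be shown without Spark. In the sketch below, which uses made-up data and a hypothetical `count_blank_lines` helper, each partition stands for one task that produces a partial count, and the partial counts are then merged in every possible order.

```python
from functools import reduce
from itertools import permutations


def count_blank_lines(partition):
    """Task-side work: each task only adds to its own local total."""
    return sum(1 for line in partition if not line.strip())


# Hypothetical input, already split into three partitions.
partitions = [
    ["a", "", "b"],
    ["", ""],
    ["c"],
]

partial_counts = [count_blank_lines(p) for p in partitions]

# Tasks can finish in any order, so merge the partial counts in every order.
merged_in_every_order = {
    reduce(lambda acc, x: acc + x, order, 0)
    for order in permutations(partial_counts)
}
blank_lines = next(iter(merged_in_every_order))
```

Because addition is associative and commutative, every merge order yields the same total, so the set contains a single value.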

Explain Caching in Spark Streaming?

DStreams allow developers to cache/persist the stream’s data in memory. This is useful if the data in the DStream will be computed multiple times. This can be done using the persist() method on a DStream. For input streams that receive data over the network (such as Kafka, Flume, Sockets, etc.), the default persistence level is set to replicate the data to two nodes for fault-tolerance.

Does Apache Spark provide checkpoints?

Checkpoints in Spark are similar to checkpoints in gaming. They let an application run around the clock and make it resilient to failures unrelated to the application logic.

Lineage graphs can always be used to recover RDDs from a failure, but recovery is generally time-consuming if the RDDs have long lineage chains. Spark has an API for checkpointing, i.e. a REPLICATE flag to persist. However, the user decides which data to checkpoint. Checkpoints are useful when the lineage graphs are long and have wide dependencies.

How does Spark use Akka?

Spark uses Akka basically for scheduling. After registering, each worker requests a task from the master, and the master simply assigns one. Spark uses Akka for this messaging between the workers and the master.

What do you understand by Lazy Evaluation?

Spark is intelligent in the way it operates on data. When you tell Spark to operate on a given dataset, it takes note of the instructions so that it does not forget them, but it does nothing until asked for the final result. When a transformation like map() is called on an RDD, the operation is not performed immediately. Transformations in Spark are not evaluated until you perform an action. This helps optimize the overall data processing workflow.
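
The behaviour can be imitated in a few lines of plain Python. This is a minimal sketch, not how Spark is implemented: `LazyDataset` is a hypothetical class whose `map` and `filter` only record a step, while `collect` plays the role of the action that finally runs them.

```python
class LazyDataset:
    """Toy dataset whose transformations are recorded, not executed."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps

    def map(self, func):
        # Transformation: remember the step and return a new dataset.
        return LazyDataset(self._data, self._steps + (("map", func),))

    def filter(self, pred):
        # Transformation: remember the step and return a new dataset.
        return LazyDataset(self._data, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        result = []
        for item in self._data:
            keep = True
            for kind, func in self._steps:
                if kind == "map":
                    item = func(item)
                elif not func(item):
                    keep = False
                    break
            if keep:
                result.append(item)
        return result


calls = []


def square(x):
    calls.append(x)  # record every time the function really runs
    return x * x


pipeline = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x > 4)
calls_before_action = len(calls)  # nothing has run yet
result = pipeline.collect()
```

Because nothing runs until `collect`, the map and filter steps are applied together in a single pass over the data, which is the kind of optimization that deferring work makes possible.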

What do you understand by SchemaRDD in Apache Spark RDD?

SchemaRDD is an RDD that consists of row objects (wrappers around the basic string or integer arrays) with schema information about the type of data in each column.

SchemaRDD was designed as an attempt to make life easier for developers in their daily routines of code debugging and unit testing on SparkSQL core module. The idea can boil down to describing the data structures inside RDD using a formal description similar to the relational database schema. On top of all basic functions provided by common RDD APIs, SchemaRDD also provides some straightforward relational query interface functions that are realized through SparkSQL.

SchemaRDD has since been officially renamed to the DataFrame API.
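
As a rough illustration of rows plus schema, here is a plain-Python sketch. It is not the SchemaRDD or DataFrame API; the schema, the sample rows and the `validate`, `select` and `where` helpers are all hypothetical.

```python
# Schema: column names with the type expected in each column.
schema = [("name", str), ("age", int)]

# Row objects: plain tuples whose positions follow the schema.
rows = [("Ann", 34), ("Bob", 27), ("Cy", 41)]


def validate(row, schema):
    """Check that a row has the right width and column types."""
    return len(row) == len(schema) and all(
        isinstance(value, col_type) for value, (_, col_type) in zip(row, schema)
    )


def select(rows, schema, column):
    """Relational-style projection of a single column by name."""
    index = [name for name, _ in schema].index(column)
    return [row[index] for row in rows]


def where(rows, schema, column, predicate):
    """Relational-style filter on a single column by name."""
    index = [name for name, _ in schema].index(column)
    return [row for row in rows if predicate(row[index])]


all_valid = all(validate(row, schema) for row in rows)
over_30 = select(where(rows, schema, "age", lambda age: age > 30), schema, "name")
```

Knowing the schema is what allows columns to be addressed by name and checked by type, instead of treating each row as an opaque object.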

How is Spark SQL different from HQL and SQL?

Spark SQL is a special component on the Spark Core engine that supports SQL and Hive Query Language without changing any syntax. It is possible to join SQL tables and HQL tables in Spark SQL.

Explain a scenario where you will be using Spark Streaming?

When it comes to Spark Streaming, the data is streamed in real-time onto our Spark program.

Twitter Sentiment Analysis is a real-life use case of Spark Streaming. Trending Topics can be used to create campaigns and attract a larger audience. It helps in crisis management, service adjusting and target marketing.

Sentiment refers to the emotion behind a social media mention online. Sentiment Analysis is categorizing the tweets related to a particular topic and performing data mining using Sentiment Automation Analytics Tools.

Spark Streaming can be used to gather live tweets from around the world into the Spark program. This stream can be filtered using Spark SQL and then we can filter tweets based on the sentiment. The filtering logic will be implemented using MLlib where we can learn from the emotions of the public and change our filtering scale accordingly.

What is Immutable?

Once a value has been created and assigned, it cannot be changed; this property is called immutability. Spark’s RDDs are immutable by default: they do not allow updates or modifications. Note that the data collection is not immutable, but the data values are.

What is Distributed?

An RDD automatically distributes its data across different parallel computing nodes.

What is Cacheable?

Spark keeps all the data in memory for computation rather than going to disk, so it can process cached data up to 100 times faster than Hadoop.

What is the Spark engine’s responsibility?

The Spark engine is responsible for scheduling, distributing, and monitoring the application across the cluster.

How does Spark partition the data?

Spark uses the MapReduce API to partition the data. The number of partitions can be set through the input format. By default, the partition size equals the HDFS block size (for best performance), but it is possible to change the partition size, as with a split.
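
The relationship between file size, split size and partition count is simple arithmetic, sketched below in plain Python. The 300 MB file and the 128 MB and 64 MB split sizes are assumed values chosen for the example, `split_ranges` is a hypothetical helper, and real input formats apply extra rules that this simplification ignores.

```python
import math


def split_ranges(file_size, split_size):
    """Return (start, length) byte ranges, one per input split."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    count = math.ceil(file_size / split_size)
    return [
        (i * split_size, min(split_size, file_size - i * split_size))
        for i in range(count)
    ]


MB = 1024 * 1024

# Assumed block size of 128 MB: one partition per block.
default_splits = split_ranges(300 * MB, 128 * MB)

# A smaller split size yields more, smaller partitions for the same file.
smaller_splits = split_ranges(300 * MB, 64 * MB)
```

With these numbers the file is covered by three splits at 128 MB and five at 64 MB, with the last split in each case holding the 44 MB remainder.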

How does Spark store the data?

Spark is a processing engine; it has no storage engine of its own. It can retrieve data from any storage engine, such as HDFS, S3, and other data sources.

What are the functionalities of Spark Core?