Azure Data Factory Interview Questions and Answers

1. What is Azure Data Factory?

Ans: Azure Data Factory is a cloud-based data integration service that lets you create data-driven workflows in the cloud for orchestrating and automating data movement and data transformation.

  • Using Azure Data Factory, you can create and schedule data-driven workflows (called pipelines) that can ingest data from disparate data stores.
  • It can process and transform the data by using compute services such as HDInsight Hadoop, Spark, Azure Data Lake Analytics, and Azure Machine Learning.
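
For illustration, here is a minimal sketch (not from the original article) of creating a data factory with the Python management SDK (azure-mgmt-datafactory); the subscription ID, resource group, factory name, and region are placeholder assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholder values -- substitute your own subscription, resource group, and names.
subscription_id = "<subscription-id>"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Create (or update) the data factory that will host the pipelines described above.
factory = adf_client.factories.create_or_update(
    "my-resource-group", "my-data-factory", Factory(location="eastus"))
print(factory.provisioning_state)
```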

2. Why do we need Azure Data Factory?

Ans:

  • The amount of data generated these days is huge, and it comes from many different sources. When we move this data to the cloud, a few things need to be taken care of.
  • Data can arrive in any form, because the different sources transfer or channel it in different ways and in different formats. When we bring this data to the cloud or to a particular storage location, we need to make sure it is well managed: we need to transform it and delete the unnecessary parts. As far as moving the data is concerned, we need to pick it up from the different sources, bring it to one common place, store it, and, if required, transform it into something more meaningful.
  • A traditional data warehouse can do this as well, but it has certain disadvantages. Sometimes we are forced to build custom applications that deal with each of these processes individually, which is time-consuming, and integrating all of these sources is a huge pain. We need a way to automate this process or create proper workflows.
  • Data Factory helps orchestrate this complete process in a more manageable and organized manner.

3. How does Azure Data Factory process and filter files for processing?

Ans: Consider a pipeline that processes some files and, in some cases, “groups” of files, meaning the files should be processed together and are correlated by a timestamp, e.g. Timestamp#Customer.

4. How do I get the changed rows incrementally using Change Tracking in Azure Data Factory?

Ans: Consider a scenario with SQL as the source and an Azure SQL database as the destination, where only the changed rows need to be copied to the destination using the Change Tracking approach, running this incrementally using Azure …

5. How do you append a large number of CSV files that have different schemas in Azure Data Factory?

Ans: Suppose 500 CSV files are uploaded to an Azure storage container. These files use 4 different schemas, meaning that they have a few differing columns while some columns are common across all files. We are …

6. What are the new features for SSIS in Data Factory?

Ans: Since the initial public preview release in 2017, Data Factory has added the following features for SSIS:

  • Support for three more configurations/variants of Azure SQL Database to host the SSIS database (SSISDB) of projects/packages:
    • SQL Database with virtual network service endpoints
    • SQL Managed Instance
    • Elastic pool
  • Support for an Azure Resource Manager virtual network in addition to the classic virtual network (which will be deprecated in the future), letting you inject/join your Azure-SSIS integration runtime into a virtual network configured for SQL Database with virtual network service endpoints, SQL Managed Instance, or on-premises data access. For more information, see also Join an Azure-SSIS integration runtime to a virtual network.
  • Support for Azure Active Directory (Azure AD) authentication and SQL authentication to connect to the SSISDB, allowing Azure AD authentication with your Data Factory managed identity for Azure resources
  • Support for bringing your existing SQL Server license to earn substantial cost savings from the Azure Hybrid Benefit option
  • Support for Enterprise Edition of the Azure-SSIS integration runtime that lets you use advanced/premium features, a custom setup interface to install additional components/extensions, and a partner ecosystem. For more information, see also Enterprise Edition, Custom Setup, and 3rd Party Extensibility for SSIS in ADF.
  • Deeper integration of SSIS in Data Factory that lets you invoke/trigger first-class Execute SSIS Package activities in Data Factory pipelines and schedule them via SSMS. For more information, see also Modernize and extend your ETL/ELT workflows with SSIS activities in ADF pipelines.

7. What are the top-level concepts of Azure Data Factory?

Ans: An Azure subscription can have one or more Azure Data Factory instances (or data factories). Azure Data Factory contains four key components that work together as a platform on which you can compose data-driven workflows with steps to move and transform data.

Pipelines

A data factory can have one or more pipelines. A pipeline is a logical grouping of activities to perform a unit of work. Together, the activities in a pipeline perform a task. For example, a pipeline can contain a group of activities that ingest data from an Azure blob and then run a Hive query on an HDInsight cluster to partition the data. The benefit is that you can use a pipeline to manage the activities as a set instead of having to manage each activity individually. You can chain together the activities in a pipeline to operate them sequentially, or you can operate them independently, in parallel.
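
A hedged sketch of this grouping and chaining with the Python SDK follows; the Wait activities are placeholders for real work, and all names are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, WaitActivity, ActivityDependency)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

step1 = WaitActivity(name="IngestStep", wait_time_in_seconds=1)
step2 = WaitActivity(
    name="TransformStep", wait_time_in_seconds=1,
    # Chain: run only after IngestStep succeeds. Omit depends_on to run in parallel.
    depends_on=[ActivityDependency(activity="IngestStep",
                                   dependency_conditions=["Succeeded"])])

# The pipeline groups both activities so they can be managed and run as a set.
adf_client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "DemoPipeline",
    PipelineResource(activities=[step1, step2]))
```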

Data flows

Data flows are objects that you build visually in Data Factory which transform data at scale on backend Spark services. You do not need to understand programming or Spark internals. Just design your data transformation intent using graphs (Mapping) or spreadsheets (Wrangling).

Activities

An activity represents a processing step in a pipeline. For example, you can use a Copy activity to copy data from one data store to another data store. Similarly, you can use a Hive activity, which runs a Hive query on an Azure HDInsight cluster, to transform or analyze your data. Data Factory supports three types of activities: data movement activities, data transformation activities, and control activities.
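
A minimal sketch of a data movement activity (a Copy activity) built with the Python SDK models; the dataset names are assumptions and would need to exist in the factory.

```python
from azure.mgmt.datafactory.models import (
    CopyActivity, DatasetReference, BlobSource, BlobSink)

# Copy from one blob dataset to another; the referenced datasets are assumed to exist.
copy_step = CopyActivity(
    name="CopyRawToStaging",
    inputs=[DatasetReference(type="DatasetReference", reference_name="RawBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="StagingBlobDataset")],
    source=BlobSource(),   # read from Blob storage
    sink=BlobSink())       # write to Blob storage
# copy_step can now be placed in PipelineResource(activities=[...]) as shown earlier.
```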

Datasets

Datasets represent data structures within the data stores, which simply point to or reference the data you want to use in your activities as inputs or outputs.
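
For example, a minimal Azure blob dataset sketch; the linked service name, container, and file are assumptions, and depending on SDK version the explicit type argument on the reference may be optional.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    DatasetResource, AzureBlobDataset, LinkedServiceReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The dataset only points at data; the connection itself lives in the linked service.
blob_dataset = DatasetResource(properties=AzureBlobDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="MyStorageLinkedService"),
    folder_path="input-container/raw",
    file_name="cars.csv"))

adf_client.datasets.create_or_update(
    "my-resource-group", "my-data-factory", "RawBlobDataset", blob_dataset)
```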

Linked services

Linked services are much like connection strings, which define the connection information needed for Data Factory to connect to external resources. Think of it this way: A linked service defines the connection to the data source, and a dataset represents the structure of the data. For example, an Azure Storage linked service specifies the connection string to connect to the Azure Storage account. And an Azure blob dataset specifies the blob container and the folder that contains the data.

Linked services have two purposes in Data Factory:

  • To represent a data store that includes, but is not limited to, a SQL Server instance, an Oracle database instance, a file share, or an Azure Blob storage account. For a list of supported data stores, see Copy Activity in Azure Data Factory.
  • To represent a compute resource that can host the execution of an activity. For example, the HDInsight Hive activity runs on an HDInsight Hadoop cluster. For a list of transformation activities and supported compute environments, see Transform data in Azure Data Factory.
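
A hedged sketch of a data-store linked service (Azure Storage); the connection string is a placeholder, and compute linked services (for example HDInsight) follow the same pattern with their own model classes.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, AzureStorageLinkedService, SecureString)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# The linked service holds the connection information (here, a storage connection string).
storage_ls = LinkedServiceResource(properties=AzureStorageLinkedService(
    connection_string=SecureString(
        value="DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>")))

adf_client.linked_services.create_or_update(
    "my-resource-group", "my-data-factory", "MyStorageLinkedService", storage_ls)
```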

Triggers

Triggers represent units of processing that determine when a pipeline execution is kicked off. There are different types of triggers for different types of events.
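
A minimal schedule-trigger sketch (there are also event-based and tumbling-window triggers); the pipeline name and cadence are assumptions, and on older SDK versions the start call may be `triggers.start` rather than `begin_start`.

```python
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off DemoPipeline every hour, starting shortly after the trigger is created.
recurrence = ScheduleTriggerRecurrence(
    frequency="Hour", interval=1, time_zone="UTC",
    start_time=datetime.utcnow() + timedelta(minutes=15))
trigger = ScheduleTrigger(
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(
            type="PipelineReference", reference_name="DemoPipeline"))])

adf_client.triggers.create_or_update(
    "my-resource-group", "my-data-factory", "HourlyTrigger",
    TriggerResource(properties=trigger))
adf_client.triggers.begin_start("my-resource-group", "my-data-factory", "HourlyTrigger")
```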

Pipeline runs

A pipeline run is an instance of a pipeline execution. You usually instantiate a pipeline run by passing arguments to the parameters that are defined in the pipeline. You can pass the arguments manually or within the trigger definition.
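
A minimal sketch of instantiating a pipeline run on demand and checking its status; the pipeline and parameter names are assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Instantiate a run, passing arguments for the parameters defined on the pipeline.
run = adf_client.pipelines.create_run(
    "my-resource-group", "my-data-factory", "DemoPipeline",
    parameters={"inputPath": "input-container/raw"})

# Poll the run for its status (Queued, InProgress, Succeeded, Failed, ...).
pipeline_run = adf_client.pipeline_runs.get(
    "my-resource-group", "my-data-factory", run.run_id)
print(pipeline_run.status)
```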

Parameters

Parameters are key-value pairs in a read-only configuration. You define parameters in a pipeline, and you pass the arguments for the defined parameters during execution from a run context. The run context is created by a trigger or from a pipeline that you execute manually. Activities within the pipeline consume the parameter values.

A dataset is a strongly typed parameter and an entity that you can reuse or reference. An activity can reference datasets, and it can consume the properties that are defined in the dataset definition.

A linked service is also a strongly typed parameter that contains connection information to either a data store or a compute environment. It’s also an entity that you can reuse or reference.
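
A hedged sketch of declaring a pipeline parameter and consuming it from an activity via an expression; the parameter name, the Web activity, and the URL are all assumptions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, WebActivity)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Declare a string parameter on the pipeline and consume it in an activity expression.
pipeline = PipelineResource(
    parameters={"endpoint": ParameterSpecification(
        type="String", default_value="https://example.com/ping")},
    activities=[WebActivity(name="CallEndpoint", method="GET",
                            url="@pipeline().parameters.endpoint")])
adf_client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "ParamPipeline", pipeline)

# Pass an argument for the parameter when the run is created (the run context).
adf_client.pipelines.create_run(
    "my-resource-group", "my-data-factory", "ParamPipeline",
    parameters={"endpoint": "https://example.com/other"})
```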

Control flows

Control flows orchestrate pipeline activities that include chaining activities in a sequence, branching, parameters that you define at the pipeline level, and arguments that you pass as you invoke the pipeline on demand or from a trigger. Control flows also include custom state passing and looping containers (that is, foreach iterators).
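
A hedged sketch of a looping container: a ForEach activity iterating over an array parameter, running a placeholder Wait activity per item; the names, and the explicit type argument on Expression, are assumptions that may vary by SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    PipelineResource, ParameterSpecification, ForEachActivity,
    Expression, WaitActivity)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Loop over the "files" array parameter; iterations run in parallel unless is_sequential=True.
loop = ForEachActivity(
    name="ForEachFile",
    items=Expression(type="Expression", value="@pipeline().parameters.files"),
    is_sequential=False,
    activities=[WaitActivity(name="ProcessOneFile", wait_time_in_seconds=1)])

adf_client.pipelines.create_or_update(
    "my-resource-group", "my-data-factory", "LoopPipeline",
    PipelineResource(
        parameters={"files": ParameterSpecification(type="Array")},
        activities=[loop]))
```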

For more information about Data Factory concepts, see the following articles:

  • Dataset and linked services
  • Pipelines and activities
  • Integration runtime

8. What is Azure Redis Cache and how do you implement it?

Ans: Azure Redis Cache is a managed version of the popular open-source Redis cache, which makes it easy to add Redis to your applications running in Azure. Redis is an in-memory database where data is stored as key-value pairs, and the values can contain data structures like strings, hashes, and lists. You can cache information in Redis and read it back easily, because it is much faster to work with memory than to go to disk and talk to a SQL Server.

  • Suppose we have a web server where your web application is running. The back end has a SQL Server implementation, where SQL Server is running on a VM, or maybe it is an Azure SQL database.
  • A user comes to your application and goes to a page that has tons of products on it.
  • That page has to go to the database to retrieve the information, which then gets sent back to the web server and delivered to the user. But if you have thousands of users hitting that web page, you are constantly hitting the database server, which gets very inefficient.
  • The solution is to add Azure Redis Cache so we can cache all of those read operations in an in-memory database on Azure Redis Cache.
  • When other users come back and look for the same information on the web app, it gets retrieved straight out of Azure Redis Cache very quickly, and hence we take the pressure off the back-end database server.

When deploying Azure Redis Cache, we can deploy it with a single node, deploy it in a different pricing tier with a two-node implementation, or even build an entire cluster with multiple nodes.
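
A minimal sketch of connecting to Azure Redis Cache from Python with the redis package; the cache host name, access key, and cached key/value are assumptions.

```python
import redis

# Host name and access key come from the Azure portal (placeholders here).
r = redis.StrictRedis(
    host="<cache-name>.redis.cache.windows.net",
    port=6380,               # Azure Redis Cache uses SSL on port 6380 by default
    password="<access-key>",
    ssl=True)

# Cache a product page fragment for five minutes, then read it back on later requests.
r.set("product:42", '{"name": "widget", "price": 9.99}', ex=300)
cached = r.get("product:42")
print(cached)
```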

Learn more about Azure Redis Cache here: Introduction to Azure Redis Cache.

10. What is Azure SQL Data Warehouse?

Ans: The dictionary definition of a data warehouse is “a large store of data accumulated from a wide range of sources within a company and used to guide management decisions”. As per this definition, a warehouse collects data from various databases located on remote or distributed systems. It is built by integrating data from multiple sources and is used for analytical reporting, decision making, and so on. Azure SQL Data Warehouse is a cloud-based enterprise data warehouse that uses parallel processing to quickly run complex queries against huge volumes of data. It is also a solution for Big Data scenarios.

Learn more here: Getting Started with Microsoft SQL Data Warehouse.

11. What is Azure Table Storage?

Ans: Azure Table storage is a very popular service, used across many projects, that stores structured NoSQL data in the cloud, providing a key/attribute store with a schemaless design. Table storage is well known for its schemaless architecture, and its main advantage is that it is fast and cost-effective for many types of applications.

Another advantage of Table storage is that you can store flexible datasets, such as user data for a web application, device information, or any other kind of metadata your service requires.

You can store any number of entities in a table, and one storage account may contain any number of tables, up to the capacity limit of the storage account.

Table storage also stores large amounts of structured data. The service is a NoSQL datastore that accepts authenticated calls from inside and outside the Azure cloud. Typical uses include:

  • Storing TBs of structured data.
  • Storing datasets that don’t require complex joins, foreign keys, or stored procedures.
  • Quickly querying data using a clustered index.
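
A minimal sketch using the azure-data-tables package; the connection string, table name, and entity fields are assumptions.

```python
from azure.data.tables import TableServiceClient

# Connect with a storage-account connection string (placeholder).
service = TableServiceClient.from_connection_string("<storage-connection-string>")
table = service.create_table_if_not_exists("userprofiles")

# Entities are schemaless apart from the required PartitionKey and RowKey.
table.create_entity({
    "PartitionKey": "tenant-01",
    "RowKey": "user-1001",
    "DisplayName": "Sample User",
    "Theme": "dark",
})

# Queries that filter on PartitionKey/RowKey use the clustered index and are fastest.
for entity in table.query_entities("PartitionKey eq 'tenant-01'"):
    print(entity["RowKey"], entity.get("DisplayName"))
```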

12. How do you migrate a SQL Server database to Azure SQL?

Ans: It is common to migrate a SQL Server database to Azure SQL. We can use SSMS’s Import and Export features for this purpose.

13. How do you create Azure Functions?

Ans: Azure Functions is a solution for executing small pieces of code, or functions, in the cloud, and we can select the programming language we want to use. We pay only for the time our code executes; that is, we pay per usage. It supports a variety of programming languages, like C#, F#, Node.js, Python, PHP, and Java, and it supports continuous deployment and integration. Azure Functions lets us develop serverless applications.
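
For illustration, a minimal HTTP-triggered function written with the Python v1 programming model; the query parameter and greeting are assumptions, and a matching function.json binding is required.

```python
import logging
import azure.functions as func

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Read an optional query-string parameter and return a greeting.
    logging.info("Processing an HTTP request.")
    name = req.params.get("name", "world")
    return func.HttpResponse(f"Hello, {name}!", status_code=200)
```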

Learn more here: How to Create Azure Functions.

14. What is Azure Databricks?

Ans: Azure Databricks is a fast, easy, and collaborative Apache® Spark™ based analytics platform optimized for Azure. Designed in collaboration with the founders of Apache Spark, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

As an Azure service, customers automatically benefit from native integration with other Azure services such as Power BI, SQL Data Warehouse, Cosmos DB as well as from enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.

Learn more here: What is Azure Databricks.

15. What are the steps for creating an ETL process in Azure Data Factory?

Ans: Suppose we want to extract some data from an Azure SQL Server database, process it where required, and store the result in Azure Data Lake Store. The steps below outline the process, and a hedged code sketch follows the list.

Steps for Creating ETL

  • Create a linked service for the source data store, which is the SQL Server database
  • Assume that we have a cars dataset
  • Create a linked service for the destination data store, which is Azure Data Lake Store
  • Create a dataset for the data destination
  • Create the pipeline and add a copy activity
  • Schedule the pipeline by adding a trigger
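
A hedged sketch of these steps with the Python SDK. All names, the connection string, and the Data Lake Store URI are assumptions; the two datasets referenced by the copy activity (CarsSqlDataset and CarsLakeDataset) are assumed to have been created already, and Data Lake Store authentication settings are omitted.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, SqlServerLinkedService, AzureDataLakeStoreLinkedService,
    SecureString, PipelineResource, CopyActivity, SqlSource, AzureDataLakeStoreSink,
    DatasetReference)

rg, df = "my-resource-group", "my-data-factory"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Step 1: linked service for the source SQL Server database.
adf_client.linked_services.create_or_update(rg, df, "SqlServerLS", LinkedServiceResource(
    properties=SqlServerLinkedService(
        connection_string=SecureString(value="<sql-server-connection-string>"))))

# Step 3: linked service for the destination Azure Data Lake Store.
adf_client.linked_services.create_or_update(rg, df, "DataLakeLS", LinkedServiceResource(
    properties=AzureDataLakeStoreLinkedService(
        data_lake_store_uri="https://<account>.azuredatalakestore.net/webhdfs/v1")))

# Steps 2 and 4: the source (cars) dataset and the destination dataset are assumed
# to already exist under the names referenced below.

# Step 5: pipeline with a copy activity from SQL Server to Data Lake Store.
copy = CopyActivity(
    name="CopyCarsToLake",
    inputs=[DatasetReference(type="DatasetReference", reference_name="CarsSqlDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="CarsLakeDataset")],
    source=SqlSource(),
    sink=AzureDataLakeStoreSink())
adf_client.pipelines.create_or_update(rg, df, "CopyCarsPipeline",
                                      PipelineResource(activities=[copy]))

# Step 6: schedule the pipeline by adding a trigger, using the ScheduleTrigger
# pattern shown under "Triggers" in question 7.
```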

16. How do I access data by using the other 80 dataset types in Data Factory?

Ans:

  • The Mapping Data Flow feature currently supports Azure SQL Database, Azure SQL Data Warehouse, delimited text files from Azure Blob storage or Azure Data Lake Storage Gen2, and Parquet files from Blob storage or Data Lake Storage Gen2 natively as source and sink.
  • Use the Copy activity to stage data from any of the other connectors, and then execute a Data Flow activity to transform the data after it has been staged. For example, your pipeline will first copy into Blob storage, and then a Data Flow activity will use a dataset in Blob storage as its source to transform that data.