
Azure Products

ETL Services
• Azure Data Factory
• Azure Databricks
• Azure Synapse Analytics
• Azure Stream Analytics
• Azure Logic Apps

Storage Services
• Azure Blob Storage
• Azure File Storage
• Azure Queue Storage
• Azure Table Storage
• Azure Disk Storage
• Azure Data Lake Storage
• Azure NetApp Files

Database Services
• Azure Cosmos DB
• Azure SQL Database
• Azure Database for PostgreSQL
• Azure Database for MySQL
• Azure Database for MariaDB
• Azure Database Migration Service
• Azure Cache for Redis
• Azure Synapse Analytics

Analytics Services
• Azure Synapse Analytics
• Azure HDInsight
• Azure Stream Analytics
• Azure Data Lake Analytics
• Azure Data Explorer
• Azure Databricks

Compute Services
• Azure Virtual Machines
• Azure Kubernetes Service
• Azure Functions
• Azure Batch
• Azure Container Instances
• Azure Service Fabric
• Azure Dedicated Host
• Azure Spring Cloud

Containers + Serverless Services
• Azure Kubernetes Service
• Azure Container Instances
• Azure Functions
• Azure Logic Apps
• Azure API Management

Security + Identity Services
• Azure Active Directory
• Azure Security Center
• Azure Key Vault
• Azure Sentinel
• Azure Information Protection
• Azure Bastion
• Azure Private Link

Developer Tools Services
• Azure DevOps
• Azure Artifacts
• Azure Test Plans
• Azure API Management

Management + Governance Services
• Azure Resource Manager
• Azure Monitor
• Azure Service Health
• Azure Advisor
• Azure Policy
• Azure Blueprint
• Azure Cost Management
• Azure Lighthouse

Migration Services
• Azure Migrate
• Azure Site Recovery

Mixed Reality Services
• Azure Spatial Anchors
• Azure Remote Rendering
• Azure Object Anchors

Networking Services
• Azure Virtual Network
• Azure ExpressRoute
• Azure Load Balancer
• Azure Application Gateway
• Azure VPN Gateway
• Azure Firewall
• Azure DDoS Protection
• Azure Front Door

Artificial Intelligence + Machine Learning Services
• Azure Cognitive Services
• Azure Machine Learning
• Azure Bot Service

Internet of Things (IoT) Services
• Azure IoT Hub
• Azure IoT Central
• Azure Digital Twins
• Azure Sphere
• Azure Time Series Insights

Blockchain Services
• Azure Blockchain Service

Important Azure Products for ADF ETL developers

• Azure Blob Storage: Azure Blob Storage is a cloud-based object storage service that is used to store unstructured data, such as text and binary data. It can be used as a data source or data destination for Azure Data Factory ETL pipelines.

• Azure SQL Database: Azure SQL Database is a managed relational database service that is used to store structured data. It can be used as a data source or data destination for Azure Data Factory ETL pipelines.

• Azure Data Lake Storage: Azure Data Lake Storage is a cloud-based storage service that is optimized for big data workloads. It can be used to store structured, semi-structured, and unstructured data, and can be used as a data source or data destination for Azure Data Factory ETL pipelines.

• Azure Functions: Azure Functions is a serverless compute service that allows developers to run event-driven code without managing infrastructure. It can be used to trigger Azure Data Factory pipelines based on events, such as a new file being added to Azure Blob Storage.

• Azure Key Vault: Azure Key Vault is a cloud-based service that allows developers to store and manage cryptographic keys and secrets. It can be used to securely store credentials and other sensitive information that is needed for Azure Data Factory ETL pipelines (see the sketch after this list).

• Azure Monitor: Azure Monitor is a cloud-based service that provides visibility into the performance and health of Azure resources. It can be used to monitor Azure Data Factory pipelines and trigger alerts based on defined conditions.
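
As a small illustration of the Key Vault point above, the following Python sketch shows how a pipeline-related script or custom activity could read a stored connection string at runtime instead of hard-coding it. It assumes the azure-identity and azure-keyvault-secrets packages, and the vault URL and secret name are hypothetical.

# Minimal sketch: fetch a connection string from Azure Key Vault so that
# credentials used by ETL code are never hard-coded.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

credential = DefaultAzureCredential()                      # managed identity or az login
vault_url = "https://my-adf-vault.vault.azure.net"         # hypothetical vault
secret_client = SecretClient(vault_url=vault_url, credential=credential)

# "blob-connection-string" is a placeholder secret name.
storage_conn = secret_client.get_secret("blob-connection-string").value
print("Secret retrieved, length:", len(storage_conn))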

Building Blocks of Azure Data Factory

Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and
manage data pipelines that move and transform data from various sources. Here are the building
blocks of Azure Data Factory:

1. Data Sources: Data sources are the starting point for a data pipeline. Azure Data Factory
supports various data sources such as SQL Server, Azure Blob Storage, Azure Data Lake
Storage, Oracle, and more.
2. Data Destinations: Data destinations are the endpoints of a data pipeline. Azure Data Factory
supports various data destinations such as SQL Server, Azure Blob Storage, Azure Data Lake
Storage, and more.
3. Linked services: Linked services define the connection and authentication information
required to connect to different data sources, such as Azure Blob Storage, Azure SQL
Database, or Salesforce.
4. Datasets: Datasets define the schema and metadata of the data source, such as the location
of the data, file format, and data structure.
5. Pipelines: Pipelines define the flow of data from the source to the destination by specifying
the activities that perform the data transformations and movements.
6. Activities: Activities are the building blocks of pipelines and represent a specific action that
takes place in the pipeline. There are different types of activities such as data movement
activities, data transformation activities, control flow activities, and custom activities.
7. Triggers: Triggers define when a pipeline or activity should run, such as a time-based
schedule or an event-based trigger.
8. Integration Runtimes: Integration Runtimes provide the compute environment and resources
required to execute activities within a pipeline. There are three types of Integration
Runtimes: Azure, Self-hosted, and Azure-SSIS.

By using these building blocks, you can create, configure, and manage your data integration
workflows in Azure Data Factory, which can help you achieve reliable and efficient data integration at
scale.
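
To make these building blocks concrete, here is a condensed Python sketch using the azure-mgmt-datafactory management SDK that creates a linked service, two datasets, and a pipeline containing a single Copy activity. The subscription, resource group, factory, and storage names are placeholders, and exact model names and required fields can vary between SDK versions, so treat this as a sketch rather than a definitive implementation.

# Sketch: linked service -> datasets -> pipeline (Copy activity).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    SecureString, AzureStorageLinkedService, LinkedServiceResource,
    LinkedServiceReference, AzureBlobDataset, DatasetResource,
    DatasetReference, BlobSource, BlobSink, CopyActivity, PipelineResource,
)

sub_id, rg_name, df_name = "<subscription-id>", "my-rg", "my-data-factory"
adf_client = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Linked service: connection information for the storage account.
storage_cs = SecureString(value="DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...")
adf_client.linked_services.create_or_update(
    rg_name, df_name, "BlobLS",
    LinkedServiceResource(properties=AzureStorageLinkedService(connection_string=storage_cs)))

# Datasets: where the data lives and how it is laid out.
ls_ref = LinkedServiceReference(type="LinkedServiceReference", reference_name="BlobLS")
adf_client.datasets.create_or_update(
    rg_name, df_name, "InputBlob",
    DatasetResource(properties=AzureBlobDataset(
        linked_service_name=ls_ref, folder_path="raw/input", file_name="data.csv")))
adf_client.datasets.create_or_update(
    rg_name, df_name, "OutputBlob",
    DatasetResource(properties=AzureBlobDataset(
        linked_service_name=ls_ref, folder_path="curated/output")))

# Pipeline: one Copy activity moving data from the input to the output dataset.
copy = CopyActivity(
    name="CopyRawToCurated",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlob")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlob")],
    source=BlobSource(), sink=BlobSink())
adf_client.pipelines.create_or_update(
    rg_name, df_name, "CopyPipeline", PipelineResource(activities=[copy]))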

Types of Tasks in ADF


Azure Data Factory provides various types of tasks that can be used to build data integration and
transformation pipelines. Here are some of the commonly used types of tasks in Azure Data Factory:

1. Data Movement: The Data Movement task is used to move data between different data
sources and sinks. It supports various data sources, including Azure Blob Storage, Azure Data
Lake Storage, Azure SQL Database, and more.

2. Data Transformation: The Data Transformation task is used to transform data between
different data formats and structures. It supports various data transformation activities, such
as mapping, filtering, joining, and aggregating data.

3. Control Flow: The Control Flow task is used to control the flow of data in a pipeline. It
supports various control flow activities, such as conditional statements, loops, and error
handling.

4. Integration Runtimes: The Integration Runtimes task is used to manage the execution
environment for data integration tasks. It supports various integration runtime types,
including Azure Integration Runtime, Self-hosted Integration Runtime, and Azure-SSIS
Integration Runtime.

5. Data Flows: The Data Flows task is used to visually create data transformation logic using a
drag-and-drop interface. It supports various data transformation activities, such as
transformations, joins, and aggregations.

6. Monitoring and Management: The Monitoring and Management task is used to monitor and
manage the execution of pipelines and tasks. It supports various monitoring and
management features, such as pipeline runs, trigger management, and alerts.
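
Continuing the sketch above (same adf_client, rg_name, and df_name), monitoring in practice usually means starting a run and polling its status, roughly as follows:

# Sketch: trigger a pipeline run and poll until it completes.
import time

run = adf_client.pipelines.create_run(rg_name, df_name, "CopyPipeline", parameters={})
while True:
    pipeline_run = adf_client.pipeline_runs.get(rg_name, df_name, run.run_id)
    if pipeline_run.status not in ("Queued", "InProgress"):
        break                      # Succeeded, Failed, or Cancelled
    time.sleep(30)                 # poll every 30 seconds
print("Pipeline finished with status:", pipeline_run.status)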

Types of Activities
Azure Data Factory is a cloud-based data integration service that allows you to create, schedule, and
manage data pipelines to move and transform data from various sources to various destinations.
Here are some of the different types of activities in Azure Data Factory with examples:

1. Copy Activity: The Copy activity is used to copy data from one data store to another. For
example, you can use the Copy activity to copy data from an on-premises SQL Server
database to an Azure SQL database.

2. Execute Pipeline Activity: The Execute Pipeline activity is used to call another pipeline from
within the current pipeline. For example, you can use this activity to execute a pipeline that
contains a data transformation activity after the data has been copied.

3. Web Activity: The Web activity is used to call a REST API endpoint or a web service. For
example, you can use this activity to call an API to retrieve data from an external system.
4. Stored Procedure Activity: The Stored Procedure activity is used to call a stored procedure in
a SQL Server database. For example, you can use this activity to execute a stored procedure
that performs a data transformation.

5. If Condition Activity: The If Condition activity is used to create a conditional workflow in your
pipeline. For example, you can use this activity to check if a file exists in a data store and only
continue with the pipeline if the file is found.

6. For Each Activity: The For Each activity is used to iterate over a collection of items and
perform an action on each item. For example, you can use this activity to loop through a list
of files and copy each file to a destination.

7. Lookup Activity: The Lookup activity is used to retrieve metadata or a single value from a
data store. For example, you can use this activity to get the schema of a table in a SQL Server
database.

8. Set Variable Activity: The Set Variable activity is used to set the value of a variable in a
pipeline. For example, you can use this activity to set a variable that holds the current date
or time.

9. Wait Activity: The Wait activity is used to pause the execution of a pipeline for a specified
period of time. For example, you can use this activity to wait for a specific time to start a data
transfer operation.

10. Filter Activity: The Filter activity is used to filter data based on a specified condition. For
example, you can use this activity to filter data based on a specific column value before
transferring the data to a destination.

11. Join Activity: The Join activity is used to join data from two or more sources. For example,
you can use this activity to join data from two tables in a SQL Server database.

12. Union Activity: The Union activity is used to combine data from two or more sources. For
example, you can use this activity to combine data from two tables in a SQL Server database
into a single destination.

13. Until Activity: The Until activity is used to execute a loop until a specific condition is met. For
example, you can use this activity to keep copying data until a specific file is found in a data
store.

14. Mapping Data Flow Activity: The Mapping Data Flow activity is used to visually design and
build data transformation logic using a drag-and-drop interface. For example, you can use
this activity to transform data from one format to another, or to combine data from multiple
sources.

15. Databricks Notebook Activity: The Databricks Notebook activity is used to run a Databricks
notebook in a Databricks workspace. For example, you can use this activity to run a Python
or Scala script to transform data.

16. HDInsight Hive Activity: The HDInsight Hive activity is used to execute Hive queries on an
HDInsight cluster. For example, you can use this activity to transform data using HiveQL.

17. HDInsight Pig Activity: The HDInsight Pig activity is used to execute Pig scripts on an
HDInsight cluster. For example, you can use this activity to transform data using Pig Latin.

18. HDInsight MapReduce Activity: The HDInsight MapReduce activity is used to execute
MapReduce jobs on an HDInsight cluster. For example, you can use this activity to perform
complex data transformations on large datasets.

19. Custom Activity: The Custom activity is used to run custom code in a data pipeline. For
example, you can use this activity to run a PowerShell script to perform a specific task.

20. Execute SSIS Package Activity: The Execute SSIS Package activity is used to execute an SSIS
package stored in an Azure Storage account or a SQL Server Integration Services (SSIS)
catalog. For example, you can use this activity to perform data transformations using an
existing SSIS package.

21. Delete Activity: The Delete activity is used to delete data from a data store. For example, you
can use this activity to delete files from an Azure Blob Storage container.

22. Teradata Query Activity: The Teradata Query activity is used to execute queries on a
Teradata database. For example, you can use this activity to extract data from a Teradata
database.

23. Amazon S3 Storage Activity: The Amazon S3 Storage activity is used to copy data between
an Amazon S3 storage account and an Azure Data Factory-supported data store. For example,
you can use this activity to transfer data between an Amazon S3 storage account and an
Azure Blob Storage account.

24. Azure Function Activity: The Azure Function activity is used to execute an Azure Function in a
pipeline. For example, you can use this activity to perform custom data transformations
using an Azure Function.

25. Wait Event Activity: The Wait Event activity is used to pause the execution of a pipeline until
a specific event occurs. For example, you can use this activity to wait for a signal from an
external system before proceeding with the pipeline.

26. Amazon Redshift Query Activity: The Amazon Redshift Query activity is used to execute
queries on an Amazon Redshift database. For example, you can use this activity to extract
data from an Amazon Redshift database.

27. Web Activity: The Web activity is used to call a REST API or a web endpoint from a pipeline.
For example, you can use this activity to call an API to retrieve data or to perform an action.
28. Azure Analysis Services Activity: The Azure Analysis Services activity is used to execute a
command or a query against an Azure Analysis Services database. For example, you can use
this activity to refresh a cube in an Azure Analysis Services database.

29. SharePoint Online List Activity: The SharePoint Online List activity is used to copy data
between a SharePoint Online list and an Azure Data Factory-supported data store. For
example, you can use this activity to transfer data between a SharePoint Online list and an
Azure SQL Database.

30. Stored Procedure Activity: The Stored Procedure activity is used to execute a stored
procedure in a database. For example, you can use this activity to perform a custom data
transformation using a stored procedure.

31. Lookup with a Stored Procedure Activity: The Lookup with a Stored Procedure activity is
used to retrieve data from a database using a stored procedure. For example, you can use
this activity to retrieve data from a SQL Server database using a stored procedure.

32. Copy Activity: The Copy activity is used to copy data between different data stores. For
example, you can use this activity to copy data from an on-premises SQL Server database to
an Azure Blob Storage container.

33. IF Condition Activity: The IF Condition activity is used to evaluate a Boolean expression and
perform different actions based on the result. For example, you can use this activity to
perform different data transformations based on a condition.

34. For Each Activity: The For Each activity is used to loop through a set of items and perform an
action for each item. For example, you can use this activity to process a set of files stored in
an Azure Blob Storage container.

35. Until Activity: The Until activity is used to repeatedly perform an action until a certain
condition is met. For example, you can use this activity to keep polling a system until a
certain status is returned.

36. Filter Activity: The Filter activity is used to filter data based on a condition. For example, you
can use this activity to filter out data that does not meet certain criteria.

37. Set Variable Activity: The Set Variable activity is used to set the value of a variable that can
be used in later activities. For example, you can use this activity to set a variable to the
current date and time.

38. Azure Databricks Notebook Activity: The Azure Databricks Notebook activity is used to
execute a Databricks notebook in a pipeline. For example, you can use this activity to
perform advanced data processing and analytics using Databricks.

39. Lookup Activity: The Lookup activity is used to retrieve data from a data store. For example,
you can use this activity to retrieve metadata from a file stored in Azure Blob Storage.

40. Wait Activity: The Wait activity is used to pause the execution of a pipeline for a specified
amount of time. For example, you can use this activity to introduce a delay between two
activities in a pipeline.

41. If Condition Branch Activity: The If Condition Branch activity is used to define the action that
should be taken based on the result of an If Condition activity. For example, you can use this
activity to perform different data transformations based on the result of the If Condition
activity.

42. Get Metadata Activity: The Get Metadata activity is used to retrieve metadata about a file or
a folder stored in a data store. For example, you can use this activity to retrieve the size,
type, and last modified date of a file stored in Azure Blob Storage.

43. Union Activity: The Union activity is used to combine the results of two or more data
sources. For example, you can use this activity to combine the results of two different SQL
queries into a single data set.
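
As an illustration of chaining a few of the activities listed above (Web, Wait, and Databricks Notebook), the sketch below builds one pipeline with simple success dependencies. The endpoint URL, notebook path, and linked service name are hypothetical, it reuses the adf_client from the earlier sketch, and activity model names may differ slightly between SDK versions.

# Sketch: Web activity -> Wait -> Databricks notebook, chained on success.
from azure.mgmt.datafactory.models import (
    WebActivity, WaitActivity, DatabricksNotebookActivity,
    ActivityDependency, LinkedServiceReference, PipelineResource,
)

call_api = WebActivity(name="CallStatusApi", method="GET",
                       url="https://example.com/api/status")        # placeholder endpoint

pause = WaitActivity(
    name="Pause", wait_time_in_seconds=60,
    depends_on=[ActivityDependency(activity="CallStatusApi",
                                   dependency_conditions=["Succeeded"])])

transform = DatabricksNotebookActivity(
    name="RunNotebook",
    notebook_path="/Shared/transform_orders",                       # hypothetical notebook
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="DatabricksLS"),
    depends_on=[ActivityDependency(activity="Pause",
                                   dependency_conditions=["Succeeded"])])

adf_client.pipelines.create_or_update(
    rg_name, df_name, "ChainedActivities",
    PipelineResource(activities=[call_api, pause, transform]))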

Types of Data movement activities

Azure Data Factory provides a variety of data movement activities that can be used to copy data
from source to destination data stores. Here are some of the most common types of data
movement activities:

1. Copy Activity: Copy activity is used to copy data from a source data store to a destination
data store. It supports various sources and destinations, such as Azure Blob Storage,
Azure Data Lake Storage, Azure SQL Database, Amazon S3, and more.

2. Data Flow Activity: Data Flow activity provides a code-free visual interface to design and
execute data transformations at scale. It supports complex data transformations, such as
aggregation, filtering, pivoting, and joining.

3. Lookup Activity: Lookup activity is used to retrieve data from a source data store based
on a specified query or filter condition. It is useful when you need to retrieve a small
amount of data that is used to enrich or augment other data in the pipeline.

4. Stored Procedure Activity: Stored Procedure activity is used to execute a stored procedure in a SQL database or a SQL Server instance. It supports input and output parameters and can be used to perform complex data transformations.

5. Web Activity: Web activity is used to call a REST API endpoint or a web service to retrieve
or post data. It can be used to integrate with external services, such as Salesforce,
Marketo, or Google Analytics.
6. Execute SSIS Package Activity: Execute SSIS Package activity is used to execute an SSIS
package in Azure-SSIS Integration Runtime. It supports various types of package
execution, such as package stored in Azure Blob Storage or package deployed to an SSIS
catalog.

By using these different types of data movement activities, you can copy, transform, and
integrate data from various sources and destinations in Azure Data Factory.
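
A sketch combining two of the activities above — a Lookup that reads a watermark value and a Stored Procedure call that updates it — again using the management SDK from the earlier sketches. The dataset, procedure, and linked service names are hypothetical, and model names may vary slightly by SDK version.

# Sketch: Lookup (read watermark) followed by a stored procedure call.
from azure.mgmt.datafactory.models import (
    LookupActivity, AzureSqlSource, DatasetReference,
    SqlServerStoredProcedureActivity, LinkedServiceReference,
    ActivityDependency, PipelineResource,
)

get_watermark = LookupActivity(
    name="GetWatermark",
    dataset=DatasetReference(type="DatasetReference", reference_name="WatermarkTable"),
    source=AzureSqlSource(
        sql_reader_query="SELECT MAX(LoadDate) AS LastLoad FROM dbo.Watermark"))

update_watermark = SqlServerStoredProcedureActivity(
    name="UpdateWatermark",
    stored_procedure_name="dbo.usp_UpdateWatermark",                # hypothetical procedure
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureSqlLS"),
    depends_on=[ActivityDependency(activity="GetWatermark",
                                   dependency_conditions=["Succeeded"])])

adf_client.pipelines.create_or_update(
    rg_name, df_name, "WatermarkPipeline",
    PipelineResource(activities=[get_watermark, update_watermark]))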

Types of Data transformation tasks

Azure Data Factory provides various data transformation tasks that can be used to transform data
between different data formats and structures. Here are some of the commonly used types of data
transformation tasks in Azure Data Factory:

Mapping
Mapping: The Mapping task is used to map the columns of the source data to the columns of the
target data. It supports various mapping transformations, such as copy, derived column, and
conditional split.

Join
Join: The Join task is used to combine data from two or more data sources based on a common key. It
supports various join types, such as inner join, left outer join, and full outer join.

Aggregate
Aggregate: The Aggregate task is used to group data based on a specific column or set of columns
and calculate summary statistics for each group. It supports various aggregation functions, such as
count, sum, average, and min/max.

Sort
Sort: The Sort task is used to sort data based on one or more columns. It supports sorting in
ascending or descending order.

Pivot
Pivot: The Pivot task is used to convert rows of data into columns. It supports various pivot types,
such as static pivot, dynamic pivot, and unpivot.

Flatten
Flatten: The Flatten task is used to denormalize hierarchical data into a flat structure. It supports
various flattening transformations, such as JSON flatten and XML flatten.

Data Splitting
Data Splitting: The Data Splitting task is used to split a dataset into multiple outputs based on a
specific condition. It supports various splitting conditions, such as random splitting, percentage
splitting, and size-based splitting.
Merge
Merge: The Merge task is used to merge data from two or more sources based on a specific column
or set of columns. It supports various merge types, such as inner merge and outer merge.

Merge Transformation
Merge Transformation: The Merge transformation is used to combine two or more datasets into a
single output dataset. It supports joining on one or more columns and specifying the join type.

Filter
Filter: The Filter task is used to filter data based on specific criteria. It supports various filtering
functions, such as equal to, greater than, less than, and contains.

Lookup
Lookup: The Lookup task is used to look up values from a reference dataset and add them to the
output of a data flow. It supports various lookup types, such as full cache, partial cache, and no
cache.

Derived Column
Derived Column: The Derived Column task is used to create new columns in a data flow by deriving
values from existing columns. It supports various derived column expressions, such as string
manipulation and date/time functions.

Pivot and Unpivot Transformation


Pivot and Unpivot Transformation: The Pivot and Unpivot transformations are used to rotate or
transpose a dataset. The Pivot transformation is used to convert rows into columns while the Unpivot
transformation is used to convert columns into rows.

Pivot Transformation
Pivot Transformation: The Pivot transformation is used to pivot a dataset by aggregating values in
one or more columns and creating new columns for each unique value. It supports various
aggregation functions, such as count, sum, average, and min/max.

Unpivot Transformation
Unpivot Transformation: The Unpivot transformation is used to unpivot a dataset by creating new
rows for each unique value in one or more columns. It supports various unpivoting options, such as
specifying the column prefix.
Pivot with Aggregation
Pivot with Aggregation: The Pivot with Aggregation task is used to transform a dataset by pivoting
one or more columns and aggregating values in the new columns. It supports various aggregation
functions, such as count, sum, average, and min/max.

Unpivot with Aggregation


Unpivot with Aggregation: The Unpivot with Aggregation task is used to transform a dataset by
unpivoting one or more columns and aggregating values in the new rows. It supports various
aggregation functions, such as count, sum, average, and min/max.

Aggregate Transformation
Aggregate Transformation: The Aggregate transformation is used to group data based on one or more columns and calculate summary statistics for each group, such as count, sum, average, and min/max.

Split
Split: The Split task is used to split a column into multiple columns or a row into multiple rows based on a delimiter or pattern. It supports various splitting functions, such as split by delimiter and split by pattern.

Split Transformation
Split Transformation: The Split transformation is used to split a dataset into two or more outputs based on a specific condition. It supports various splitting conditions, such as percentage splitting and size-based splitting.

Conditional Split
Conditional Split: The Conditional Split task is used to split a dataset into different outputs based on a
set of conditions. It supports various conditional functions, such as equals, not equals, contains, and
starts with.
Window Transformation
Window Transformation: The Window transformation is used to calculate aggregate values over a
window of rows in a dataset. It supports various window functions, such as row number, rank, dense
rank, and lag/lead.

Pivot Mapping
Pivot Mapping: The Pivot Mapping task is used to transform a dataset by pivoting one or more
columns to create a new column for each unique value in the original column.

Unpivot Mapping
Unpivot Mapping: The Unpivot Mapping task is used to transform a dataset by unpivoting one or
more columns to create a new row for each unique value in the original column.

Data Conversion
Data Conversion: The Data Conversion task is used to convert data types between different formats.
It supports various data conversion functions, such as string to integer, integer to string, and
date/time conversion.

Data Masking
Data Masking: The Data Masking task is used to mask sensitive data in a dataset. It supports various
data masking functions, such as partial masking and full masking.

Join Mapping
Join Mapping: The Join Mapping task is used to combine data from two or more data sources based
on a common key. It supports various join types, such as inner join, left outer join, and full outer join.

Select Transformation
Select Transformation: The Select transformation is used to select a subset of columns from a
dataset. It supports selecting individual columns or groups of columns.

Script Transformation
Script Transformation: The Script transformation is used to run custom scripts on a dataset. It
supports various scripting languages, such as Python, R, and JavaScript.
Type Mapping
Type Mapping: The Type Mapping task is used to map source column types to target column types. It
supports various data types, such as string, boolean, integer, and decimal.

Schema Mapping
Schema Mapping: The Schema Mapping task is used to map source schema to target schema. It
supports various schema types, such as flat file schema, XML schema, and JSON schema.

Conditional Mapping
Conditional Mapping: The Conditional Mapping task is used to apply transformation to data based on
a set of conditions. It supports various conditional functions, such as equals, not equals, contains,
and starts with.

Data Aggregation
Data Aggregation: The Data Aggregation task is used to aggregate data based on one or more
columns and calculate summary statistics for each group. It supports various aggregation functions,
such as count, sum, average, and min/max.

Data Reshaping
Data Reshaping: The Data Reshaping task is used to reshape a dataset into a new format or structure.
It supports various reshaping functions, such as melting and casting.

Derived Column Transformation


Derived Column Transformation: The Derived Column transformation is used to create new columns
by applying expressions to existing columns. It supports various data types, such as string, boolean,
integer, and decimal.
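
The following is not Azure Data Factory code; it is a small pandas sketch showing conceptual equivalents of a few of the tasks above (derived column, filter, aggregate, pivot/unpivot), which can help when reasoning about what each transformation should produce. The sample data is invented.

# Conceptual pandas equivalents of a few common transformations.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region":   ["East", "West", "East", "West"],
    "amount":   [100.0, 250.0, 75.0, 300.0],
})

# Derived column: add a computed field.
orders["amount_with_tax"] = orders["amount"] * 1.08

# Filter: keep only rows meeting a condition.
large_orders = orders[orders["amount"] > 90]

# Aggregate: group and summarize.
totals = orders.groupby("region", as_index=False)["amount"].sum()

# Pivot and unpivot: rotate rows to columns and back.
pivoted = orders.pivot_table(index="order_id", columns="region",
                             values="amount", aggfunc="sum")
unpivoted = pivoted.reset_index().melt(id_vars="order_id",
                                       var_name="region", value_name="amount")
print(totals)
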
Types of Control Flow tasks

Azure Data Factory provides different types of control flow tasks that can be used to orchestrate data
integration workflows. Here are some of the most common types of control flow tasks in Azure Data
Factory:

1. If Condition: If Condition task is used to evaluate a condition and execute a specific set of
activities or tasks based on the result. It can be used to perform conditional branching in the
workflow.
2. For Each Loop: For Each Loop task is used to iterate over a list or collection of items and
execute a set of activities or tasks for each item. It supports dynamic iteration over arrays
and object types.
3. Until Condition: Until Condition task is used to repeat a set of activities or tasks until a
specific condition is met. It can be used for iterative processing or looping.
4. Wait: Wait task is used to pause the execution of a pipeline for a specified period of time. It
can be used for scheduling or sequencing activities.
5. Execute Pipeline: Execute Pipeline task is used to execute a child pipeline from the current
pipeline. It can be used for modularization and reusability of pipelines.
6. Execute SSIS Package: Execute SSIS Package task is used to execute an SSIS package in Azure-
SSIS Integration Runtime. It supports various types of package execution, such as package
stored in Azure Blob Storage or package deployed to an SSIS catalog.

By using these different types of control flow tasks, you can build complex data integration workflows
and orchestrate the execution of activities in Azure Data Factory.
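
A hedged sketch of two of these control flow constructs using the SDK from the earlier examples: a ForEach that calls a hypothetical child pipeline for each table name in a parameter, followed by an If Condition that optionally calls a notification pipeline. Pipeline and parameter names are placeholders.

# Sketch: ForEach over a parameter, then a conditional Execute Pipeline.
from azure.mgmt.datafactory.models import (
    ForEachActivity, IfConditionActivity, ExecutePipelineActivity,
    Expression, PipelineReference, PipelineResource,
    ParameterSpecification, ActivityDependency,
)

loop = ForEachActivity(
    name="ForEachTable",
    items=Expression(value="@pipeline().parameters.tables"),
    activities=[ExecutePipelineActivity(
        name="ProcessOneTable",
        pipeline=PipelineReference(type="PipelineReference",
                                   reference_name="CopySingleTable"))])   # hypothetical child

notify = IfConditionActivity(
    name="NotifyIfEnabled",
    expression=Expression(value="@equals(pipeline().parameters.notify, true)"),
    if_true_activities=[ExecutePipelineActivity(
        name="SendNotification",
        pipeline=PipelineReference(type="PipelineReference",
                                   reference_name="NotifyPipeline"))],    # hypothetical child
    depends_on=[ActivityDependency(activity="ForEachTable",
                                   dependency_conditions=["Succeeded"])])

adf_client.pipelines.create_or_update(
    rg_name, df_name, "ControlFlowDemo",
    PipelineResource(activities=[loop, notify],
                     parameters={"tables": ParameterSpecification(type="Array"),
                                 "notify": ParameterSpecification(type="Bool")}))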

Types of Integration Runtime

Azure Data Factory provides different types of integration runtimes to facilitate data integration between various sources and destinations. Here are the most common types of integration runtimes in Azure Data Factory:

1. Azure Integration Runtime: Azure Integration Runtime is a fully managed data integration service that is built on Azure and supports integration with various cloud-based and on-premises data sources. It provides a scalable and secure platform for data integration and supports various data movement activities and connectors.

2. Self-Hosted Integration Runtime: Self-Hosted Integration Runtime is a component that is installed on a local server or virtual machine (VM) and is used to facilitate data integration between on-premises data sources and cloud-based destinations. It provides a secure and flexible platform for data integration and supports various data movement activities and connectors.

3. Azure-SSIS Integration Runtime: Azure-SSIS Integration Runtime is a fully managed service that is used to execute and manage SQL Server Integration Services (SSIS) packages in the cloud. It provides a scalable and cost-effective platform for migrating and running SSIS packages in Azure and supports various SSIS package execution modes and configurations.

4. Azure Synapse Analytics Integration Runtime: Azure Synapse Analytics Integration Runtime is a service that is used to integrate data between Azure Synapse Analytics and other cloud-based and on-premises data sources. It provides a high-performance and scalable platform for data integration and supports various data movement activities and connectors.

By using these different types of integration runtimes, you can integrate data between various
sources and destinations in Azure Data Factory and leverage the capabilities of each runtime to meet
your data integration requirements.
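
A minimal sketch, assuming the same SDK client and placeholder names, of registering a self-hosted integration runtime and retrieving the authentication key that the on-premises IR agent needs during installation; operation and model names may differ across SDK versions.

# Sketch: create a self-hosted IR and fetch its registration key.
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource, SelfHostedIntegrationRuntime,
)

ir_name = "OnPremIR"                                   # placeholder name
adf_client.integration_runtimes.create_or_update(
    rg_name, df_name, ir_name,
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="Bridges on-prem SQL Server")))

keys = adf_client.integration_runtimes.list_auth_keys(rg_name, df_name, ir_name)
print("Use this key when registering the IR agent:", keys.auth_key1)
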
Triggers
Azure Data Factory provides various types of triggers that can be used to schedule and automate
data pipelines and activities. Here are some examples of Azure Data Factory triggers:

Schedule Trigger
Schedule Trigger: The Schedule trigger is used to run a pipeline on a recurring schedule or interval, such as daily at a specific time, every hour, or every 15 minutes. For example, a pipeline could be triggered to load data from a source system into a data warehouse every night at 11:00 PM.

Event-based Trigger
Event-based Trigger: The Event-based trigger is used to start a pipeline when a specific event occurs, such as a file being added to or deleted from a storage account, or a message being sent to a queue or service bus. For example, a pipeline could be triggered to process a file as soon as it is uploaded to a storage account.

Tumbling Window Trigger
Tumbling Window Trigger: The Tumbling Window trigger is used to run a pipeline on a recurring schedule over a series of fixed-size, non-overlapping time windows. It is useful when you need to process only the data that was added or updated within each window, such as the last hour, day, week, or month. For example, a pipeline could be triggered every hour to process just the data that arrived in that hour.

Azure Blob Storage Trigger
Azure Blob Storage Trigger: The Azure Blob Storage trigger is used to start a pipeline when a file is added or modified in an Azure Blob Storage container. For example, a pipeline could be triggered to process a file as soon as it is uploaded to a Blob Storage container.

Azure Queue Storage Trigger
Azure Queue Storage Trigger: The Azure Queue Storage trigger is used to start a pipeline when a new message is added to an Azure Queue Storage queue. For example, a pipeline could be triggered to process a message as soon as it is added to the queue.

Azure Service Bus Trigger
Azure Service Bus Trigger: The Azure Service Bus trigger is used to start a pipeline when a new message is sent to an Azure Service Bus topic or queue. For example, a pipeline could be triggered to process a message as soon as it is sent to a Service Bus topic.

Data Lake Storage Gen2 Trigger


Data Lake Storage Gen2 Trigger: This type of trigger is used to start a pipeline when a file is added or
deleted from an Azure Data Lake Storage Gen2 account.

Event Grid Trigger
Event Grid Trigger: This type of trigger is used to start a pipeline when a specific event occurs in Azure, such as a resource being created or deleted.

By using these different types of triggers, you can automate your data integration processes in a
variety of ways to meet your specific business needs.
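
A sketch of defining and starting a daily schedule trigger with the SDK from the earlier examples, wired to a hypothetical pipeline name; on older SDK versions the final call is triggers.start rather than begin_start.

# Sketch: a daily schedule trigger attached to an existing pipeline.
from datetime import datetime, timedelta
from azure.mgmt.datafactory.models import (
    TriggerResource, ScheduleTrigger, ScheduleTriggerRecurrence,
    TriggerPipelineReference, PipelineReference,
)

recurrence = ScheduleTriggerRecurrence(
    frequency="Day", interval=1,
    start_time=datetime.utcnow() + timedelta(minutes=5), time_zone="UTC")

trigger = TriggerResource(properties=ScheduleTrigger(
    description="Nightly load",
    recurrence=recurrence,
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(type="PipelineReference",
                                             reference_name="NightlyLoadPipeline"))]))

adf_client.triggers.create_or_update(rg_name, df_name, "NightlyTrigger", trigger)
adf_client.triggers.begin_start(rg_name, df_name, "NightlyTrigger")   # starts the trigger
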
Data Flow Transformations in Azure Data Factory

https://www.sqlshack.com/data-flow-transformations-in-azure-data-factory/

August 30, 2021 by Gauri Mahajan

This article will help you get introduced to all the transformations offered in the Data Flow
component of Azure Data Factory.

Introduction
A major practice area of data transport and transformation is the development of ETL pipelines. There are many tools and technologies available that enable building ETL pipelines on-premises as well as on the cloud. An ETL tool is only as capable as the transformations it supports, because transformations are what let developers build rich ETL pipelines with out-of-the-box controls instead of investing in large-scale custom development to transform the data. Azure Data Factory is a service on the Azure cloud that facilitates developing ETL pipelines, and the typical way to transform data in Azure Data Factory is by using the transformations in the Data Flow component. There are several transformations available in this component. In this article, we will go over all the transformations offered in the Data Flow component, understand some key settings, and look at the use cases in which these transforms can be used.

Data Flow Transformations


To access the list of transforms in Data Factory, one needs an instance of the service in which data pipelines can be created. Data Flows can then be created and used as part of a data pipeline. In the Data Flow graph, once you have added one or more data sources, the next logical step is to add one or more transformations.

JOIN
This is the first transform you would find in the list when you click on the + sign in the graph to add transformations. Typically, when you have data from two or more data sources, there is a need to bind this data into a common stream, and this transform can be used for such use cases. Different types of joins are supported, along with the option to specify join conditions with different operators.
SPLIT
In Azure Data Factory, the split transform can be used to divide the data into two streams based
on a criterion. The data can be split based on the first matching criteria or all the matching
criteria as desired. This facilitates discrete types of data processing on data divided categorically
into different streams using this transform.
EXISTS
The Exists transform in Azure Data Factory is the equivalent of the SQL EXISTS clause. It can be used to compare data from one stream with data in another stream using one or multiple conditions. One can use Exists as well as Does Not Exist (the inverse of Exists) to find matching or unique datasets using different types of expressions.
UNION
The Union transformation in Azure Data Factory is equivalent to the UNION clause in SQL. It can
be used to merge data from two data streams that have identical or compatible schema into a
single data stream. The schema from two streams can be mapped by name or ordinal position of
the columns as shown below in the settings.
LOOKUP
The Lookup transform in Azure Data Factory is one of the most critical data transformations that
is used in data flows that involve transactional systems as well as data warehouses. While loading
data into dimension or facts, one needs to validate if the data already exists to take a
corresponding action of updating or inserting data. In transactional systems as well, this
transform can be used to perform an UPSERT-type action. The lookup transform takes data from
an incoming stream and matches it with data from the lookup stream and appends columns from
the lookup stream into the primary stream.
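
For intuition only, here is a rough pandas analogue of the lookup-and-upsert pattern described above; it is not Data Flow code, and the sample data is invented.

# Rough analogue: enrich an incoming stream with columns from a lookup stream.
import pandas as pd

incoming = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 80, 20]})
lookup   = pd.DataFrame({"customer_id": [1, 2], "customer_name": ["Ada", "Grace"]})

enriched = incoming.merge(lookup, on="customer_id", how="left")

# Rows with no match keep NaN in customer_name, which is the basis for deciding
# between an insert and an update (an upsert-style action) downstream.
new_rows      = enriched[enriched["customer_name"].isna()]
existing_rows = enriched[enriched["customer_name"].notna()]
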
DERIVED COLUMN
With this transformation, we are starting with the schema modifier category of transforms. Often
there is a need to create new calculated fields or update data in the existing fields in a data
stream. Derived column transformation can be used in such cases. One can add as many fields as
needed in this transformation and provide the calculation expression for the new fields as shown
below.
SELECT
While data in the input stream is being processed, operations such as joining, merging, splitting, and creating calculated fields may result in unnecessary or duplicate fields. To remove such fields from the data stream, one can rename fields, change the mappings, and remove the undesired fields. The Select transform facilitates this functionality to curate the fields in the data stream.

AGGREGATE
Most ETL data pipelines involve some form of data aggregation, typically while loading data into a data warehouse or an analytical data repository. One of the obvious mechanisms for aggregating or rolling up data in SQL is the GROUP BY clause, and this functionality is provided by the Aggregate transform in Azure Data Factory. Grouping is the most common way of summarizing data, but it is not the only way: different types of aggregation calculations can also be created for custom calculations based on a certain condition, and the Aggregate tab can be used to configure them.
SURROGATE KEY
In a data warehousing scenario, typically with slowly changing dimensions (SCD), the business key cannot be used as the primary key because it can repeat as different versions of the same record are created. Surrogate keys are therefore created to act as a unique identifier for each record. Azure Data Factory provides the Surrogate Key transform to generate these keys.
PIVOT
Converting unique values of rows from a field into columns is known as pivoting data. Pivoting is a very common capability and is available in most data tools, from Microsoft Excel to all different types of databases. Typically, when dealing with data in a nested format, for example data hosted in JSON files, there is a need to reshape the data into a specific schema for reporting or aggregation. In such cases, one can use the Pivot transform.

UNPIVOT
The Unpivot transform is the inverse of the Pivot transform: it converts columns into rows. In Pivot, the data is grouped according to the criterion, and here the data is ungrouped and unpivoted. If we compare its settings with those of the Pivot transform, they are nearly identical.
WINDOW
The Window transform supports creating aggregations using windowing functions such as RANK. It allows building complex aggregations with custom expressions based on well-known window functions like rank, dense rank, ntile, lag, and lead. If you have used window functions in any database, you will find it easy to configure similar functionality with this transform's graphical interface.
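
Again for intuition, a rough pandas analogue of windowed calculations (rank and lag within groups); the Window transform expresses the same ideas declaratively in the Data Flow designer.

# Rough analogue of window functions: rank and lag within each group.
import pandas as pd

sales = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "month":  [1, 2, 1, 2],
    "amount": [100, 150, 200, 180],
})

sales["rank_in_region"] = sales.groupby("region")["amount"].rank(ascending=False)
sales["prev_amount"] = sales.sort_values("month").groupby("region")["amount"].shift(1)
print(sales)
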
RANK
Once the data has been channeled into different streams, validated against source and destination repositories, calculated, and aggregated using custom expressions, one would typically sort the data towards the end of the pipeline. There may also be a need to rank the data according to a specific sorting criterion, and the Rank transform in Azure Data Factory facilitates this functionality.

FLATTEN
At times when data is hierarchical or nested, and the requirement is to flatten the data into a
tabular structure without further optimizations like pivoting or unpivoting, an easier way to
flatten the data in Azure Data Factory is by using the flatten transformation. This transform
converts the values in an array or any nested structure to flattened rows and column structures.
PARSE
Data is mostly read in a structured format from data repositories, but data also exists in semi-structured or document formats like XML, JSON, and delimited text files. Once data is sourced from such formats, the fields and their data types may need to be parsed before the rest of the transforms can be used with that data. For this use case, one can use the Parse transform.

FILTER
One of the most common parts of data processing is filtering the data to limit its scope and process it conditionally. The Filter transformation facilitates filtering the data in Azure Data Factory.
SORT
Another transform that goes together with the Filter transform is the Sort transform. At times, certain datasets, such as those involving time series, cannot be processed correctly without sorting. Also, when data is ready to be loaded into the destination repositories, it is good practice to sort it and load it in sorted order. The Sort transform can be used in such cases.

These are all the transforms available in the data transform component of Azure Data Factory to
build a data flow that transforms data to the desired shape and size.
Conclusion
In this article, we covered different transforms offered by the Data Transform component of
Azure Data Factory. We understood the high-level use-cases when we would consider using
these transforms and glanced through the configuration settings of these transforms.
ADF helps with data ingestion, transformation, and orchestration of pipelines.

Sources → Ingest → Transformation → Target

These transformations can be done by

1. Dataflows in ADF (for less complex transformations)

2. HDInsight (equivalent to EMR [Elastic MapReduce] in AWS; Google's Dataproc is the equivalent of HDInsight in Azure). It manages a Hadoop cluster on Azure with tools like Hive, Spark, and Pig (a Hortonworks cluster).

3. Databricks (specially optimized for Spark, for complex transformations; Spark code written for another cluster, such as a Hortonworks cluster, can also be reused with Databricks. Faster.)

4. Azure Synapse Analytics (can use Spark and SQL)

We use ADF because it has a lot of connectors and provides end-to-end capabilities.

The pipeline can be triggered or scheduled.

Once all these transformations are done, the data needs to be written back to the data lake.

Create a data platform for your ML team:

Source → ADF → Data Lake Gen2 → Transformations using Synapse / Databricks / HDInsight / ADF Dataflows → Data Lake Gen2 / Blob Storage (Blob Storage is slightly cheaper than Data Lake Storage) → ML team takes the data

For ad hoc reporting, your analytics team needs only part of the data, so we take a subset of the transformed data and keep it in, say, Azure SQL Database (RDBMS) or Azure Synapse Analytics (data warehouse), where it can be queried instantly.
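
A small sketch of how an analyst might run such an ad hoc query from Python against the serving database. The server, database, credentials, and table names are hypothetical, and it assumes pyodbc plus the Microsoft ODBC driver are installed.

# Sketch: ad hoc query against the Azure SQL serving layer.
import pyodbc
import pandas as pd

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=reporting;"
    "UID=analyst;PWD=<password>")

df = pd.read_sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn)
print(df)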

Later, using visualization tools like Tableau, Power BI, and Grafana, we can visualize the data.
