By Azhar Mehmood, R&D
Before diving into Azure Data Factory (ADF), let’s ponder a fundamental question:
What is ETL?
And why is it necessary?
ETL stands for extract, transform, and load—a data integration process that combines data from multiple data sources into a single, consistent data store that is loaded into a data warehouse or other target system.
ETL helps businesses and organizations make sense of their data. It takes data from various sources, cleans it up, transforms it into a more useful format, and loads it into a place where it can be analysed to support informed decisions. It also helps ensure compliance with industry regulations and facilitates auditing and traceability for regulatory purposes.
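As a toy illustration of those three steps, here is a tiny extract-transform-load script using only the Python standard library; the file, column, and table names are made up for the example.

```python
import csv
import sqlite3

# Extract: read raw rows from a source file (placeholder file and columns).
with open("sales_raw.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: clean and reshape the data into a consistent format.
cleaned = [
    (r["order_id"], r["country"].strip().upper(), float(r["amount"]))
    for r in rows
    if r.get("amount")  # drop rows with missing amounts
]

# Load: write the result into a target store (here, a local SQLite "warehouse").
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (order_id TEXT, country TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", cleaned)
conn.commit()
conn.close()
```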
Azure Data Factory is a fully managed, serverless data integration solution for ingesting, preparing, and transforming all your data at scale. (To learn more about how OZ, a certified Microsoft Solutions Partner, can help you leverage Azure and other leading-edge Microsoft tools, click here.)
Organizations in every industry can use it for a rich variety of use cases: data engineering, migrating on-premises SSIS packages to Azure, operational data integration, analytics, ingesting data into data warehouses, and more.
To design a data flow in Azure Data Factory, you first specify the data sources you want to draw data from, then apply a rich set of transformations to the data, and finally write it to a data store.
Under the hood, Azure Data Factory runs these data flows for you at scale on a Spark cluster.
Whether you are working with megabytes (MB) or terabytes (TB) of data, Azure Data Factory runs the transformation at Spark scale without you having to set up or tune a Spark cluster. In many ways, the data transformation just works.
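For illustration only, the kind of job ADF builds and runs for you on Spark looks roughly like this hand-written PySpark equivalent; the storage paths and column names are placeholders, and with ADF data flows you design this visually instead of writing the code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("illustrative-data-flow").getOrCreate()

# Source: read from the input data store (placeholder path).
df = spark.read.csv("abfss://raw@<storageaccount>.dfs.core.windows.net/sales/", header=True)

# Transformations: filter, cast, and aggregate.
result = (
    df.filter(F.col("amount").isNotNull())
      .withColumn("amount", F.col("amount").cast("double"))
      .groupBy("country")
      .agg(F.sum("amount").alias("total_amount"))
)

# Sink: write to the destination data store (placeholder path).
result.write.mode("overwrite").parquet(
    "abfss://curated@<storageaccount>.dfs.core.windows.net/sales_by_country/"
)

spark.stop()
```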
Understand Azure Data Factory Core Concepts
Let’s explore some essential concepts to better grasp the landscape of Azure Data Factory (a minimal code sketch of a linked service and a dataset follows this list).
- Pipeline: In Azure Data Factory, a pipeline is a series of connected, automated steps that move and transform data from one place to another. It’s like a virtual assembly line for your data.
- Linked services: Linked services act as bridges that connect Azure Data Factory to external data sources. They are configurations that define the connection information needed to reach data stores, databases, or even applications. Picture them as the key connectors that enable a pipeline to reach various data locations, providing the necessary link between the factory and the raw data.
- Datasets: Datasets are the structured containers that hold your data. They can point to files, tables, or even blobs: essentially, the raw data waiting to be processed. Datasets define the structure and format of your data, ensuring a standardized approach to handling information within the pipeline.
- Activities: Activities are the individual tasks or operations that make up a pipeline. They range from copying files and transforming data to running custom scripts.
- Triggers: Triggers determine when a pipeline should be executed, whether on a schedule, in response to an event, or manually. They add the element of automation.
- Integration Runtime: The Integration Runtime is the engine that powers data movement and transformation. It provides the compute infrastructure that executes the activities in your pipeline, whether in the cloud or on-premises, ensuring that your data flows seamlessly from source to destination.
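To make these concepts concrete, here is a minimal sketch using the azure-mgmt-datafactory Python SDK that registers a linked service and a dataset in an existing factory. The subscription, resource group, factory, and storage values are placeholders, and the exact model classes can vary slightly between SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureBlobDataset, AzureStorageLinkedService, DatasetResource,
    LinkedServiceReference, LinkedServiceResource, SecureString,
)

SUBSCRIPTION_ID = "<subscription-id>"               # placeholder
RG, FACTORY = "<resource-group>", "<factory-name>"  # placeholders

adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Linked service: the connection information ADF needs to reach the storage account.
storage_ls = LinkedServiceResource(
    properties=AzureStorageLinkedService(
        connection_string=SecureString(value="<storage-connection-string>")
    )
)
adf.linked_services.create_or_update(RG, FACTORY, "StorageLinkedService", storage_ls)

# Dataset: the shape and location of the data inside that store.
input_ds = DatasetResource(
    properties=AzureBlobDataset(
        linked_service_name=LinkedServiceReference(
            type="LinkedServiceReference", reference_name="StorageLinkedService"
        ),
        folder_path="input-container/raw",
        file_name="data.csv",
    )
)
adf.datasets.create_or_update(RG, FACTORY, "InputBlobDataset", input_ds)
```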
Connection and workflow of ADF
The image below is an example of an ADF copy workflow and the components and services used to orchestrate the task. Let’s go through the flow and learn each part involved in an ADF transformation job.
This example activity uses two datasets, one as the input source and one as the output destination (the sink), and these datasets are connected through linked services. The linked services point to the actual storage location of the data on both the source and destination ends. The Integration Runtime binds the complete workflow together, and to automate or run the task with a single click, you create a pipeline. A minimal code version of this copy workflow is sketched below.
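Continuing the SDK sketch from the previous section, the copy workflow boils down to a pipeline wrapping a single Copy activity between an input and an output dataset. This assumes the client, RG, and FACTORY variables from the earlier snippet, plus an OutputBlobDataset defined the same way as the input dataset.

```python
from azure.mgmt.datafactory.models import (
    BlobSink, BlobSource, CopyActivity, DatasetReference, PipelineResource,
)

# Copy activity: read from the source dataset and write to the sink dataset.
copy_step = CopyActivity(
    name="CopyBlobToBlob",
    inputs=[DatasetReference(type="DatasetReference", reference_name="InputBlobDataset")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="OutputBlobDataset")],
    source=BlobSource(),
    sink=BlobSink(),
)

# Pipeline: the single-click, repeatable unit that orchestrates the activity.
adf.pipelines.create_or_update(
    RG, FACTORY, "CopyPipeline", PipelineResource(activities=[copy_step])
)

# Run the pipeline on demand (a trigger could schedule this instead).
run = adf.pipelines.create_run(RG, FACTORY, "CopyPipeline", parameters={})
print("Started pipeline run:", run.run_id)
```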
Azure Data Factory enables any developer to use it as part of a continuous integration and delivery (CI/CD) process. CI/CD with Azure Data Factory enables a developer to move Data Factory assets (pipelines, data flows, linked services, and more) from one environment (development, test, production) to another.
Out of the box, Azure Data Factory provides native integration with Azure DevOps and GitHub.
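As a rough illustration only, the Git integration can also be configured from code: the sketch below points a factory at a GitHub repository so its pipelines, data flows, and linked services are version-controlled and can be promoted across environments. The account, repository, branch, folder, and region values are placeholders, and the exact model and method names may differ between SDK versions.

```python
from azure.mgmt.datafactory.models import FactoryGitHubConfiguration, FactoryRepoUpdate

repo_update = FactoryRepoUpdate(
    factory_resource_id=(
        f"/subscriptions/{SUBSCRIPTION_ID}/resourceGroups/{RG}"
        f"/providers/Microsoft.DataFactory/factories/{FACTORY}"
    ),
    repo_configuration=FactoryGitHubConfiguration(
        account_name="<github-account>",      # placeholder
        repository_name="<repository-name>",  # placeholder
        collaboration_branch="main",
        root_folder="/adf",
    ),
)

# Attach the repository to the factory in its Azure region (placeholder region).
adf.factories.configure_factory_repo("<factory-region>", repo_update)
```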
Monitor and manage Azure Data Factory
Azure Data Factory also provides a way to monitor and manage pipelines. To launch the Monitor and Management app, click the Monitor & Manage tile on the Data Factory blade for your data factory.
- The first tab (Pipeline runs) is selected by default. It lists pipelines executed in triggered mode; if a pipeline was run in debug mode, click Debug to monitor that run instead.
- Trigger runs can also be seen from the left menu.
- In the Notifications menu, we can set up alerts to monitor our pipelines.
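The same run information shown in the Monitor experience can also be pulled programmatically. A minimal sketch, assuming the client and the pipeline run started in the earlier snippets:

```python
from datetime import datetime, timedelta

from azure.mgmt.datafactory.models import RunFilterParameters

# Overall status of the pipeline run (Queued, InProgress, Succeeded, Failed, ...).
pipeline_run = adf.pipeline_runs.get(RG, FACTORY, run.run_id)
print("Pipeline run status:", pipeline_run.status)

# Per-activity detail for the same run, over a one-day window around now.
filters = RunFilterParameters(
    last_updated_after=datetime.now() - timedelta(days=1),
    last_updated_before=datetime.now() + timedelta(days=1),
)
activity_runs = adf.activity_runs.query_by_pipeline_run(RG, FACTORY, run.run_id, filters)
for activity in activity_runs.value:
    print(activity.activity_name, activity.status)
```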
Security Considerations
Imagine we are connecting to an Azure SQL database. The default approach is to connect to the database server using the database credentials. To increase security, we should instead create an Azure Key Vault to store those credentials and access them through an Azure Key Vault linked service, as sketched below.
Managed identities help eliminate the need for storing and managing credentials—such as usernames and passwords—which can be vulnerable to security threats like unauthorized access or credential leakage.
Be sure to use a managed identity when connecting to Blob Storage, Azure Data Lake Storage Gen2, and so on. To take it to the next level, use a private endpoint to move data from source to destination.
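A minimal sketch of that Key Vault pattern with the Python SDK, assuming the client from the earlier snippets. The vault URL, server, database, user, and secret names are placeholders, and the factory's managed identity is assumed to have permission to read secrets from the vault.

```python
from azure.mgmt.datafactory.models import (
    AzureKeyVaultLinkedService, AzureKeyVaultSecretReference,
    AzureSqlDatabaseLinkedService, LinkedServiceReference,
    LinkedServiceResource, SecureString,
)

# 1) A linked service pointing at the Key Vault that holds the database password.
kv_ls = LinkedServiceResource(
    properties=AzureKeyVaultLinkedService(base_url="https://<your-vault>.vault.azure.net/")
)
adf.linked_services.create_or_update(RG, FACTORY, "KeyVaultLinkedService", kv_ls)

# 2) An Azure SQL linked service that pulls its password from Key Vault at runtime,
#    so no credential is stored in the factory definition itself.
sql_ls = LinkedServiceResource(
    properties=AzureSqlDatabaseLinkedService(
        connection_string=SecureString(
            value="Server=tcp:<server>.database.windows.net;Database=<db>;User ID=<user>;"
        ),
        password=AzureKeyVaultSecretReference(
            store=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="KeyVaultLinkedService"
            ),
            secret_name="sql-password",  # placeholder secret name
        ),
    )
)
adf.linked_services.create_or_update(RG, FACTORY, "AzureSqlLinkedService", sql_ls)
```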
Azure Data Factory provides more than one hundred connectors for source and destination data stores. For example, let’s say we have a requirement to move and transform data from Google Cloud Storage to Azure Blob Storage. We can do this seamlessly with Azure Data Factory: we just create a Copy activity and, depending on the requirements, transformation activities. Azure Data Factory gives us the option to move and transform data from cloud to cloud and from on-premises to the cloud.
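For the Google Cloud Storage side of that example, the source is wired up the same way as any other connector: a linked service holds the connection details, and the datasets and Copy activity then reference it. A hedged sketch, assuming the GoogleCloudStorageLinkedService model from the Python SDK and placeholder HMAC-style access keys:

```python
from azure.mgmt.datafactory.models import (
    GoogleCloudStorageLinkedService, LinkedServiceResource, SecureString,
)

gcs_ls = LinkedServiceResource(
    properties=GoogleCloudStorageLinkedService(
        access_key_id="<gcs-access-key-id>",                       # placeholder
        secret_access_key=SecureString(value="<gcs-secret-key>"),  # placeholder
    )
)
adf.linked_services.create_or_update(RG, FACTORY, "GcsLinkedService", gcs_ls)
```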
Data migration activities with Azure Data Factory
With Microsoft Azure Data Factory, data migration can occur between two cloud data stores or between an on-premises data store and a cloud data store. The Copy activity in Azure Data Factory copies data from a source data store to a sink data store; supported sources and sinks include Azure Blob Storage, Azure Cosmos DB (DocumentDB API), Azure Data Lake Store, Oracle, Cassandra, and more. To transform data, Azure Data Factory supports transformation activities such as Hive, MapReduce, and Spark that can be added to pipelines either individually or chained with other activities.
If we want to move data to/from a data store that Copy Activity doesn’t support, we can use a .NET custom activity in Azure Data Factory with our own logic for copying/moving data.
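A hedged sketch of that option: a Custom activity that hands the copy/move logic to your own executable running on an Azure Batch pool. The Batch linked service name and the command line are placeholders.

```python
from azure.mgmt.datafactory.models import (
    CustomActivity, LinkedServiceReference, PipelineResource,
)

custom_step = CustomActivity(
    name="CustomCopyLogic",
    # Placeholder command: your own program with its own copy/move logic.
    command="MyCustomCopy.exe --source <source-uri> --destination <destination-uri>",
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureBatchLinkedService"
    ),
)
adf.pipelines.create_or_update(
    RG, FACTORY, "CustomCopyPipeline", PipelineResource(activities=[custom_step])
)
```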
Your Next Steps
Azure Data Factory can be the all-in-one solution that makes everything work together seamlessly—if you have the right partner in your corner.
As Florida’s premier Microsoft Solutions Partner, OZ can put more than a quarter century of experience and expertise at your disposal, empowering you to accelerate your success by leveraging Microsoft’s leading-edge technologies.
Ready to get started?
Reach out to schedule a free consultation today