Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it straightforward to set up and operate end-to-end data pipelines in the cloud.
Organizations use Amazon MWAA to enhance their business workflows. For example, C2i Genomics uses Amazon MWAA in their data platform to orchestrate the validation of algorithms processing cancer genomics data in billions of records. Twitch, a live streaming platform, manages and orchestrates the training and deployment of its recommendation models for over 140 million active users. They use Amazon MWAA to scale, while significantly improving security and reducing infrastructure management overhead.
Today, we are announcing the availability of Apache Airflow version 2.8.1 environments on Amazon MWAA. In this post, we walk you through some of the new features and capabilities of Airflow now available in Amazon MWAA, and how you can set up or upgrade your Amazon MWAA environment to version 2.8.1.
Object storage
As data pipelines scale, engineers struggle to manage storage across multiple systems with unique APIs, authentication methods, and conventions for accessing data, requiring custom logic and storage-specific operators. Airflow now offers a unified object storage abstraction layer that handles these details, letting engineers focus on their data pipelines. Airflow object storage uses fsspec to enable consistent data access code across different object storage systems, thereby streamlining infrastructure complexity.
The following are some of the feature's key benefits:
- Portable workflows – You can switch storage services with minimal changes in your Directed Acyclic Graphs (DAGs)
- Efficient data transfers – You can stream data instead of loading it into memory
- Reduced maintenance – You don't need separate operators, making your pipelines straightforward to maintain
- Familiar programming experience – You can use Python modules, like shutil, for file operations
To use object storage with Amazon Simple Storage Service (Amazon S3), you need to install the package extra s3fs with the Amazon provider (apache-airflow-providers-amazon[s3fs]==x.x.x).
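On Amazon MWAA, one way to do this is through your environment's requirements file; a minimal sketch (the version pin is left as a placeholder, so substitute the provider version that matches your environment):

```
# requirements.txt -- pin to the provider version that matches your environment
apache-airflow-providers-amazon[s3fs]==x.x.x
```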
In the sample code below, you can see how to move data directly from Google Cloud Storage to Amazon S3. Because Airflow's object storage uses shutil.copyfileobj, the objects' data is read in chunks from gcs_data_source and streamed to amazon_s3_data_target.
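The original listing isn't reproduced here, so the following is a minimal sketch of the pattern using Airflow 2.8's ObjectStoragePath. The bucket names, prefixes, and connection IDs (source-bucket, target-bucket, google_cloud_default, aws_default) are illustrative assumptions, and the Google provider with its fsspec support is assumed to be installed alongside the Amazon one:

```python
from __future__ import annotations

import pendulum
from airflow.decorators import dag, task
from airflow.io.path import ObjectStoragePath

# Illustrative buckets and connection IDs -- replace with your own
gcs_data_source = ObjectStoragePath("gs://source-bucket/data/", conn_id="google_cloud_default")
amazon_s3_data_target = ObjectStoragePath("s3://target-bucket/data/", conn_id="aws_default")


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def gcs_to_s3_transfer():
    @task
    def copy_objects() -> None:
        # Iterate over the source prefix and stream each object to Amazon S3.
        # ObjectStoragePath.copy() uses shutil.copyfileobj under the hood,
        # so data is read and written in chunks rather than loaded into memory.
        for obj in gcs_data_source.iterdir():
            if obj.is_file():
                obj.copy(amazon_s3_data_target / obj.name)

    copy_objects()


gcs_to_s3_transfer()
```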
For more information on Airflow object storage, refer to Object Storage.
XCom UI
XCom (cross-communications) allows for the passing of data between tasks, facilitating communication and coordination between them. Previously, developers had to switch to a different view to see the XComs related to a task. With Airflow 2.8, XCom key-values are rendered directly on a tab within the Airflow Grid view, as shown in the following screenshot.
The new XCom tab provides the following benefits:
- Improved XCom visibility – A dedicated tab in the UI provides a convenient and user-friendly way to see all XComs associated with a DAG or task.
- Improved debugging – Being able to see XCom values directly in the UI is helpful for debugging DAGs. You can quickly see the output of upstream tasks without needing to manually pull and inspect them using Python code (see the short sketch after this list).
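As a quick illustration (not from the original post), the following sketch shows how a TaskFlow return value is stored as an XCom, which would then appear on the new tab:

```python
import pendulum
from airflow.decorators import dag, task


@dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def xcom_demo():
    @task
    def produce() -> dict:
        # The return value is pushed as the "return_value" XCom and
        # shows up on the task's XCom tab in the Grid view.
        return {"rows_processed": 42}

    @task
    def consume(stats: dict) -> None:
        print(f"Upstream processed {stats['rows_processed']} rows")

    consume(produce())


xcom_demo()
```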
Task context logger
Managing task lifecycles is crucial for the smooth operation of data pipelines in Airflow. However, certain challenges have persisted, particularly in scenarios where tasks are unexpectedly stopped. This can occur for various reasons, including scheduler timeouts, zombie tasks (tasks that remain in a running state without sending heartbeats), or instances where the worker runs out of memory.
Traditionally, such failures, particularly those triggered by core Airflow components like the scheduler or executor, weren't recorded within the task logs. This limitation required users to troubleshoot outside the Airflow UI, complicating the process of pinpointing and resolving issues.
Airflow 2.8 introduced a significant improvement that addresses this problem. Airflow components, including the scheduler and executor, can now use the new TaskContextLogger to forward error messages directly to the task logs. This feature allows you to see all the relevant error messages related to a task's run in one place. This simplifies the process of figuring out why a task failed, offering a complete perspective of what went wrong within a single log view.
The following screenshot shows how a task is detected as a zombie, and the scheduler log is included as part of the task log.
You need to set the environment configuration parameter enable_task_context_logger to True to enable the feature. Once it's enabled, Airflow can send logs from the scheduler, executor, or callback run context to the task logs, and make them available in the Airflow UI.
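On Amazon MWAA, you can apply Airflow configuration parameters as environment configuration overrides. The following is a minimal sketch using boto3; the environment name is hypothetical, and the option key assumes the setting lives in Airflow's [logging] configuration section:

```python
import boto3

mwaa = boto3.client("mwaa")

# Hypothetical environment name; the key assumes the setting belongs to
# Airflow's [logging] configuration section.
mwaa.update_environment(
    Name="my-airflow-2-8-environment",
    AirflowConfigurationOptions={"logging.enable_task_context_logger": "True"},
)
```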
Listener hooks for datasets
Datasets were introduced in Airflow 2.4 as a logical grouping of data sources to create data-aware scheduling and dependencies between DAGs. For example, you can schedule a consumer DAG to run when a producer DAG updates a dataset. Listeners enable Airflow users to create subscriptions to certain events happening in the environment. In Airflow 2.8, listeners are added for two dataset events: on_dataset_created and on_dataset_changed, effectively allowing Airflow users to write custom code to react to dataset management operations. For example, you can trigger an external system, or send a notification.
Using listener hooks for datasets is straightforward. Complete the following steps to create a listener for on_dataset_changed:
- Create the listener (dataset_listener.py), as sketched after this list.
- Create a plugin to register the listener in your Airflow environment (dataset_listener_plugin.py), also sketched below.
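The original code listings aren't reproduced here; the following minimal sketches show one way to implement the two files, assuming Airflow 2.8's dataset listener hook specification. The printed message is illustrative:

```python
# dataset_listener.py -- reacts whenever any dataset is updated
from airflow.datasets import Dataset
from airflow.listeners import hookimpl


@hookimpl
def on_dataset_changed(dataset: Dataset):
    # Illustrative reaction: log the change. You could instead trigger an
    # external system or send a notification here.
    print(f"Dataset changed: {dataset.uri}")
```

```python
# dataset_listener_plugin.py -- registers the listener module as a plugin
from airflow.plugins_manager import AirflowPlugin

import dataset_listener


class DatasetListenerPlugin(AirflowPlugin):
    name = "dataset_listener_plugin"
    listeners = [dataset_listener]
```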
For more information on how to install plugins in Amazon MWAA, refer to Installing custom plugins.
Set up a new Airflow 2.8.1 environment in Amazon MWAA
You can initiate the setup in your account and preferred Region using the AWS Management Console, API, or AWS Command Line Interface (AWS CLI). If you're adopting infrastructure as code (IaC), you can automate the setup using AWS CloudFormation, the AWS Cloud Development Kit (AWS CDK), or Terraform scripts.
Upon successful creation of an Airflow version 2.8.1 environment in Amazon MWAA, certain packages are automatically installed on the scheduler and worker nodes. For a complete list of installed packages and their versions, refer to Apache Airflow provider packages installed on Amazon MWAA environments. You can install additional packages using a requirements file.
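As one example of the API route, here is a minimal boto3 sketch for creating a version 2.8.1 environment. Every name, ARN, and ID below is a placeholder, and a real setup also needs the S3 bucket, IAM execution role, and VPC resources in place beforehand:

```python
import boto3

mwaa = boto3.client("mwaa")

# All names, ARNs, and IDs below are placeholders for illustration.
mwaa.create_environment(
    Name="my-airflow-2-8-environment",
    AirflowVersion="2.8.1",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/my-mwaa-execution-role",
    SourceBucketArn="arn:aws:s3:::my-mwaa-dags-bucket",
    DagS3Path="dags",
    NetworkConfiguration={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "SubnetIds": ["subnet-0123456789abcdef0", "subnet-0fedcba9876543210"],
    },
)
```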
Upgrade from older versions of Airflow to version 2.8.1
You can take advantage of these latest capabilities by upgrading your older Airflow version 2.x-based environments to version 2.8.1 using in-place version upgrades. To learn more about in-place version upgrades, refer to Upgrading the Apache Airflow version or Introducing in-place version upgrades with Amazon MWAA.
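As a minimal boto3 sketch of an in-place upgrade (the environment name is hypothetical, and validating the upgrade in a staging environment first is a good practice):

```python
import boto3

mwaa = boto3.client("mwaa")

# Hypothetical environment name; triggers an in-place upgrade to 2.8.1.
mwaa.update_environment(
    Name="my-existing-airflow-environment",
    AirflowVersion="2.8.1",
)
```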
Conclusion
In this post, we discussed some important features introduced in Airflow version 2.8, such as object storage, the new XCom tab added to the Grid view, task context logging, listener hooks for datasets, and how you can start using them. We also provided some sample code to show implementations in Amazon MWAA. For the complete list of changes, refer to Airflow's release notes.
For more details and code examples on Amazon MWAA, visit the Amazon MWAA User Guide and the Amazon MWAA examples GitHub repo.
Apache, Apache Airflow, and Airflow are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries.
About the Authors
Mansi Bhutada is an ISV Solutions Architect based in the Netherlands. She helps customers design and implement well-architected solutions in AWS that address their business problems. She is passionate about data analytics and networking. Beyond work, she enjoys experimenting with food, playing pickleball, and diving into fun board games.
Hernan Garcia is a Senior Solutions Architect at AWS based in the Netherlands. He works in the financial services industry, supporting enterprises in their cloud adoption. He is passionate about serverless technologies, security, and compliance. He enjoys spending time with family and friends, and trying out new dishes from different cuisines.