Disaster recovery strategies for Amazon MWAA – Part 2

By admin | June 17, 2024


Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed orchestration service that makes it straightforward to run data processing workflows at scale. Amazon MWAA takes care of operating and scaling Apache Airflow so you can focus on developing workflows. However, although Amazon MWAA provides high availability within an AWS Region through features like Multi-AZ deployment of Airflow components, recovering from a Regional outage requires a multi-Region deployment.

In Part 1 of this series, we highlighted challenges for Amazon MWAA disaster recovery and discussed best practices to improve resiliency. In particular, we discussed two key strategies: backup and restore and warm standby. In this post, we dive deep into the implementation for both strategies and provide a deployable solution to realize the architectures in your own AWS account.

The solution for this post is hosted on GitHub. The README in the repository offers tutorials as well as further workflow details for both the backup and restore and warm standby strategies.

Backup and restore architecture

The backup and restore strategy involves periodically backing up Amazon MWAA metadata to Amazon Simple Storage Service (Amazon S3) buckets in the primary Region. The backups are replicated to an S3 bucket in the secondary Region. In case of a failure in the primary Region, a new Amazon MWAA environment is created in the secondary Region and hydrated with the backed-up metadata to restore the workflows.

The project uses the AWS Cloud Development Kit (AWS CDK) and is set up like a standard Python project. Refer to the detailed deployment steps in the README file to deploy it in your own accounts.
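To give a feel for what a standard CDK v2 Python project of this shape looks like, here is a minimal sketch of an entry point that deploys one stack per Region. The stack class, Regions, and the single versioned bucket (versioning is a prerequisite for S3 cross-Region replication) are illustrative placeholders, not the repository's actual constructs.

    # app.py - minimal sketch of a two-Region CDK entry point (CDK v2).
    # Stack names, Regions, and resources are illustrative only.
    import aws_cdk as cdk
    from aws_cdk import aws_s3 as s3

    class BackupBucketStack(cdk.Stack):
        def __init__(self, scope, construct_id, **kwargs):
            super().__init__(scope, construct_id, **kwargs)
            # Versioning is required on both sides for S3 cross-Region replication.
            s3.Bucket(self, "MetadataBackupBucket", versioned=True)

    app = cdk.App()
    BackupBucketStack(app, "MwaaDrPrimary", env=cdk.Environment(region="us-east-1"))
    BackupBucketStack(app, "MwaaDrSecondary", env=cdk.Environment(region="us-west-2"))
    app.synth()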

The following diagram shows the architecture of the backup and restore strategy and its key components:

• Primary Amazon MWAA environment – The environment in the primary Region hosts the workflows
• Metadata backup bucket – The bucket in the primary Region stores periodic backups of Airflow metadata tables
• Replicated backup bucket – The bucket in the secondary Region syncs metadata backups through Amazon S3 cross-Region replication
• Secondary Amazon MWAA environment – This environment is created on demand during recovery in the secondary Region
• Backup workflow – This workflow periodically backs up Airflow metadata to the S3 buckets in the primary Region
• Recovery workflow – This workflow monitors the primary Amazon MWAA environment and initiates failover in the secondary Region when needed

     

Figure 1: The backup and restore architecture

Two workflows work in conjunction to achieve the backup and restore functionality in this architecture. Let's explore both workflows in detail, following the steps outlined in Figure 1.

Backup workflow

The backup workflow is responsible for periodically backing up your Airflow metadata tables and storing them in the backup S3 bucket. The steps are as follows:

• [1.a] You can deploy the provided solution from your continuous integration and delivery (CI/CD) pipeline. The pipeline includes a DAG deployed to the DAGs S3 bucket, which performs the backup of your Airflow metadata. This is the bucket where you host all of the DAGs for your environment.
• [1.b] The solution enables cross-Region replication of the DAGs bucket. Any new changes to the primary Region bucket, including DAG files, plugins, and requirements.txt files, are replicated to the secondary Region DAGs bucket. However, for existing objects, a one-time replication needs to be performed using S3 Batch Replication.
• [1.c] The DAG deployed to take the metadata backup runs periodically. The metadata backup doesn't include some of the auto-generated tables, and the list of tables to be backed up is configurable (a minimal sketch of such a DAG follows this list). By default, the solution backs up the variable, connection, slot pool, log, job, DAG run, trigger, task instance, and task fail tables. The backup interval is also configurable and should be based on the Recovery Point Objective (RPO), which is the amount of data loss during a failure that your business can sustain.
• [1.d] Similar to the DAGs bucket, the backup bucket is also synced using cross-Region replication, through which the metadata backup becomes available in the secondary Region.
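The following is a minimal sketch of what a metadata-backup DAG could look like, assuming the environment's execution role can write to the backup bucket. The bucket name, table list, schedule, and raw-SQL export are illustrative simplifications, not the solution's actual implementation.

    """Sketch of a metadata-backup DAG. Bucket name, table list, and
    schedule are assumptions; the solution's real DAG is more complete."""
    import csv
    import io

    from airflow import DAG, settings
    from airflow.operators.python import PythonOperator
    from airflow.providers.amazon.aws.hooks.s3 import S3Hook
    from airflow.utils.dates import days_ago
    from sqlalchemy import text

    BACKUP_BUCKET = "mwaa-metadata-backup-primary"  # assumed bucket name
    TABLES = ["variable", "connection", "slot_pool", "log", "job",
              "dag_run", "trigger", "task_instance", "task_fail"]

    def export_tables_to_s3():
        hook = S3Hook()  # uses the environment's execution role
        engine = settings.engine  # SQLAlchemy engine for the metadata database
        with engine.connect() as conn:
            for table in TABLES:
                rows = conn.execute(text(f"SELECT * FROM {table}"))
                buf = io.StringIO()
                writer = csv.writer(buf)
                writer.writerow(rows.keys())   # header row with column names
                writer.writerows(rows)
                hook.load_string(buf.getvalue(),
                                 key=f"metadata-backup/{table}.csv",
                                 bucket_name=BACKUP_BUCKET,
                                 replace=True)

    with DAG(dag_id="metadata_backup",
             schedule_interval="*/30 * * * *",  # align this with your RPO
             start_date=days_ago(1),
             catchup=False) as dag:
        PythonOperator(task_id="export_metadata",
                       python_callable=export_tables_to_s3)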

Recovery workflow

The recovery workflow runs periodically in the secondary Region, monitoring the primary Amazon MWAA environment. It has two functions:

• Store the environment configuration of the primary Amazon MWAA environment in the secondary backup bucket, which is used to recreate an identical Amazon MWAA environment in the secondary Region during failure
• Perform the failover when a failure is detected

The following are the steps for when the primary Amazon MWAA environment is healthy (see Figure 1):

• [2.a] The Amazon EventBridge scheduler starts the AWS Step Functions workflow on a provided schedule.
• [2.b] The workflow, using AWS Lambda, checks Amazon CloudWatch in the primary Region for the SchedulerHeartbeat metrics of the primary Amazon MWAA environment (a sketch of this check follows the list). The environment in the primary Region sends heartbeats to CloudWatch every 5 seconds by default. However, to avoid invoking the recovery workflow spuriously, we use a default aggregation period of 5 minutes to check the heartbeat metrics. Therefore, it can take up to 5 minutes to detect a primary environment failure.
• [2.c] Assuming that the heartbeat was detected in 2.b, the workflow makes the cross-Region GetEnvironment call to the primary Amazon MWAA environment.
• [2.d] The response from the GetEnvironment call is stored in the secondary backup S3 bucket, to be used in case of a failure in subsequent iterations of the workflow. This ensures the latest configuration of your primary environment is used to recreate a new environment in the secondary Region. The workflow completes successfully after storing the configuration.
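A minimal sketch of the heartbeat check in step 2.b might look like the following. The Region, environment name, and threshold are assumptions, and you should verify the exact metric dimensions against what your environment publishes to CloudWatch.

    """Sketch of a Lambda handler checking SchedulerHeartbeat in the
    primary Region. Environment name and Region are illustrative."""
    from datetime import datetime, timedelta, timezone

    import boto3

    PRIMARY_REGION = "us-east-1"          # assumed primary Region
    ENVIRONMENT_NAME = "my-mwaa-primary"  # assumed environment name

    def handler(event, context):
        cloudwatch = boto3.client("cloudwatch", region_name=PRIMARY_REGION)
        now = datetime.now(timezone.utc)
        stats = cloudwatch.get_metric_statistics(
            Namespace="AmazonMWAA",
            MetricName="SchedulerHeartbeat",
            Dimensions=[
                {"Name": "Function", "Value": "Scheduler"},
                {"Name": "Environment", "Value": ENVIRONMENT_NAME},
            ],
            StartTime=now - timedelta(minutes=5),  # 5-minute aggregation period
            EndTime=now,
            Period=300,
            Statistics=["Sum"],
        )
        datapoints = stats["Datapoints"]
        healthy = bool(datapoints) and datapoints[0]["Sum"] > 0
        # The Step Functions workflow branches on this flag.
        return {"primaryHealthy": healthy}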

The following are the steps for the case when the primary environment is unhealthy (see Figure 1):

• [2.a] The EventBridge scheduler starts the Step Functions workflow on a provided schedule.
• [2.b] The workflow, using Lambda, checks CloudWatch in the primary Region for the scheduler heartbeat metrics and detects failure. The scheduler heartbeat check using the CloudWatch API is the recommended approach to detect failure. However, you can implement a custom strategy for failure detection in the Lambda function, such as deploying a DAG that periodically sends custom metrics to CloudWatch or other data stores as heartbeats, and using the function to check those metrics. With the current CloudWatch-based strategy, the unavailability of the CloudWatch API could spuriously invoke the recovery flow.
• [2.c] Skipped
• [2.d] The workflow reads the previously stored environment details from the backup S3 bucket.
• [2.e] The environment details read in the previous step are used to recreate an identical environment in the secondary Region using the CreateEnvironment API call (a condensed sketch follows this list). The API also needs other secondary Region-specific configurations, such as the VPC, subnets, and security groups, which are read from a user-supplied configuration file or environment variables during the solution deployment. The workflow waits in a polling loop until the environment becomes available, then invokes the DAG to restore metadata from the backup S3 bucket. This DAG is deployed to the DAGs S3 bucket as part of the solution deployment.
• [2.f] The DAG for restoring metadata completes hydrating the newly created environment and notifies the Step Functions workflow of completion using the task token integration. The new environment now starts running the active workflows, and the recovery completes successfully.
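The recreation call in step 2.e maps naturally onto the Amazon MWAA API. A condensed sketch, assuming the stored GetEnvironment response and placeholder network configuration, could look like the following; in the actual solution the polling loop is a Step Functions state rather than inline Python.

    """Sketch of recreating the environment in the secondary Region from
    the stored configuration. All names and ARNs are placeholders."""
    import json
    import time

    import boto3

    SECONDARY_REGION = "us-west-2"                   # assumed secondary Region
    BACKUP_BUCKET = "mwaa-metadata-backup-secondary" # assumed bucket name

    mwaa = boto3.client("mwaa", region_name=SECONDARY_REGION)
    s3 = boto3.client("s3", region_name=SECONDARY_REGION)

    # Load the environment details saved by the recovery workflow (step 2.d).
    stored = json.loads(
        s3.get_object(Bucket=BACKUP_BUCKET,
                      Key="environment-config.json")["Body"].read()
    )["Environment"]

    # Keep the primary environment's settings, but swap in secondary-Region
    # networking and buckets supplied at deployment time.
    mwaa.create_environment(
        Name=stored["Name"],
        AirflowVersion=stored["AirflowVersion"],
        EnvironmentClass=stored["EnvironmentClass"],
        DagS3Path=stored["DagS3Path"],
        ExecutionRoleArn="arn:aws:iam::111122223333:role/mwaa-secondary",  # assumed
        SourceBucketArn="arn:aws:s3:::mwaa-dags-secondary",                # assumed
        NetworkConfiguration={
            "SubnetIds": ["subnet-aaaa", "subnet-bbbb"],  # assumed subnets
            "SecurityGroupIds": ["sg-cccc"],              # assumed security group
        },
    )

    # Poll until the environment is usable; a production workflow would also
    # handle CREATE_FAILED. The restore DAG is triggered once AVAILABLE.
    while mwaa.get_environment(Name=stored["Name"])["Environment"]["Status"] != "AVAILABLE":
        time.sleep(60)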

Considerations

Consider the following when using the backup and restore method:

• Recovery Time Objective – From failure detection to workflows running in the secondary Region, failover can take over 30 minutes. This includes new environment creation, Airflow startup, and metadata restore.
• Cost – This strategy avoids the overhead of running a passive environment in the secondary Region. Costs are limited to periodic backup storage, cross-Region data transfer charges, and minimal compute for the recovery workflow.
• Data loss – The RPO depends on the backup frequency. There is a design trade-off to consider here. Although shorter intervals between backups can reduce potential data loss, backups that are too frequent can adversely affect the performance of the metadata database and consequently the primary Airflow environment. Also, the solution can't recover an actively running workflow midway. All active workflows are started fresh in the secondary Region based on the provided schedule.
• Ongoing management – The Amazon MWAA environment and dependencies are automatically kept in sync across Regions in this architecture. As specified in step 1.b of the backup workflow, the DAGs S3 bucket needs a one-time deployment of the existing resources for the solution to work.

Warm standby architecture

The warm standby strategy involves deploying identical Amazon MWAA environments in two Regions. Periodic metadata backups from the primary Region are used to rehydrate the standby environment in case of failover.

The project uses the AWS CDK and is set up like a standard Python project. Refer to the detailed deployment steps in the README file to deploy it in your own accounts.

The following diagram shows the architecture of the warm standby strategy and its key components:

• Primary Amazon MWAA environment – The environment in the primary Region hosts the workflows during normal operation
• Secondary Amazon MWAA environment – The environment in the secondary Region acts as a warm standby ready to take over at any time
• Metadata backup bucket – The bucket in the primary Region stores periodic backups of Airflow metadata tables
• Replicated backup bucket – The bucket in the secondary Region syncs metadata backups through S3 cross-Region replication
• Backup workflow – This workflow periodically backs up Airflow metadata to the S3 buckets in both Regions
• Recovery workflow – This workflow monitors the primary environment and initiates failover to the secondary environment when needed

     

Figure 2: The warm standby architecture

Similar to the backup and restore strategy, the backup workflow (steps 1a–1d) periodically backs up critical Amazon MWAA metadata to S3 buckets in the primary Region, which are synced to the secondary Region.

The recovery workflow runs periodically in the secondary Region, monitoring the primary environment. On failure detection, it initiates the failover procedure. The steps are as follows (see Figure 2):

• [2.a] The EventBridge scheduler starts the Step Functions workflow on a provided schedule.
• [2.b] The workflow checks CloudWatch in the primary Region for the scheduler heartbeat metrics and detects failure. If the primary environment is healthy, the workflow completes without further actions.
• [2.c] The workflow invokes the DAG to restore metadata from the backup S3 bucket.
• [2.d] The DAG for restoring metadata completes hydrating the passive environment and notifies the Step Functions workflow of completion using the task token integration (a sketch of this handshake follows the list). The passive environment starts running the active workflows on the provided schedules.
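The task token handshake in step 2.d can be sketched as follows: the workflow triggers the restore DAG through the MWAA CLI endpoint, passing the Step Functions task token in the DAG run configuration, and the DAG's final task reports back. The environment name, DAG ID, and Region are assumptions; the token itself comes from the state machine's waitForTaskToken integration.

    """Sketch of the task-token handshake between the recovery workflow
    and the restore DAG. Environment and DAG names are illustrative."""
    import base64
    import json

    import boto3
    import urllib3

    def trigger_restore_dag(task_token, env_name="my-mwaa-secondary"):
        """Workflow side: trigger the DAG, passing the token in its conf."""
        mwaa = boto3.client("mwaa", region_name="us-west-2")
        token = mwaa.create_cli_token(Name=env_name)
        conf = json.dumps({"task_token": task_token})
        resp = urllib3.PoolManager().request(
            "POST",
            f"https://{token['WebServerHostname']}/aws_mwaa/cli",
            headers={"Authorization": f"Bearer {token['CliToken']}",
                     "Content-Type": "text/plain"},
            body=f"dags trigger restore_metadata --conf '{conf}'",
        )
        # stdout of the CLI command comes back base64-encoded
        print(base64.b64decode(json.loads(resp.data)["stdout"]))

    def notify_workflow(**context):
        """DAG side: final task hands control back to Step Functions."""
        token = context["dag_run"].conf["task_token"]
        boto3.client("stepfunctions").send_task_success(
            taskToken=token, output=json.dumps({"restored": True})
        )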

Because the secondary environment is already warmed up, the failover is faster, with recovery times in minutes.

Considerations

Consider the following when using the warm standby method:

• Recovery Time Objective – With a warm standby ready, the RTO can be as low as 5 minutes. This includes just the metadata restore and re-enabling DAGs in the secondary Region.
• Cost – This strategy has the added cost of running similar environments in two Regions at all times. With auto scaling for workers, the warm instance can maintain a minimal footprint; however, the web server and scheduler components of Amazon MWAA will remain active in the secondary environment at all times. The trade-off is a significantly lower RTO.
• Data loss – Similar to the backup and restore model, the RPO depends on the backup frequency. Faster backup cycles reduce potential data loss but can adversely affect the performance of the metadata database and consequently the primary Airflow environment.
• Ongoing management – This approach comes with some management overhead. Unlike the backup and restore strategy, any changes to the primary environment configurations need to be manually reapplied to the secondary environment to keep the two environments in sync. Automated synchronization of the secondary environment configurations is future work.

Shared considerations

Although the backup and restore and warm standby strategies differ in their implementation, they share some common considerations:

• Periodically test failover to validate recovery procedures, RTO, and RPO.
• Enable Amazon MWAA environment logging to help debug issues during failover.
• Use the AWS CDK or AWS CloudFormation to manage the infrastructure definition. For more details, see the following GitHub repo or the Quick start tutorial for Amazon Managed Workflows for Apache Airflow, respectively.
• Automate deployments of environment configurations and disaster recovery workflows through CI/CD pipelines.
• Monitor key CloudWatch metrics like SchedulerHeartbeat to detect primary environment failures (a minimal alarm sketch follows this list).
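For the last point, a minimal CDK v2 sketch of a missing-heartbeat alarm is shown below. The environment name, period, and treat-missing-data policy are illustrative choices, not the solution's actual configuration.

    # Sketch of a CloudWatch alarm on SchedulerHeartbeat (CDK v2).
    # Environment name and thresholds are illustrative only.
    import aws_cdk as cdk
    from aws_cdk import aws_cloudwatch as cloudwatch

    class MwaaMonitoringStack(cdk.Stack):
        def __init__(self, scope, construct_id, **kwargs):
            super().__init__(scope, construct_id, **kwargs)
            heartbeat = cloudwatch.Metric(
                namespace="AmazonMWAA",
                metric_name="SchedulerHeartbeat",
                dimensions_map={"Function": "Scheduler",
                                "Environment": "my-mwaa-primary"},  # assumed name
                statistic="Sum",
                period=cdk.Duration.minutes(5),
            )
            cloudwatch.Alarm(
                self, "SchedulerHeartbeatAlarm",
                metric=heartbeat,
                threshold=1,
                evaluation_periods=1,
                comparison_operator=cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
                # No datapoints at all also indicates an unhealthy scheduler.
                treat_missing_data=cloudwatch.TreatMissingData.BREACHING,
            )

    app = cdk.App()
    MwaaMonitoringStack(app, "MwaaMonitoring")
    app.synth()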

Conclusion

In this series, we discussed how the backup and restore and warm standby strategies offer configurable data protection based on your RTO, RPO, and cost requirements. Both use periodic metadata replication and restoration to minimize the area of impact of Regional outages.

Which strategy resonates more with your use case? Feel free to try out our solution and share any feedback or questions in the comments section!


About the Authors

Chandan Rupakheti is a Senior Solutions Architect at AWS. His primary focus at AWS lies in the intersection of analytics, serverless, and AdTech services. He is a passionate technical leader, researcher, and mentor with a knack for building innovative solutions in the cloud. Outside of his professional life, he loves spending time with his family and friends, and listening to and playing music.

Parnab Basak is a Senior Solutions Architect and a Serverless Specialist at AWS. He specializes in creating new cloud-native solutions using modern software development practices like serverless, DevOps, and analytics. Parnab works closely in the analytics and integration services space, helping customers adopt AWS services for their workflow orchestration needs.


