This is a post co-written with Rivlin Pereira & Vaibhav Pandey from Tanzu CloudHealth (VMware by Broadcom).
VMware Tanzu CloudHealth is the cloud cost management platform of choice for more than 20,000 organizations worldwide, who rely on it to optimize and govern their largest and most complex multi-cloud environments. In this post, we discuss how the VMware Tanzu CloudHealth DevOps team migrated their self-managed Apache Kafka workloads (running version 2.0) to Amazon Managed Streaming for Apache Kafka (Amazon MSK) running version 2.6.2. We discuss the system architectures, deployment pipelines, topic creation, observability, access control, topic migration, and all the issues we faced with the existing infrastructure, along with how and why we migrated to the new Kafka setup and some lessons learned.
Kafka cluster overview
In the fast-evolving landscape of distributed systems, VMware Tanzu CloudHealth’s next-generation microservices platform relies on Kafka as its messaging backbone. For us, Kafka’s high-performance distributed log system excels at handling massive data streams, making it indispensable for seamless communication. Serving as a distributed log system, Kafka efficiently captures and stores diverse logs, from HTTP server access logs to security event audit logs.
Kafka’s versatility shines in supporting key messaging patterns, treating messages as basic logs or structured key-value stores. Dynamic partitioning and consistent ordering ensure efficient message organization. The unwavering reliability of Kafka aligns with our commitment to data integrity.
The integration of Ruby services with Kafka is streamlined through the Karafka library, which acts as a higher-level wrapper. Our other language stack services use similar wrappers. Kafka’s robust debugging features and administrative commands play a pivotal role in ensuring smooth operations and infrastructure health.
Kafka as an architectural pillar
In VMware Tanzu CloudHealth’s next-generation microservices platform, Kafka emerges as a critical architectural pillar. Its ability to handle high data rates, support diverse messaging patterns, and guarantee message delivery aligns seamlessly with our operational needs. As we continue to innovate and scale, Kafka remains a steadfast companion, enabling us to build a resilient and efficient infrastructure.
Why we migrated to Amazon MSK
For us, migrating to Amazon MSK came down to three key decision points:
- Simplified technical operations – Running Kafka on a self-managed infrastructure was an operational overhead for us. We hadn’t updated Kafka version 2.0.0 for a while, and Kafka brokers were going down in production, causing issues with topics going offline. We also had to run scripts manually for increasing replication factors and rebalancing leaders, which was additional manual effort.
- Deprecated legacy pipelines and simplified permissions – We were looking to move away from our existing pipelines written in Ansible to create Kafka topics on the cluster. We also had a cumbersome process of giving team members access to Kafka machines in staging and production, and we wanted to simplify this.
- Cost, patching, and support – Because Apache ZooKeeper is fully managed and patched by AWS, moving to Amazon MSK was going to save us time and money. In addition, we discovered that running the same type of brokers was cheaper on Amazon MSK than on self-managed Amazon Elastic Compute Cloud (Amazon EC2). Combined with the fact that AWS applies security patches to the brokers, migrating to Amazon MSK was an easy decision. This also meant that the team was freed up to work on other important things. Finally, getting enterprise support from AWS was also critical in our final decision to move to a managed solution.
How we migrated to Amazon MSK
With the key drivers identified, we moved ahead with a proposed design to migrate our existing self-managed Kafka to Amazon MSK. We carried out the following pre-migration steps before the actual implementation:
- Assessment:
  - Conducted a meticulous assessment of the existing EC2 Kafka cluster, understanding its configurations and dependencies
  - Verified Kafka version compatibility with Amazon MSK
- Amazon MSK setup with Terraform
- Network configuration:
  - Ensured seamless network connectivity between the EC2 Kafka and MSK clusters, fine-tuning security groups and firewall settings
After the pre-migration steps, we implemented the following for the new design:
- Automated deployment, upgrade, and topic creation pipelines for MSK clusters:
  - In the new setup, we wanted automated deployments and upgrades of the MSK clusters in a repeatable fashion using an IaC tool. Therefore, we created custom Terraform modules for MSK cluster deployments as well as upgrades. These modules were called from a Jenkins pipeline for automated deployments and upgrades of the MSK clusters. For Kafka topic creation, we had been using an Ansible-based home-grown pipeline, which wasn’t stable and led to a lot of complaints from dev teams. As a result, we evaluated options for deployments to Kubernetes clusters and used the Strimzi Topic Operator to create topics on MSK clusters. Topic creation was automated using Jenkins pipelines, which dev teams could self-service.
- Better observability for clusters:
  - The old Kafka clusters didn’t have good observability. We only had alerts on Kafka broker disk size. With Amazon MSK, we took advantage of open monitoring using Prometheus. We stood up a standalone Prometheus server that scraped metrics from MSK clusters and sent them to our internal observability tool. As a result of improved observability, we were able to set up robust alerting for Amazon MSK, which wasn’t possible with our old setup.
- Improved COGS and better compute infrastructure:
  - For our old Kafka infrastructure, we had to pay for managing Kafka and ZooKeeper instances, plus any additional broker storage costs and data transfer costs. With the move to Amazon MSK, because ZooKeeper is fully managed by AWS, we only have to pay for Kafka nodes, broker storage, and data transfer costs. As a result, in the final Amazon MSK setup for production, we saved not only on infrastructure costs but also on operational costs.
- Simplified operations and enhanced security:
  - With the move to Amazon MSK, we didn’t have to manage any ZooKeeper instances. Broker security patching was also taken care of by AWS for us.
  - Cluster upgrades became simpler with the move to Amazon MSK; it’s a straightforward process to initiate from the Amazon MSK console.
  - With Amazon MSK, we got automatic scaling of broker storage out of the box. As a result, we didn’t have to worry about brokers running out of disk space, which made the MSK cluster more stable.
  - We also got additional security for the cluster because Amazon MSK supports encryption at rest by default, and various options for encryption in transit are also available. For more information, refer to Data protection in Amazon Managed Streaming for Apache Kafka.
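To make the topic creation pipeline in the list above more concrete, here is a minimal sketch of the kind of `KafkaTopic` custom resource that the Strimzi Topic Operator watches. It is not our actual pipeline code: the helper name `kafka_topic_manifest`, the topic name, the cluster label value, and the partition, replica, and retention settings are all hypothetical, and whether the `strimzi.io/cluster` label is needed depends on how the operator is deployed.

```python
import json


def kafka_topic_manifest(name, cluster, partitions, replicas, config=None):
    """Build a Strimzi KafkaTopic custom resource as a plain dict.

    Hypothetical helper for illustration; all values passed in are assumptions.
    """
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # Label the Topic Operator can use to associate the topic with a cluster.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {
            "partitions": partitions,
            "replicas": replicas,
            "config": dict(config or {}),
        },
    }


# Example values are made up for this sketch.
manifest = kafka_topic_manifest(
    name="billing-events",
    cluster="msk-staging",
    partitions=12,
    replicas=3,
    config={"retention.ms": "604800000"},
)

# JSON is valid YAML, so this output can be fed to `kubectl apply -f -`.
print(json.dumps(manifest, indent=2))
```

A Jenkins job could render one such manifest per requested topic and apply it to the Kubernetes cluster where the operator runs, which is what makes the workflow self-service for dev teams.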
During our pre-migration steps, we validated the setup on the staging environment before moving ahead with production.
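As an illustration of the open monitoring setup mentioned earlier, the following sketch generates a Prometheus file-based service discovery document for a set of MSK brokers. It assumes the ports that Amazon MSK open monitoring exposes (11001 for the JMX exporter and 11002 for the node exporter); the broker host names and the function name `msk_scrape_targets` are hypothetical, and our real Prometheus configuration may differ.

```python
import json

# Ports exposed by Amazon MSK open monitoring.
JMX_EXPORTER_PORT = 11001
NODE_EXPORTER_PORT = 11002


def msk_scrape_targets(broker_hosts):
    """Return a Prometheus file_sd structure with one target group per exporter."""
    return [
        {
            "labels": {"job": "jmx"},
            "targets": [f"{host}:{JMX_EXPORTER_PORT}" for host in broker_hosts],
        },
        {
            "labels": {"job": "node"},
            "targets": [f"{host}:{NODE_EXPORTER_PORT}" for host in broker_hosts],
        },
    ]


# Hypothetical broker DNS names, for illustration only.
brokers = ["b-1.example.internal", "b-2.example.internal", "b-3.example.internal"]
print(json.dumps(msk_scrape_targets(brokers), indent=2))
```

Writing the output to a file referenced by a `file_sd_configs` entry lets a standalone Prometheus server pick up brokers without a restart when the list changes.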
Kafka topic migration strategy
With the MSK cluster setup complete, we performed a data migration of Kafka topics from the old cluster running on Amazon EC2 to the new MSK cluster. To achieve this, we performed the following steps:
- Set up MirrorMaker with Terraform – We used Terraform to orchestrate the deployment of a MirrorMaker cluster consisting of 15 nodes. This demonstrated scalability and flexibility, because we could adjust the number of nodes based on the migration’s concurrent replication needs.
- Implement a concurrent replication strategy – We implemented a concurrent replication strategy with 15 MirrorMaker nodes to expedite the migration process. Our Terraform-driven approach contributed to cost optimization by efficiently managing resources during the migration and ensured the reliability and consistency of the MSK and MirrorMaker clusters. It also showcased how the chosen setup accelerates data transfer, optimizing both time and resources.
- Migrate data – We successfully migrated 2 TB of data in a remarkably short timeframe, minimizing downtime and showcasing the efficiency of the concurrent replication strategy.
- Set up post-migration monitoring – We implemented robust monitoring and alerting during the migration, contributing to a smooth process by identifying and addressing issues promptly.
The following diagram illustrates the architecture after the topic migration was complete.
Challenges and lessons learned
Embarking on a migration journey, especially with large datasets, is often accompanied by unforeseen challenges. In this section, we delve into the challenges encountered during the migration of topics from EC2 Kafka to Amazon MSK using MirrorMaker, and share valuable insights and solutions that shaped the success of our migration.
Challenge 1: Offset discrepancies
One of the challenges we encountered was the mismatch in topic offsets between the source and destination clusters, even with offset synchronization enabled in MirrorMaker. The lesson learned here was that offset values don’t necessarily need to be identical, as long as offset sync is enabled, which makes sure consumers have the correct position to read the data from.
We addressed this problem by using a custom tool to run tests on consumer groups, confirming that the translated offsets were either smaller or caught up, indicating synchronization as per MirrorMaker.
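The custom tool itself is not shown in this post. As a rough sketch of the kind of check it performed, the function below classifies each partition of a consumer group by comparing its translated offset on the destination cluster with that partition's log end offset. Reading "smaller or caught up" as "less than or equal to the destination log end offset" is our assumption for this example, and the function name, data shapes, and sample values are hypothetical.

```python
def classify_offsets(translated, log_end):
    """Compare translated consumer group offsets with destination log end offsets.

    Both arguments map (topic, partition) tuples to integer offsets.
    Returns a dict of lists; "ahead" or "missing" entries need investigation.
    """
    report = {"caught_up": [], "behind": [], "ahead": [], "missing": []}
    for tp, end in sorted(log_end.items()):
        committed = translated.get(tp)
        if committed is None:
            report["missing"].append(tp)    # no translated offset for this partition
        elif committed == end:
            report["caught_up"].append(tp)  # consumer would resume at the log end
        elif committed < end:
            report["behind"].append(tp)     # smaller: consumer re-reads, but skips nothing
        else:
            report["ahead"].append(tp)      # would skip data, so treat as a failure
    return report


def is_synchronized(report):
    """A group passes when no partition is ahead of the log or missing."""
    return not report["ahead"] and not report["missing"]


# Made-up sample data for one consumer group.
log_end = {("orders", 0): 1500, ("orders", 1): 900, ("orders", 2): 420}
translated = {("orders", 0): 1500, ("orders", 1): 870, ("orders", 2): 420}

report = classify_offsets(translated, log_end)
print(report)
print("synchronized:", is_synchronized(report))
```

In practice the two input maps would come from the destination cluster's admin API; they are hard-coded here so the sketch runs without a Kafka connection.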
Challenge 2: Slow data migration
The migration process faced a bottleneck: data transfer was slower than anticipated, especially with a substantial 2 TB dataset. Despite a 20-node MirrorMaker cluster, the speed was insufficient.
To overcome this, the team strategically grouped MirrorMaker nodes based on unique port numbers. Clusters of five MirrorMaker nodes, each with a distinct port, significantly boosted throughput, allowing us to migrate data within hours instead of days.
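The grouping described above can be sketched as follows. This is one possible reading, in which each group of five nodes shares a port that differs from every other group's port; the node names, the base port, and the function name `group_nodes_by_port` are hypothetical and only illustrate the partitioning logic.

```python
def group_nodes_by_port(nodes, group_size=5, base_port=8083):
    """Split nodes into fixed-size groups and give each group its own port.

    base_port is an arbitrary starting value chosen for this example.
    Returns a dict mapping port number to the list of nodes in that group.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = {}
    for start in range(0, len(nodes), group_size):
        port = base_port + start // group_size
        groups[port] = nodes[start:start + group_size]
    return groups


# Twenty hypothetical MirrorMaker nodes, matching the cluster size mentioned above.
nodes = [f"mm-node-{n:02d}" for n in range(1, 21)]
groups = group_nodes_by_port(nodes)

for port, members in groups.items():
    print(port, members)
```

With 20 nodes and a group size of five, this yields four independent groups, each of which could then be assigned its own subset of topics to replicate.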
Challenge 3: Lack of detailed process documentation
Navigating the uncharted territory of migrating large datasets using MirrorMaker highlighted the absence of detailed documentation for such scenarios.
Through trial and error, the team crafted an IaC module using Terraform. This module streamlined the entire cluster creation process with optimized settings, enabling a seamless start to the migration within minutes.
Final setup and next steps
As a result of the move to Amazon MSK, our final setup after topic migration looked like the following diagram.
We’re considering the following future improvements:
Conclusion
In this post, we discussed how VMware Tanzu CloudHealth migrated their existing Amazon EC2-based Kafka infrastructure to Amazon MSK. We walked you through the new architecture, deployment and topic creation pipelines, improvements to observability and access control, topic migration challenges, and the issues we faced with the existing infrastructure, along with how and why we migrated to the new Amazon MSK setup. We also talked about all the advantages that Amazon MSK gave us, the final architecture we achieved with this migration, and lessons learned.
For us, the interplay of offset synchronization, strategic node grouping, and IaC proved pivotal in overcoming obstacles and ensuring a successful migration from Amazon EC2 Kafka to Amazon MSK. This post serves as a testament to the power of adaptability and innovation in migration challenges, offering insights for others navigating a similar path.
If you’re running self-managed Kafka on AWS, we encourage you to try the managed Kafka offering, Amazon MSK.
About the Authors
Rivlin Pereira is a Staff DevOps Engineer at VMware Tanzu Division. He is very passionate about Kubernetes and works on the CloudHealth Platform, building and operating cloud solutions that are scalable, reliable, and cost effective.
Vaibhav Pandey, a Staff Software Engineer at Broadcom, is a key contributor to the development of cloud computing solutions. Specializing in architecting and engineering data storage layers, he is passionate about building and scaling SaaS applications for optimal performance.
Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He helped customers in various industry verticals like healthcare, medical devices, life science, retail, asset management, car insurance, residential REIT, agriculture, title insurance, supply chain, document management, and real estate.
Todd McGrath is a data streaming specialist at Amazon Web Services where he advises customers on their streaming strategies, integration, architecture, and solutions. On the personal side, he enjoys watching and supporting his 3 youngsters in their preferred activities as well as following his own pursuits such as fishing, pickleball, ice hockey, and happy hour with friends and family on pontoon boats. Connect with him on LinkedIn.
Satya Pattanaik is a Sr. Solutions Architect at AWS. He has been helping ISVs build scalable and resilient applications on AWS Cloud. Prior to joining AWS, he played a significant role in Enterprise segments with their growth and success. Outside of work, he spends time learning “how to cook a flavorful BBQ” and trying out new recipes.