
Because the Knowledge Platform staff at Databricks, we leverage our personal platform to supply an intuitive, composable, and complete Knowledge and AI platform to inside knowledge practitioners in order that they will safely analyze utilization and enhance our product and enterprise operations. As our firm matures, we’re particularly motivated to determine knowledge governance to allow safe, compliant and cost-effective knowledge operations. With hundreds of workers and a whole bunch of groups analyzing knowledge, now we have to border and implement constant requirements to attain knowledge governance at scale and continued compliance. We recognized Unity Catalog (UC), usually accessible as of August 2022, as the muse for establishing normal governance practices and thus migrating 100% of our inside lakehouse to Unity Catalog grew to become a high firm precedence.
Why migrate to Unity Catalog to attain Knowledge Governance?
Knowledge migrations are HARD – and costly. So we requested ourselves: Can we obtain our governance objectives with out migrating all the info to Unity Catalog?
We had been utilizing the default Hive Metastore (HMS) in Databricks to handle all of our tables. Constructing our personal knowledge governance options from scratch on high of HMS could be a wasteful endeavor, setting us again a number of quarters. Unity Catalog, alternatively, supplied great worth out of the field:
- Any knowledge on HMS was readable by anyone. UC securely helps fine-grained entry.
- HMS doesn’t present lineage or audit logs. Lineage assist is essential to understanding knowledge flows and empowering efficient knowledge lifecycle administration. Together with audit logs, this offers observability about knowledge adjustments and propagation.
- With higher integration with the in-product search characteristic, UC permits a greater expertise for customers to annotate and uncover high-quality knowledge.
- Delta Sharing, question federation and catalog binding present efficient choices to create cross-region knowledge meshes with out creating safety or compliance dangers.
Unity Catalog migration begins with a governance technique
At a excessive stage, we may go down one in all two paths:
- Carry-and-shift: Copy all of the schemas and tables as is from legacy HMS to a UC catalog whereas giving everyone learn entry to all knowledge. This path is low stage of effort within the brief time period. Nevertheless, we danger bringing alongside outdated datasets and incoherent/unhealthy practices motivated by HMS or natural development. The likelihood of needing a number of giant subsequent migrations to scrub in place could be excessive.
- Transformational: Selectively migrate datasets whereas establishing a core construction for knowledge group in Unity Catalog. Whereas this path requires extra effort within the brief time period, it offers a significant course-correction alternative. Subsequent rounds of incremental (smaller) clean-up could also be vital.
We selected the latter. It allowed us to put the groundwork to introduce future governance coverage whereas offering the requisite skeleton to construct round. We constructed infrastructure to allow paved paths that ensured clear knowledge possession, naming conventions and intentional entry, versus opening entry to all workers by default.
One such instance is the catalog group technique we selected upfront:
Catalog | Objective | Governance |
---|---|---|
Customers | Particular person consumer areas (schemas) |
|
Workforce | Collaborative areas for customers who work collectively |
|
Integration | House for particular integration tasks throughout groups |
|
Essential | Manufacturing setting. |
|
Challenges
Our inside knowledge lake had grow to be extra of a “knowledge swamp” over time, as a result of beforehand highlighted lack of lineage and entry controls in HMS. We didn’t have solutions to three fundamental questions crucial to any migration:
- Who owns desk foo?
- Are all of the tables upstream of foo already migrated to the brand new location?
- Who’re all of the downstream clients of desk foo that should be up to date?
Now think about that lack of visibility on the scale of our knowledge lake:

Now think about a four-person engineering staff pulling this off with none devoted program administration assist in 10 months.
Our Strategy
The migration can virtually be damaged down into 4 phases.
Part 1: Make a Plan, by Unlocking Lineage for HMS
We collaborated with the Unity Catalog and Discovery groups to construct knowledge a lineage pipeline for HMS on inside Databricks workspaces. This allowed us to determine the next:
A. Who up to date a desk and when?
B. Who reads from a desk and when?
C. Whether or not the info was consumed through a dashboard, a question, a job or a pocket book?
A allowed us to deduce the more than likely house owners of the tables. B and C helped set up the “blast radius” of an imminent migration i.e., who’re all of the downstream shoppers to inform and which of them are mission crucial? Moreover, B allowed us to estimate how a lot “stale” knowledge was mendacity round within the knowledge lake that might be merely ignored (and ultimately deleted) to simplify the migration.
This observability was crucial in estimating the general migration effort, creating a sensible timeline for the corporate and informing what tooling, automation and governance insurance policies our staff wanted to spend money on.
After proving its utility internally, we now present our clients a path to allow HMS Lineage for a restricted time frame to help with the migration to Unity Catalog. Discuss to your account consultant to allow it.
Part 2: Cease the Bleeding, by Implementing Knowledge Retention
Lineage observability revealed two crucial insights:
- There have been a ton of “stale” tables within the knowledge lake, that had not been consumed shortly, and had been most likely not price migrating
- The brand new desk creation price on HMS was pretty excessive. This needed to be introduced down considerably (virtually 0) for us to efficiently cutover to Unity Catalog ultimately and have a shot at a profitable migration.
These insights led us to spend money on knowledge retention infrastructure upfront and roll out the next insurance policies, which turbo-charged our effort.
- Rubbish-Accumulate Stale Knowledge: This coverage, shipped proper out of the gates, deleted any HMS desk that wasn’t up to date for 30 days. We supplied groups with a grace interval to register exemptions. This enormously decreased the dimensions of the “haystack” and allowed knowledge practitioners to deal with knowledge that really mattered.
- No New Tables in HMS: 1 / 4 after the migration was underway and there was company-wide consciousness, we rolled out a coverage to stop the creation of any new HMS tables. Whereas conserving the legacy metastore in test, this measure successfully positioned a moratorium on knowledge pipelines nonetheless on HMS as they might now not be prolonged or modified to supply new tables.

With these in place, we had been now not chasing a shifting goal.
Part 3: Distribute the work, utilizing Self-Serve Monitoring Instruments
Most organizations within the firm have a unique cadence for planning, totally different processes for monitoring execution and totally different priorities and constraints. As a small knowledge platform staff, our objective was to attenuate coordination and empower groups to confidently estimate, coordinate, and monitor their OWN dataset migration efforts. To this finish, we turned the lineage observability knowledge into executive-level dashboards, the place every staff may perceive the excellent work on their plate, each as knowledge producers and shoppers, ordered by significance. These allowed additional drill-downs to the supervisor and particular person contributor ranges. These had been up to date on a every day cadence for progress-tracking functions.
Moreover, the info was aggregated right into a leaderboard, permitting management to have visibility and apply stress when required. The worldwide monitoring dashboard additionally served the twin goal of a lookup desk the place shoppers may discover the areas of recent tables migrated to Unity Catalog.
The emphasis on managing the folks and course of dynamics of the Databricks group was an important success driver. Each group is totally different and tailoring your method to your organization is vital to your success.
Part 4: Sort out the Lengthy Tail, utilizing Automation
Successfully herding the lengthy tail is make or break for a migration with 2.5K knowledge shoppers and over 50K consuming entities throughout each staff of the corporate. Counting on knowledge producers or our small platform staff to trace and chase down all these shoppers to do their half by the deadline was a non-starter.
Below the moniker “Migration Wizard”, we constructed an information platform that allowed knowledge producers to register the tables to be deleted or migrated to a catalog in Unity Catalog. Together with the desk paths (new and previous), producers supplied operational metadata just like the end-of-life (EOL) date for the legacy desk and learn how to contact with questions or issues.
The Migration Wizard would then:
- Leverage lineage to detect consumption and notify downstream groups. This focused method allowed groups to not should repeatedly inundate everyone with knowledge deprecation messages
- On EOL day, render a “gentle deletion” through lack of entry and purge the info per week later
- Auto-update DBSQL queries relying on the legacy knowledge to learn from the brand new location

Thus with just a few traces of config, the info producer was successfully and confidently decoupled from the migration effort with out having to fret about downstream affect. Automation continued notifying clients and likewise supplied a swift repair for question breakage found after the deprecation set off was pulled.
Subsequently, the flexibility to auto-update DBSQL and pocket book queries from legacy HMS tables to new UC options has been added to the product to help our clients of their journey to Unity Catalog.
Sticking the Touchdown
In February 2024, we eliminated entry to Hive Metastore and began deleting all remaining legacy knowledge. Given the quantity of communication and coordination, this doubtlessly disruptive change turned out to be clean. Our adjustments didn’t set off any incidents, and we had been in a position to declare “Success” quickly after.

We noticed quick value advantages as unowned jobs that failed as a result of adjustments may now be turned off. Dashboards silently deprecated now failed whereas incurring marginal compute value and might be equally sunsetted.
A crucial goal was to determine options to make migration to Unity Catalog simpler for Databricks clients. The Unity Catalog and different product groups gathered intensive actionable suggestions for product enhancements. The Knowledge Platform staff prototyped, proposed and architected varied options that shall be rolling out to clients shortly.
The Journey Continues
The transfer to Unity Catalog unshackled knowledge practitioners, considerably decreasing knowledge sprawl and unlocking new options. For instance, the Advertising Analytics staff noticed a 10x discount in tables managed through a lineage-enabled identification (and deletion) of deprecated datasets. Entry administration enhancements and lineage, alternatively, have enabled highly effective one-click entry obtainment paths and entry discount automation.
For extra on this, try our discuss on unified governance @ Knowledge + AI Summit 2024. In future blogs on this sequence, we will even dive deeper into governance selections. Keep tuned for extra about our journey to Knowledge Governance!
We want to thank Vinod Marur, Sam Shah and Bruce Wong for his or her management and assist and Product Engineering @ Databricks—particularly Unity Catalog and Knowledge Discovery—for his or her continued partnership on this journey.