
Databricks is thrilled to announce the General Availability (GA) of Primary Key (PK) and Foreign Key (FK) constraints, starting in Databricks Runtime 15.2 and Databricks SQL 2024.30. This release follows a highly successful public preview, embraced by hundreds of weekly active customers, and represents a significant milestone in enhancing data integrity and relational data management across the Lakehouse.
Moreover, Databricks can now use these constraints to optimize queries and eliminate unnecessary operations from the query plan, delivering significantly faster performance.
Primary Key and Foreign Key Constraints
Primary Keys (PKs) and Foreign Keys (FKs) are essential elements in relational databases, acting as fundamental building blocks for data modeling. They provide information about the data relationships in the schema to users, tools, and applications, and they enable optimizations that leverage constraints to speed up queries. Primary and foreign keys are now generally available for your Delta Lake tables hosted in Unity Catalog.
SQL Language
You can define constraints when you create a table:
CREATE TABLE Users (
  UserID INT NOT NULL PRIMARY KEY,
  UserName STRING,
  Email STRING,
  SignUpDate DATE
);
In the above example, we define a primary key constraint on the column UserID. Databricks also supports constraints on groups of columns.
You can also modify existing Delta tables to add or remove constraints:
CREATE TABLE Products (
  ProductID INT NOT NULL,
  ProductName STRING,
  Price DECIMAL(10,2),
  CategoryID INT
);
ALTER TABLE Products ADD CONSTRAINT products_pk PRIMARY KEY (ProductID);
ALTER TABLE Products DROP CONSTRAINT products_pk;
Here we create a primary key named products_pk on the non-nullable column ProductID in an existing table. To successfully execute this operation, you must be the owner of the table. Note that constraint names must be unique within the schema.
The next command removes the primary key by specifying its name.
The same process applies to foreign keys. The following table defines two foreign keys at table creation time:
CREATE TABLE Purchases (
  PurchaseID INT PRIMARY KEY,
  UserID INT,
  ProductID INT,
  PurchaseDate DATE,
  Quantity INT,
  FOREIGN KEY (UserID) REFERENCES Users(UserID),
  FOREIGN KEY (ProductID) REFERENCES Products(ProductID)
);
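Foreign keys can likewise be added to or dropped from an existing table with ALTER TABLE. A minimal sketch, using hypothetical Orders and Regions tables (the table and constraint names here are illustrative, not from the examples above):

```sql
CREATE TABLE Regions (RegionID INT NOT NULL PRIMARY KEY, RegionName STRING);
CREATE TABLE Orders  (OrderID INT NOT NULL PRIMARY KEY, RegionID INT);

-- Add a named foreign key after table creation, then drop it by name
ALTER TABLE Orders ADD CONSTRAINT orders_regions_fk
  FOREIGN KEY (RegionID) REFERENCES Regions(RegionID);
ALTER TABLE Orders DROP CONSTRAINT orders_regions_fk;
```

As with primary keys, you must own the table, and the constraint name must be unique within the schema.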
Please refer to the documentation on CREATE TABLE and ALTER TABLE statements for more details on the syntax and operations related to constraints.
Primary key and foreign key constraints are not enforced by the Databricks engine, but they are useful for indicating a data integrity relationship that is intended to hold true. Databricks can instead enforce primary key constraints upstream as part of the ingest pipeline. See Managed data quality with Delta Live Tables for more information on enforced constraints. Databricks also supports enforced NOT NULL and CHECK constraints (see the Constraints documentation for more information).
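As an illustrative sketch of the enforced variety (the table and constraint names here are hypothetical):

```sql
-- NOT NULL is declared inline and enforced on write
CREATE TABLE Events (
  EventID INT NOT NULL,
  EventDate DATE
);

-- CHECK constraints are also enforced on write: an INSERT whose rows
-- violate the condition fails rather than being silently accepted
ALTER TABLE Events ADD CONSTRAINT valid_date
  CHECK (EventDate >= '2000-01-01');
```

This is the key distinction: NOT NULL and CHECK constraints reject violating writes, while PK and FK constraints are informational unless you enforce them yourself upstream.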
Partner Ecosystem
Tools and applications such as the latest versions of Tableau and PowerBI can automatically import and utilize your primary key and foreign key relationships from Databricks through JDBC and ODBC connectors.
View the constraints
There are several ways to view the primary key and foreign key constraints defined on a table. You can simply use SQL commands to view constraint information with the DESCRIBE TABLE EXTENDED command:
> DESCRIBE TABLE EXTENDED Purchases
... (omitting other outputs)
# Constraints
purchases_pk PRIMARY KEY (`PurchaseID`)
purchases_products_fk FOREIGN KEY (`ProductID`) REFERENCES `main`.`example`.`products` (`ProductID`)
purchases_users_fk FOREIGN KEY (`UserID`) REFERENCES `main`.`example`.`users` (`UserID`)
Catalog Explorer and Entity Relationship Diagram
You can also view the constraint information through Catalog Explorer:

Each primary key and foreign key column has a small key icon next to its name.
And you can visualize the primary and foreign key information and the relationships between tables with the Entity Relationship Diagram in Catalog Explorer. Below is an example of a table purchases referencing two tables, users and products:

INFORMATION_SCHEMA
The following INFORMATION_SCHEMA tables also provide constraint information:
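As a sketch, assuming a catalog named main containing a schema named example (as in the DESCRIBE output shown earlier), you could list the constraints on its tables with a query like:

```sql
-- Standard information-schema views expose constraint metadata
SELECT constraint_name, table_name, constraint_type
FROM main.information_schema.table_constraints
WHERE table_schema = 'example';
```

Joining against key_column_usage would additionally show which columns each constraint covers.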
Use the RELY option to enable optimizations
If you know that the primary key constraint is valid (for example, because your data pipeline or ETL job enforces it), then you can enable optimizations based on the constraint by specifying it with the RELY option, like:
PRIMARY KEY (c_customer_sk) RELY
Using the RELY option lets Databricks optimize queries in ways that depend on the constraint's validity, since you are guaranteeing that the data integrity is maintained. Exercise caution here, because if a constraint is marked as RELY but the data violates the constraint, your queries may return incorrect results.
If you don't specify the RELY option for a constraint, the default is NORELY, in which case constraints may still be used for informational or statistical purposes, but queries will not rely on them to run correctly.
The RELY option and the optimizations utilizing it are currently available for primary keys, and will be coming soon for foreign keys.
You can change a table's primary key between RELY and NORELY by using ALTER TABLE, for example:
ALTER TABLE customer DROP PRIMARY KEY;
ALTER TABLE customer ADD PRIMARY KEY (c_customer_sk) RELY;
Speed up your queries by eliminating unnecessary aggregations
One simple optimization we can perform with RELY primary key constraints is eliminating unnecessary aggregates. For example, consider a query applying a DISTINCT operation over a table whose primary key uses RELY:
SELECT DISTINCT c_customer_sk FROM customer;
We can remove the unnecessary DISTINCT operation:
SELECT c_customer_sk FROM customer;
As you can see, this query relies on the validity of the RELY primary key constraint – if there are duplicate customer IDs in the customer table, the transformed query will return incorrect duplicate results. You are responsible for enforcing the validity of the constraint if you set the RELY option.
If the primary key is NORELY (the default), the optimizer will not remove the DISTINCT operation from the query. The query may then run slower, but it always returns correct results even if there are duplicates. If the primary key is RELY, Databricks can remove the DISTINCT operation, which can greatly speed up the query – by about 2x for the above example.
Speed up your queries by eliminating unnecessary joins
Another very useful optimization we can perform with RELY primary keys is eliminating unnecessary joins. If a query joins a table that is not referenced anywhere except in the join condition, the optimizer can determine that the join is unnecessary and remove it from the query plan.
To give an example, let's say we have a query joining two tables, store_sales and customer, joined on the primary key of the customer table, PRIMARY KEY (c_customer_sk) RELY:
SELECT SUM(ss_quantity)
FROM store_sales ss
LEFT JOIN customer c
ON ss_customer_sk = c_customer_sk;
If we didn't have the primary key, each row of store_sales could potentially match multiple rows in customer, and we would have to execute the join to compute the correct SUM value. But because the table customer is joined on its primary key, we know that the join will output exactly one row for each row of store_sales.
So the query only actually needs the column ss_quantity from the fact table store_sales. Therefore, the query optimizer can eliminate the join entirely, transforming the query into:
SELECT SUM(ss_quantity)
FROM store_sales ss
This runs much faster by avoiding the entire join – in this example we observed the optimization speed up the query from 1.5 minutes to 6 seconds! And the benefits can be even larger when the join involves many tables that can be eliminated.

You may ask, why would anyone run a query like this? It's actually far more common than you might think! One common reason is that users construct views that join together multiple tables, such as joining many fact and dimension tables. They then write queries over these views which often use columns from only some of the tables, not all of them – so the optimizer can eliminate the joins against the tables that aren't needed by each query. This pattern is also common in many Business Intelligence (BI) tools, which often generate queries joining many tables in a schema even when a query only uses columns from some of the tables.
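To sketch this pattern, consider a hypothetical wide view (the view and column names below are illustrative) that joins a store_sales fact table to a customer dimension carrying PRIMARY KEY (c_customer_sk) RELY:

```sql
-- Hypothetical view joining the fact table to a dimension
CREATE VIEW sales_enriched AS
SELECT ss.*, c.c_first_name, c.c_last_name
FROM store_sales ss
LEFT JOIN customer c ON ss.ss_customer_sk = c.c_customer_sk;

-- This query touches only fact-table columns, so with a RELY primary
-- key on customer the optimizer can drop the join to customer entirely
SELECT SUM(ss_quantity) FROM sales_enriched;
```

The view author pays for the joins once, conceptually, but each individual query only pays for the joins it actually needs.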
Conclusion
Since its public preview, more than 2,600 Databricks customers have used primary key and foreign key constraints. Today, we're excited to announce the general availability of this feature, marking a new stage in our commitment to enhancing data management and integrity in Databricks.
Moreover, Databricks now takes advantage of key constraints with the RELY option to optimize queries, such as by eliminating unnecessary aggregates and joins, resulting in much faster query performance.