Hash Functions in Microsoft Fabric: Decision Framework
Beginner-friendly guide to choosing hash functions across Microsoft Fabric. Why Spark hash() breaks at scale, how to make SHA-256 match across Spark, Warehouse, and KQL, and a top-5 comparison on F64 SKU.
TL;DR
In Microsoft Fabric, the right hash function depends on whether the hash crosses engine boundaries: use xxhash64 for single-engine Spark work, sha2(..., 256) when the hash must match across Spark, Warehouse, and KQL, and never use Spark's default hash() for change detection (collides at ~77,000 rows).
The short version:
- Never use Spark
hash()to track row changes or build keys. It is a 32-bit hash that starts colliding around 77,000 rows. - Use
xxhash64when the hash stays inside Spark and the table is under a few billion rows. - Use
sha2(..., 256)when the hash leaves Spark (Warehouse, KQL, Python) or has to live for years. - For PII, use
sha2(value + salt, 256)with the salt stored in Key Vault. - For SHA-256 to match across engines, everyone must use the same byte encoding. Pick UTF-8 and enforce it.
The rest of this article shows why, with one realistic benchmark on an F64 capacity.
The trap most teams fall into
A teammate needs to detect changes in an SCD Type 2 pipeline. They reach for the most familiar PySpark function:
df = df.withColumn("row_hash", F.hash(*business_key_cols))
It compiles. It runs. It ships. Twelve months later, the dimension is short rows. The audit log shows the pipeline processed them; the table does not have them. The bug is not in the SCD2 merge logic. The bug is that hash() in Spark is a 32-bit Murmur3 algorithm, and the table has crossed 100 million rows.
How likely is a collision? At 77,163 rows in a 32-bit space, the probability of at least one collision is over 50% (the birthday paradox bound). At 100 million rows, the expected number of collisions is roughly 1.16 million. That is not a corner case. That is the default.
Border thickness, rank number, and plain-text safety label all encode severity, so the diagram is readable without color.
Rule of thumb
Never use Spark hash() for change detection, idempotency keys, or row signatures. Use xxhash64 for single-engine work above one million rows. Use sha2(..., 256) when the hash crosses an engine boundary or has to live for years.
The top 5 hash functions in Microsoft Fabric
These are the five you will actually pick from. Stability means the algorithm is guaranteed not to change between platform versions.
Two notes on stability:
- KQL
hash()is not stable. The docs say the algorithm "may change." Never persist its output. Usehash_sha256()for anything that crosses the query boundary. - DAX has a
HASHfunction. It is undocumented and unsupported. Do not use it. Materialize hashes upstream.
How they actually perform on F64
On an F64 capacity (medium Spark pool, around 64 vCores total), hashing a 100 million row table with five string columns gives roughly these numbers. They are directional only. The bottleneck at this scale is usually I/O, not CPU; run your own benchmark before defending these to finance.
| Function | Width | Compute time (100M × 5 cols) | Relative CU cost | Cross-engine readable |
|---|---|---|---|---|
xxhash64() | 64-bit | ~2 to 5 seconds | Lowest | No (Spark only) |
hash() (Murmur3) | 32-bit | ~2 to 5 seconds | Lowest | No (Spark only) |
md5() | 128-bit | ~6 to 12 seconds | Low | Yes |
sha2(col, 256) | 256-bit | ~15 to 35 seconds | Medium | Yes (matches T-SQL and KQL when encoded UTF-8) |
HASHBYTES('SHA2_256', col) | 256-bit | ~25 to 70 seconds (Warehouse engine) | Medium-high | Yes (origin engine) |
How to read this:
- For a nightly batch of 100 million rows, the difference between
xxhash64andsha2-256is roughly 20 to 30 seconds. That is rarely the cost driver, so pick the right tool for correctness, not speed. - For a million-row hourly job,
sha2-256finishes well under a second on F64. Speed is a non-issue. - If you are hashing 10 billion rows per run, run an actual benchmark. The numbers above are not extrapolatable that far.
The relative ordering (xxHash > Murmur3 > MD5 > SHA-1 > SHA-256 > SHA-512) is consistent across published benchmarks. Absolute throughput depends on CPU instruction set; the Fabric Native Execution Engine may shift these by accelerating cryptographic implementations differently.
Making SHA-256 match across engines
The most common cross-engine pattern in Microsoft Fabric: Spark writes a hash to a Delta table, T-SQL Warehouse reads it through the SQL analytics endpoint, KQL queries it via mirrored Eventhouse, Python verifies it from a notebook. For the hashes to match byte for byte, the input must be encoded the same way in every engine.
The silent failure
Fabric Warehouse table columns can only be VARCHAR (UTF-8 collation, no NVARCHAR column type). The encoding mismatch shows up when NVARCHAR slips in via a CAST expression, a variable, or a stored procedure migrated from SQL Server or Synapse. Different bytes in, different hash out, no error message.
Try it yourself with the literal customer ID CUST-12345 (a realistic SCD2 key):
-- T-SQL Warehouse, VARCHAR (UTF-8)
SELECT HASHBYTES('SHA2_256', CAST('CUST-12345' AS VARCHAR(16)));
-- 0x0E3A6606CF6E32F1...2B6DB211
-- T-SQL Warehouse, NVARCHAR cast (UTF-16LE; not a column type in Fabric Warehouse, but legal in expressions)
SELECT HASHBYTES('SHA2_256', CAST('CUST-12345' AS NVARCHAR(16)));
-- 0x9ABD0CEE9E00EBE6...8E3E6156
-- Spark, default sha2 (UTF-8 internally)
SELECT sha2('CUST-12345', 256);
-- 0e3a6606cf6e32f1...2b6db211
-- Spark, forced UTF-16LE to match T-SQL NVARCHAR
SELECT sha2(encode('CUST-12345', 'UTF-16LE'), 256);
-- 9abd0cee9e00ebe6...8e3e6156
-- KQL Eventhouse
print hash_sha256('CUST-12345')
// 0e3a6606cf6e32f1...2b6db211
-- Python
import hashlib; hashlib.sha256('CUST-12345'.encode('utf-8')).hexdigest()
# '0e3a6606cf6e32f1...2b6db211'
Same string. Two hashes. The difference is the encoding.
Two paths, two hashes. The thick border on Path B marks the one that breaks silently when an NVARCHAR cast slips into a HASHBYTES call.
One encoding, everywhere
Pick UTF-8 across the platform. That means VARCHAR on T-SQL, default sha2 on Spark, native hash_sha256 on KQL, and 'utf-8' in Python. Write it on the engineering wiki. Add it to code review.
Handling NULLs without breaking the hash
Two Spark functions handle NULLs differently, and both can ruin a row signature.
concat('a', NULL, 'b')returnsNULL. Any NULL in any column kills the entire hash.concat_ws('|', 'a', NULL, 'b')returns'a|b'. The NULL is silently dropped, so('a', NULL, 'b')and('a', '', 'b')produce the same hash even though they are different rows.
The safe pattern is to replace every NULL with a sentinel character that cannot appear in real data, then concatenate:
from pyspark.sql import functions as F
KEY_COLS = ["customer_id", "region_code", "product_sku"]
NULL_SENTINEL = "␀" # U+2400 SYMBOL FOR NULL
SEPARATOR = "\x1f" # U+001F UNIT SEPARATOR
row_hash = F.sha2(
F.encode(
F.concat_ws(
SEPARATOR,
*[F.coalesce(F.col(c).cast("string"), F.lit(NULL_SENTINEL)) for c in KEY_COLS]
),
"UTF-8"
),
256
)
df = df.withColumn("row_hash", row_hash)
Three properties: NULLs are distinct from empty strings, the separator cannot collide with real data, and the result matches HASHBYTES('SHA2_256', VARCHAR) over the same concatenated string.
PII pseudonymization
For PII columns (email, phone, ID), four things matter:
- Cryptographic strength rules out Murmur3, xxHash, MD5, and SHA-1.
- A salt prevents rainbow-table attacks against common values.
- The salt must come from Key Vault, never the notebook.
- The same input must produce the same hash so joins still work.
from pyspark.sql import functions as F
# notebookutils is a Fabric notebook built-in; no import needed.
# Retrieve salt from Key Vault, never hardcode.
salt = notebookutils.credentials.getSecret(
"https://my-kv.vault.azure.net/", "pii-pepper-v1"
)
pii_hash = F.sha2(F.concat(F.col("email"), F.lit(salt)), 256)
df = df.withColumn("email_pseudonym", pii_hash).drop("email")
Version the salt name (pii-pepper-v1), drop the original PII column in the same transformation, and document the encoding. For GDPR right-to-be-forgotten you still need to disassociate or delete the source row; hashing reduces risk, it does not eliminate it.
Bloom filter indexes are deprecated
If you arrived here looking for hash-based data skipping in Delta, don't. As of April 2026, Bloom filter indexes are deprecated in Azure Databricks documentation, which Microsoft Fabric inherits. The current guidance:
Do not use Bloom filter indexes. Azure Databricks has deprecated this feature... They add write overhead, are difficult to tune, and are superseded by more effective alternatives.
Use V-Order (the Fabric default), Z-Order, or liquid clustering instead. If you have existing Bloom filter indexes, drop them and run VACUUM.
A production-grade SCD Type 2 pattern
Combining everything, here is what a reliable row signature looks like for Fabric Runtime 1.3:
from pyspark.sql import functions as F
NATURAL_KEY = ["customer_id"]
TRACKED_COLS = ["customer_name", "country_code", "region_code", "email_pseudonym", "tier"]
NULL_SENTINEL = "␀"
SEPARATOR = "\x1f"
def row_signature(cols):
"""SHA-256 of selected columns, NULL-safe, cross-engine reproducible."""
return F.sha2(
F.encode(
F.concat_ws(
SEPARATOR,
*[F.coalesce(F.col(c).cast("string"), F.lit(NULL_SENTINEL)) for c in cols]
),
"UTF-8"
),
256
)
incoming = source_df.withColumn("row_hash", row_signature(TRACKED_COLS))
current = (
spark.table("silver.dim_customer")
.filter(F.col("is_current") == True)
.select(*NATURAL_KEY, "row_hash", "surrogate_key")
)
joined = incoming.alias("i").join(current.alias("c"), on=NATURAL_KEY, how="left")
changed = joined.filter(
F.col("c.row_hash").isNotNull() & (F.col("i.row_hash") != F.col("c.row_hash"))
)
new = joined.filter(F.col("c.row_hash").isNull())
# Downstream MERGE handles the writes
This pattern survives NULL columns, matches HASHBYTES('SHA2_256', VARCHAR) over the same concatenated input, cannot collide at any realistic Fabric data volume, and is stable across Spark versions.
What to watch for
- Runtime 2.0 (Spark 4.0) Public Preview. Hash semantics stay the same across major Spark versions, but the Native Execution Engine path is being expanded to Runtime 2.0. Benchmark before migrating throughput-sensitive pipelines.
- VARCHAR vs NVARCHAR in Warehouse. Default string type behavior in Fabric Warehouse evolves. Spell out the encoding in your table DDL; do not rely on defaults.
- DAX HASH function. Still undocumented and unsupported. Until that changes, materialize hashes upstream in Spark or T-SQL.