Skip to content
psistla.com

Hash Functions in Microsoft Fabric: Decision Framework

Beginner-friendly guide to choosing hash functions across Microsoft Fabric. Why Spark hash() breaks at scale, how to make SHA-256 match across Spark, Warehouse, and KQL, and a top-5 comparison on F64 SKU.

Prasanth Sistla8 min read

TL;DR

In Microsoft Fabric, the right hash function depends on whether the hash crosses engine boundaries: use xxhash64 for single-engine Spark work, sha2(..., 256) when the hash must match across Spark, Warehouse, and KQL, and never use Spark's default hash() for change detection (collides at ~77,000 rows).

The short version:

  • Never use Spark hash() to track row changes or build keys. It is a 32-bit hash that starts colliding around 77,000 rows.
  • Use xxhash64 when the hash stays inside Spark and the table is under a few billion rows.
  • Use sha2(..., 256) when the hash leaves Spark (Warehouse, KQL, Python) or has to live for years.
  • For PII, use sha2(value + salt, 256) with the salt stored in Key Vault.
  • For SHA-256 to match across engines, everyone must use the same byte encoding. Pick UTF-8 and enforce it.

The rest of this article shows why, with one realistic benchmark on an F64 capacity.

Info

Scope: May 21, 2026. Microsoft Fabric Runtime 1.3 (current GA, Apache Spark 3.5, Delta Lake 3.2) plus Warehouse, Eventhouse KQL, and DAX. Runtime 2.0 (Spark 4.0) is in Public Preview and out of scope here.

The trap most teams fall into

A teammate needs to detect changes in an SCD Type 2 pipeline. They reach for the most familiar PySpark function:

df = df.withColumn("row_hash", F.hash(*business_key_cols))

It compiles. It runs. It ships. Twelve months later, the dimension is short rows. The audit log shows the pipeline processed them; the table does not have them. The bug is not in the SCD2 merge logic. The bug is that hash() in Spark is a 32-bit Murmur3 algorithm, and the table has crossed 100 million rows.

How likely is a collision? At 77,163 rows in a 32-bit space, the probability of at least one collision is over 50% (the birthday paradox bound). At 100 million rows, the expected number of collisions is roughly 1.16 million. That is not a corner case. That is the default.

Hash collision probability by bit widthFour ranked tiers from danger (32-bit) to safe (256-bit), each labeled with algorithm and row count for 50% collision probability.When does a hash collision become probable?50% collision probability by hash width (birthday bound)132-bitSpark hash() / T-SQL CHECKSUMMurmur3 32-bit spaceDANGER50% collision at ~77,000 rows264-bitSpark xxhash64 / KQL hashxxHash 64-bit spaceCAUTION50% collision at ~5 billion rows3128-bitMD5128-bit spaceSAFE50% collision at ~22 quintillion rows4256-bitSHA-2 256256-bit spaceSAFE PLUS50% collision astronomically improbable

Border thickness, rank number, and plain-text safety label all encode severity, so the diagram is readable without color.

Rule of thumb

Never use Spark hash() for change detection, idempotency keys, or row signatures. Use xxhash64 for single-engine work above one million rows. Use sha2(..., 256) when the hash crosses an engine boundary or has to live for years.

The top 5 hash functions in Microsoft Fabric

These are the five you will actually pick from. Stability means the algorithm is guaranteed not to change between platform versions.

FunctionWhereWidthStable?Use it for
hash()Spark32-bitYesNothing user-facing. Engine-internal shuffle only.
xxhash64()Spark64-bitYesFast row signatures inside Spark, under a few billion rows
sha2(col, 256)Spark256-bitYesCross-engine signatures, PII (with salt), long-lived keys
HASHBYTES('SHA2_256', ...)Warehouse / SQL endpoint256-bitYesCross-engine signatures read by T-SQL
hash_sha256()KQL Eventhouse256-bitYesCross-engine signatures read by KQL

Two notes on stability:

  • KQL hash() is not stable. The docs say the algorithm "may change." Never persist its output. Use hash_sha256() for anything that crosses the query boundary.
  • DAX has a HASH function. It is undocumented and unsupported. Do not use it. Materialize hashes upstream.

How they actually perform on F64

On an F64 capacity (medium Spark pool, around 64 vCores total), hashing a 100 million row table with five string columns gives roughly these numbers. They are directional only. The bottleneck at this scale is usually I/O, not CPU; run your own benchmark before defending these to finance.

FunctionWidthCompute time (100M × 5 cols)Relative CU costCross-engine readable
xxhash64()64-bit~2 to 5 secondsLowestNo (Spark only)
hash() (Murmur3)32-bit~2 to 5 secondsLowestNo (Spark only)
md5()128-bit~6 to 12 secondsLowYes
sha2(col, 256)256-bit~15 to 35 secondsMediumYes (matches T-SQL and KQL when encoded UTF-8)
HASHBYTES('SHA2_256', col)256-bit~25 to 70 seconds (Warehouse engine)Medium-highYes (origin engine)

How to read this:

  • For a nightly batch of 100 million rows, the difference between xxhash64 and sha2-256 is roughly 20 to 30 seconds. That is rarely the cost driver, so pick the right tool for correctness, not speed.
  • For a million-row hourly job, sha2-256 finishes well under a second on F64. Speed is a non-issue.
  • If you are hashing 10 billion rows per run, run an actual benchmark. The numbers above are not extrapolatable that far.

The relative ordering (xxHash > Murmur3 > MD5 > SHA-1 > SHA-256 > SHA-512) is consistent across published benchmarks. Absolute throughput depends on CPU instruction set; the Fabric Native Execution Engine may shift these by accelerating cryptographic implementations differently.

Making SHA-256 match across engines

The most common cross-engine pattern in Microsoft Fabric: Spark writes a hash to a Delta table, T-SQL Warehouse reads it through the SQL analytics endpoint, KQL queries it via mirrored Eventhouse, Python verifies it from a notebook. For the hashes to match byte for byte, the input must be encoded the same way in every engine.

The silent failure

Fabric Warehouse table columns can only be VARCHAR (UTF-8 collation, no NVARCHAR column type). The encoding mismatch shows up when NVARCHAR slips in via a CAST expression, a variable, or a stored procedure migrated from SQL Server or Synapse. Different bytes in, different hash out, no error message.

Try it yourself with the literal customer ID CUST-12345 (a realistic SCD2 key):

-- T-SQL Warehouse, VARCHAR (UTF-8)
SELECT HASHBYTES('SHA2_256', CAST('CUST-12345' AS VARCHAR(16)));
-- 0x0E3A6606CF6E32F1...2B6DB211

-- T-SQL Warehouse, NVARCHAR cast (UTF-16LE; not a column type in Fabric Warehouse, but legal in expressions)
SELECT HASHBYTES('SHA2_256', CAST('CUST-12345' AS NVARCHAR(16)));
-- 0x9ABD0CEE9E00EBE6...8E3E6156

-- Spark, default sha2 (UTF-8 internally)
SELECT sha2('CUST-12345', 256);
-- 0e3a6606cf6e32f1...2b6db211

-- Spark, forced UTF-16LE to match T-SQL NVARCHAR
SELECT sha2(encode('CUST-12345', 'UTF-16LE'), 256);
-- 9abd0cee9e00ebe6...8e3e6156

-- KQL Eventhouse
print hash_sha256('CUST-12345')
// 0e3a6606cf6e32f1...2b6db211

-- Python
import hashlib; hashlib.sha256('CUST-12345'.encode('utf-8')).hexdigest()
# '0e3a6606cf6e32f1...2b6db211'

Same string. Two hashes. The difference is the encoding.

SHA-256 of the string CUST-12345 across Microsoft Fabric enginesUTF-8 path (Spark default, T-SQL VARCHAR, KQL, Python) produces hash 0e3a6606. UTF-16LE path (Spark with encode, T-SQL NVARCHAR) produces hash 9abd0cee.Same string. Different encoding. Different hash.SHA-256 of "CUST-12345" across Microsoft Fabric enginesINPUT STRING"CUST-12345"encode UTF-8encode UTF-16LEPATH A: UTF-8FUNCTIONS THAT TAKE THIS PATH• Spark sha2(col, 256)• T-SQL HASHBYTES('SHA2_256', VARCHAR)• KQL hash_sha256()• Python hashlib.sha256(b'CUST-12345')ALL FOUR PRODUCE0e3a6606 cf6e32f1c923d9e3 3109beb7dde1f47c 226fa4f5a9600f61 2b6db211(SHA-256 hex)PATH B: UTF-16LEFUNCTIONS THAT TAKE THIS PATH• Spark sha2(encode(col, 'UTF-16LE'), 256)• T-SQL HASHBYTES('SHA2_256', NVARCHAR)BOTH PRODUCE9abd0cee 9e00ebe6f9e8e32e c1f150ad4d774e7c 796e8c47fefbdcd4 8e3e6156(SHA-256 hex)

Two paths, two hashes. The thick border on Path B marks the one that breaks silently when an NVARCHAR cast slips into a HASHBYTES call.

One encoding, everywhere

Pick UTF-8 across the platform. That means VARCHAR on T-SQL, default sha2 on Spark, native hash_sha256 on KQL, and 'utf-8' in Python. Write it on the engineering wiki. Add it to code review.

Handling NULLs without breaking the hash

Two Spark functions handle NULLs differently, and both can ruin a row signature.

  • concat('a', NULL, 'b') returns NULL. Any NULL in any column kills the entire hash.
  • concat_ws('|', 'a', NULL, 'b') returns 'a|b'. The NULL is silently dropped, so ('a', NULL, 'b') and ('a', '', 'b') produce the same hash even though they are different rows.

The safe pattern is to replace every NULL with a sentinel character that cannot appear in real data, then concatenate:

from pyspark.sql import functions as F

KEY_COLS = ["customer_id", "region_code", "product_sku"]
NULL_SENTINEL = "␀"       # U+2400 SYMBOL FOR NULL
SEPARATOR = "\x1f"         # U+001F UNIT SEPARATOR

row_hash = F.sha2(
    F.encode(
        F.concat_ws(
            SEPARATOR,
            *[F.coalesce(F.col(c).cast("string"), F.lit(NULL_SENTINEL)) for c in KEY_COLS]
        ),
        "UTF-8"
    ),
    256
)
df = df.withColumn("row_hash", row_hash)

Three properties: NULLs are distinct from empty strings, the separator cannot collide with real data, and the result matches HASHBYTES('SHA2_256', VARCHAR) over the same concatenated string.

PII pseudonymization

For PII columns (email, phone, ID), four things matter:

  1. Cryptographic strength rules out Murmur3, xxHash, MD5, and SHA-1.
  2. A salt prevents rainbow-table attacks against common values.
  3. The salt must come from Key Vault, never the notebook.
  4. The same input must produce the same hash so joins still work.
from pyspark.sql import functions as F

# notebookutils is a Fabric notebook built-in; no import needed.
# Retrieve salt from Key Vault, never hardcode.
salt = notebookutils.credentials.getSecret(
    "https://my-kv.vault.azure.net/", "pii-pepper-v1"
)

pii_hash = F.sha2(F.concat(F.col("email"), F.lit(salt)), 256)
df = df.withColumn("email_pseudonym", pii_hash).drop("email")

Version the salt name (pii-pepper-v1), drop the original PII column in the same transformation, and document the encoding. For GDPR right-to-be-forgotten you still need to disassociate or delete the source row; hashing reduces risk, it does not eliminate it.

Bloom filter indexes are deprecated

If you arrived here looking for hash-based data skipping in Delta, don't. As of April 2026, Bloom filter indexes are deprecated in Azure Databricks documentation, which Microsoft Fabric inherits. The current guidance:

Do not use Bloom filter indexes. Azure Databricks has deprecated this feature... They add write overhead, are difficult to tune, and are superseded by more effective alternatives.

Use V-Order (the Fabric default), Z-Order, or liquid clustering instead. If you have existing Bloom filter indexes, drop them and run VACUUM.

A production-grade SCD Type 2 pattern

Combining everything, here is what a reliable row signature looks like for Fabric Runtime 1.3:

from pyspark.sql import functions as F

NATURAL_KEY = ["customer_id"]
TRACKED_COLS = ["customer_name", "country_code", "region_code", "email_pseudonym", "tier"]
NULL_SENTINEL = "␀"
SEPARATOR = "\x1f"

def row_signature(cols):
    """SHA-256 of selected columns, NULL-safe, cross-engine reproducible."""
    return F.sha2(
        F.encode(
            F.concat_ws(
                SEPARATOR,
                *[F.coalesce(F.col(c).cast("string"), F.lit(NULL_SENTINEL)) for c in cols]
            ),
            "UTF-8"
        ),
        256
    )

incoming = source_df.withColumn("row_hash", row_signature(TRACKED_COLS))

current = (
    spark.table("silver.dim_customer")
    .filter(F.col("is_current") == True)
    .select(*NATURAL_KEY, "row_hash", "surrogate_key")
)

joined = incoming.alias("i").join(current.alias("c"), on=NATURAL_KEY, how="left")

changed = joined.filter(
    F.col("c.row_hash").isNotNull() & (F.col("i.row_hash") != F.col("c.row_hash"))
)
new = joined.filter(F.col("c.row_hash").isNull())

# Downstream MERGE handles the writes

This pattern survives NULL columns, matches HASHBYTES('SHA2_256', VARCHAR) over the same concatenated input, cannot collide at any realistic Fabric data volume, and is stable across Spark versions.

What to watch for

  1. Runtime 2.0 (Spark 4.0) Public Preview. Hash semantics stay the same across major Spark versions, but the Native Execution Engine path is being expanded to Runtime 2.0. Benchmark before migrating throughput-sensitive pipelines.
  2. VARCHAR vs NVARCHAR in Warehouse. Default string type behavior in Fabric Warehouse evolves. Spell out the encoding in your table DDL; do not rely on defaults.
  3. DAX HASH function. Still undocumented and unsupported. Until that changes, materialize hashes upstream in Spark or T-SQL.

Comments