Cryptographic Hash Function Collisions: From MD5 to SHA-3 Security Analysis

Introduction

Cryptographic hash functions are fundamental to digital security, yet finding collisions has proven devastating for several widely-deployed algorithms. Understanding collision attacks is essential for secure system design and hash function selection.

Hash Function Security Properties

A cryptographic hash H: {0,1} → {0,1}^n* must satisfy:

Pre-image Resistance

Given h, computationally infeasible to find m such that H(m) = h
Security requirement: 2^n operations

2. Second Pre-image Resistance

Given m₁, infeasible to find m₂ ≠ m₁ such that H(m₁) = H(m₂)
Security requirement: 2^n operations

3. Collision Resistance

Infeasible to find any m₁ ≠ m₂ such that H(m₁) = H(m₂)
Security requirement: 2^(n/2) operations (birthday bound)

Birthday Attack

Theoretical Foundation

Birthday Paradox: With 23 random people, P(shared birthday) > 50%

Generalized: Drawing √(2N) samples from N-element set gives collision probability ≈ 50%.

For n-bit hash:

N = 2^n possible hashes
√(2·2^n) = √2 · 2^(n/2) ≈ 2^(n/2) hashes needed

Generic Algorithm

def birthday_attack(hash_func, output_bits):
    """
    Generic collision finder
    Expected work: 2^(output_bits/2)
    """
    import random

    seen = {}  # Hash table: hash -> message
    attempts = 0

    while True:
        # Generate random message
        msg = random.randbytes(64)
        h = hash_func(msg)

        if h in seen and seen[h] != msg:
            # Collision found!
            return (seen[h], msg, h)

        seen[h] = msg
        attempts += 1

        # Expected attempts ≈ √(π/2 · 2^n) ≈ 1.25 · 2^(n/2)
        if attempts % 1000000 == 0:
            print(f"Attempts: {attempts:,}")

Complexity: O(2^(n/2)) time, O(2^(n/2)) space

Limits practical security:

64-bit hash: ~2^32 = 4 billion hashes
128-bit hash: ~2^64 = 18 quintillion hashes
256-bit hash: ~2^128 hashes (infeasible)

MD5 Collision Attacks

Algorithm Structure

MD5 processes 512-bit blocks using 4 rounds of 16 steps each:

def md5_round(A, B, C, D, M, s, t, func):
    """
    Single MD5 step
    func ∈ {F, G, H, I} for rounds 1-4
    """
    # Round functions
    F = lambda x,y,z: (x & y) | (~x & z)
    G = lambda x,y,z: (x & z) | (y & ~z)
    H = lambda x,y,z: x ^ y ^ z
    I = lambda x,y,z: y ^ (x | ~z)

    temp = (A + func(B, C, D) + M + t) & 0xFFFFFFFF
    temp = rotate_left(temp, s)
    A = (B + temp) & 0xFFFFFFFF

    return A

Differential Cryptanalysis

Wang et al. (2004) discovered differential path:

ΔM → ΔH = 0

where ΔM is carefully crafted message difference

Key Insight: Certain message differences propagate predictably through MD5 rounds.

Practical Collision (2004)

# Collision found by Wang et al.
M1 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c"
    "2fcab58712467eab4004583eb8fb7f89"
    # ... (full 128 bytes)
)

M2 = bytes.fromhex(
    "d131dd02c5e6eec4693d9a0698aff95c"
    "2fcab50712467eab4004583eb8fb7f89"  # Differs in byte 18
    # ... (full 128 bytes)
)

assert md5(M1) == md5(M2)
assert M1 != M2

Computation: ~1 hour on 2004 hardware Modern hardware: Seconds

Chosen-Prefix Collisions

More dangerous: Control prefix of both messages:

def chosen_prefix_collision(prefix1, prefix2):
    """
    Find M1, M2 such that:
    H(prefix1 || M1) = H(prefix2 || M2)
    """
    # 1. Compute hash state after prefixes
    state1 = md5_state(prefix1)
    state2 = md5_state(prefix2)

    # 2. Find collision block
    collision = find_collision_block(state1, state2)

    return collision

Impact: Forge certificates, documents with same hash but different content

Timeline:

2007: Chosen-prefix in ~1 day
2012: Chosen-prefix in ~10 hours
2019: Chosen-prefix in ~10 seconds

SHA-1 Collision Attacks

Algorithm Differences

SHA-1 improvements over MD5:

160-bit output (vs 128-bit)
80 rounds (vs 64)
Message expansion
Better mixing functions

Theoretical Attack (2005)

Wang et al. showed theoretical complexity:

2^69 operations (vs 2^80 birthday bound)

Not practical until 2017.

SHAttered Attack (2017)

Google/CWI demonstrated first practical collision:

PDF prefix technique:
- Identical prefix (header)
- Two different JPEG images
- Crafted collision blocks
- Identical suffix

Resources:

6,500 CPU-years
110 GPU-years
$110,000 in cloud computing

# Simplified collision structure
def create_sha1_collision():
    """
    Structure of SHAttered PDFs
    """
    prefix = create_pdf_header()

    # Image 1: Blue square
    image1 = jpeg_blue_square()

    # Image 2: Red square
    image2 = jpeg_red_square()

    # Find collision blocks C1, C2
    C1, C2 = shattered_collision_blocks()

    pdf1 = prefix + C1 + image1
    pdf2 = prefix + C2 + image2

    assert sha1(pdf1) == sha1(pdf2)
    assert pdf1 != pdf2

    return (pdf1, pdf2)

Result: Two valid PDFs displaying different images with identical SHA-1 hash.

Chosen-Prefix for SHA-1 (2020)

Leurent & Peyrin reduced cost:

Computation: ~10^18 operations
Cost: ~$100,000 (2020)
Timeline: ~2 months on 900 GPUs

Impact: Theoretical PGP/GnuPG key impersonation.

SHA-2 Security Analysis

Family Overview

Variant	Output	Block	Rounds	Status
SHA-224	224-bit	512	64	Secure
SHA-256	256-bit	512	64	Secure
SHA-384	384-bit	1024	80	Secure
SHA-512	512-bit	1024	80	Secure

Best Known Attacks

SHA-256: No collision attacks better than birthday.

Best reduced-round attacks:

31/64 rounds: 2^65.5 (theoretical, not practical)
46/64 rounds: Pseudo-collision in compression function
Full 64 rounds: No attack

Security margin: ~31 rounds (48% of total)

Implementation

def sha256_compression(state, block):
    """
    SHA-256 compression function
    Processes 512-bit block
    """
    # Initialize working variables
    a, b, c, d, e, f, g, h = state

    # Message schedule
    W = [0] * 64
    for t in range(16):
        W[t] = int.from_bytes(block[t*4:(t+1)*4], 'big')

    for t in range(16, 64):
        s0 = ror(W[t-15], 7) ^ ror(W[t-15], 18) ^ (W[t-15] >> 3)
        s1 = ror(W[t-2], 17) ^ ror(W[t-2], 19) ^ (W[t-2] >> 10)
        W[t] = (W[t-16] + s0 + W[t-7] + s1) & 0xFFFFFFFF

    # 64 rounds
    for t in range(64):
        S1 = ror(e, 6) ^ ror(e, 11) ^ ror(e, 25)
        ch = (e & f) ^ (~e & g)
        temp1 = (h + S1 + ch + K[t] + W[t]) & 0xFFFFFFFF

        S0 = ror(a, 2) ^ ror(a, 13) ^ ror(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        temp2 = (S0 + maj) & 0xFFFFFFFF

        h, g, f, e, d, c, b, a = \
            g, f, e, (d+temp1)&0xFFFFFFFF, c, b, a, (temp1+temp2)&0xFFFFFFFF

    # Add to state
    return [(x + y) & 0xFFFFFFFF for x, y in
            zip([a,b,c,d,e,f,g,h], state)]

SHA-3 (Keccak)

Sponge Construction

SHA-3 uses different paradigm:

Sponge[f, pad, r](M) = Absorb → Squeeze

Absorb: XOR input blocks into state
Squeeze: Output extracted from state

def keccak_sponge(message, rate, capacity, output_len):
    """
    SHA-3 sponge construction
    State: 1600 bits = rate + capacity
    """
    state = [0] * 25  # 25 × 64-bit lanes

    # Absorbing phase
    blocks = pad_message(message, rate)
    for block in blocks:
        for i in range(rate // 64):
            state[i] ^= block[i]
        state = keccak_f(state)  # Permutation

    # Squeezing phase
    output = []
    while len(output) < output_len:
        output.extend(state[:rate // 64])
        state = keccak_f(state)

    return output[:output_len]

Security Analysis

SHA3-256 parameters:

Rate (r): 1088 bits
Capacity (c): 512 bits (2× output for security)
Rounds: 24

Best attacks: No practical collision attacks

Reduced-round analysis:

8/24 rounds: 2^384 complexity (theoretical)
Full 24 rounds: No attacks

Security margin: 16 rounds (67% of total)

Practical Implications

Certificate Forgery

MD5/SHA-1 collisions enable certificate attacks:

def forge_certificate():
    """
    Theoretical MD5 collision attack on certificates
    (Prevented by modern CAs)
    """
    # 1. Create valid CSR (Certificate Signing Request)
    legit_csr = create_csr("example.com")

    # 2. Create malicious CSR with same MD5
    evil_csr = create_csr("*.com", collision_with=legit_csr)

    # 3. Submit legit_csr to CA
    cert = ca_sign(legit_csr)  # Uses MD5 (old CAs)

    # 4. Replace CSR content with evil_csr
    # Cert signature still valid due to MD5 collision!

    return forged_cert

Mitigations:

CAs banned MD5 in 2011
SHA-1 banned in 2016
Certificates now use SHA-256+

Git Commits

Git uses SHA-1 for commit IDs:

# Potential attack (theoretical)
git commit -m "Benign change"
# Commit: abc123...

# Create collision
malicious_commit = create_sha1_collision(abc123)

# Replace commit object
# Same SHA-1, different content!

Mitigation: Git transitioning to SHA-256 (2020+)

Migration Recommendations

Algorithm Selection (2024)

Use Case	Recommended	Avoid
Digital signatures	SHA-256, SHA-512	MD5, SHA-1
File integrity	SHA3-256, BLAKE3	MD5
Certificates	SHA-256	SHA-1
Password hashing	Argon2, bcrypt	Any fast hash
Git/VCS	SHA-256	SHA-1

Safe Hash Usage

# INSECURE: Direct hash comparison
def verify_download_bad(file_data, expected_md5):
    return md5(file_data) == expected_md5

# SECURE: Use HMAC with secret key
def verify_download_good(file_data, key, expected_mac):
    mac = hmac_sha256(key, file_data)
    return constant_time_compare(mac, expected_mac)

# SECURE: Use collision-resistant hash
def verify_download_better(file_data, expected_sha256):
    return sha256(file_data) == expected_sha256

Conclusion

Collision resistance is paramount for hash function security. MD5 and SHA-1 are completely broken for collision resistance and must not be used in security-critical contexts.

Modern systems should exclusively use SHA-256/SHA-512 or SHA-3 family. For new designs, consider BLAKE3 for performance-critical applications.

The cryptographic community’s response to hash function breaks demonstrates the importance of:

Conservative security margins
Diversity in cryptographic primitives
Proactive migration before attacks become practical

References

Wang, X., et al. (2005). “Finding Collisions in the Full SHA-1”
Stevens, M., et al. (2017). “The first collision for full SHA-1”
Leurent, G., Peyrin, T. (2020). “SHA-1 is a Shambles”
NIST (2015). “SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions”