recall

← recall

hot partition pattern

Most partitions are fine, but one partition gets a disproportionate share of traffic. That one shard becomes the bottleneck for the whole system regardless of how much capacity you have elsewhere.

Most partitions are fine, but one partition gets a disproportionate share of traffic. That one shard becomes the bottleneck for the whole system regardless of how much capacity you have elsewhere.

symptoms

  • one shard at high CPU/IO while others idle
  • throttling on a small subset of keys
  • scaling the cluster doesn't help because the hot partition is the bottleneck

causes

  • partition key with skewed distribution (e.g. tenant_id where one tenant is huge)
  • time-based partition keys (everyone writes 'today')
  • celebrity / popular item access

fixes

  • repartition with higher-cardinality key
  • salt / shard-suffix the hot key
  • cache the hot keys
  • request hedging for read hotspots

you might say

  • we've got a hot key
  • this one shard is melting
  • the partition key isn't distributing well

related

aliases: hot shard, hot key, hotspot

topics: partitioning, scaling, failure-modes

references: