Background Jobs and Queues That Don't Drop Work
How to build background job systems that survive crashes, retries, and duplicates: idempotency, dead-letter queues, and failure visibility.
The work you cannot afford to lose
Background jobs and queues are how a web application stays responsive while the slow, heavy, or unreliable work happens out of band — sending the email, charging the card, generating the report, calling the flaky third-party API. The user gets an instant response; the real work is handed to a worker to finish later. The catch is that out of band also means out of sight, and a naive queue silently drops work the first time a worker crashes mid-task or a network call times out. Building a job system that does not lose work is mostly about planning for the failures that are guaranteed to happen.
Assume every job runs at least once, maybe more
The foundational design decision is delivery semantics. Exactly-once delivery is famously hard to guarantee in a distributed system, so almost every practical queue offers at-least-once: it promises your job will run, and accepts that it might run more than once — because a worker can finish the work and then crash before acknowledging it, leaving the queue to redeliver. This is not a flaw to engineer around; it is the contract. The right response is to make your jobs safe to run twice.
Design for at-least-once delivery and you will sleep through the redeliveries. Design for exactly-once and the first crash will wake you.
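To make the redelivery scenario concrete, here is a minimal in-memory sketch. `AtLeastOnceQueue` and `simulate_crash_before_ack` are hypothetical names invented for illustration, and `requeue_unacked` stands in for whatever mechanism a real broker uses to notice that a worker never acknowledged its message.

```python
from collections import deque


class AtLeastOnceQueue:
    """Toy queue: a message is only forgotten once it is acknowledged."""

    def __init__(self):
        self._ready = deque()
        self._in_flight = {}
        self._next_id = 0

    def enqueue(self, payload):
        self._ready.append((self._next_id, payload))
        self._next_id += 1

    def receive(self):
        """Hand out the next message and remember it until it is acked."""
        if not self._ready:
            return None
        msg_id, payload = self._ready.popleft()
        self._in_flight[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        self._in_flight.pop(msg_id, None)

    def requeue_unacked(self):
        """Stand-in for an acknowledgement deadline expiring."""
        for msg_id, payload in sorted(self._in_flight.items()):
            self._ready.append((msg_id, payload))
        self._in_flight.clear()


def simulate_crash_before_ack():
    queue = AtLeastOnceQueue()
    queue.enqueue("charge order 42")
    side_effects = []

    # First delivery: the worker does the work, then crashes before acking.
    msg_id, payload = queue.receive()
    side_effects.append(payload)
    # (crash here: ack is never called)

    queue.requeue_unacked()

    # Second delivery: a healthy worker runs the same job and acks it.
    msg_id, payload = queue.receive()
    side_effects.append(payload)
    queue.ack(msg_id)
    return side_effects
```

The returned list contains the same side effect twice, even though the job was enqueued once. Nothing in the queue misbehaved; the duplicate is the at-least-once contract doing its job.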
Idempotency is the whole game
A job is idempotent when running it twice has the same effect as running it once. Get this right and at-least-once delivery becomes a non-event; get it wrong and a single redelivery double-charges a customer. The common techniques:
- Idempotency keys. Give each unit of work a stable identifier and record completed keys, so a redelivered job sees its key is done and exits cleanly.
- Natural idempotency. Prefer "set the status to shipped" over "increment the counter": operations that converge on the same end state no matter how many times they run.
- Upserts over blind inserts. A re-run then updates the existing row instead of creating a duplicate.
The discipline is to ask of every job: if this runs twice, what breaks? Then close that gap before it ships, not after the duplicate charge.
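The techniques above can be sketched with Python's standard `sqlite3` module. The table names, the `charge_once` and `mark_shipped` helpers, and the key format are all assumptions made for illustration. The sketch also assumes the side effect lives in the same database as the key, so both commit in one transaction; a charge made through an external payment provider would instead need the key passed along to that provider.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_jobs (idempotency_key TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE charges (order_id TEXT, amount_cents INTEGER)")
    conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT)")
    return conn


def charge_once(conn, idempotency_key, order_id, amount_cents):
    """Record the key and the charge together; skip if the key already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO processed_jobs (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
            conn.execute(
                "INSERT INTO charges (order_id, amount_cents) VALUES (?, ?)",
                (order_id, amount_cents),
            )
    except sqlite3.IntegrityError:
        # The primary key rejected a duplicate: this job already ran.
        return False
    return True


def mark_shipped(conn, order_id):
    """Upsert that converges on one 'shipped' row however often it runs."""
    with conn:
        conn.execute(
            "INSERT INTO orders (order_id, status) VALUES (?, 'shipped') "
            "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status",
            (order_id,),
        )
```

Calling `charge_once` twice with the same key leaves exactly one row in `charges`, and calling `mark_shipped` twice leaves exactly one order row. The `ON CONFLICT ... DO UPDATE` form needs SQLite 3.24 or newer.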
Retry with backoff, and know when to stop
Transient failures — a brief network blip, a momentarily overloaded dependency — are normal, and the answer is to retry. But retrying immediately and forever is how a small outage becomes a self-inflicted denial of service against your own dependency. Retry with exponential backoff so each attempt waits longer than the last, add a little randomness so a fleet of workers does not retry in lockstep, and cap the number of attempts. Some failures are permanent — malformed data, a deleted record — and retrying them is pure waste; detect those and fail fast rather than burning the full retry budget on a job that will never succeed.
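One way to express that policy is sketched below. `PermanentError`, `backoff_delay`, and `run_with_retries` are hypothetical names, and the base delay, cap, jitter fraction, and attempt limit are placeholder values rather than recommendations.

```python
import random
import time


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def backoff_delay(attempt, base=0.5, cap=30.0, jitter=0.25, rng=random):
    """Exponential delay (capped), stretched by up to `jitter` at random."""
    delay = min(cap, base * (2 ** attempt))
    return delay * (1 + rng.uniform(0, jitter))


def run_with_retries(job, max_attempts=5, sleep=time.sleep):
    """Run `job`, retrying transient errors with backoff up to a limit."""
    for attempt in range(max_attempts):
        try:
            return job()
        except PermanentError:
            # Malformed data, deleted record, etc.: fail fast, no retry.
            raise
        except Exception:
            if attempt == max_attempts - 1:
                # Retry budget exhausted: let the caller handle the failure.
                raise
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. Which exceptions count as permanent is an application decision; the sketch treats everything other than `PermanentError` as retryable, which is usually too generous for production code.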
Catch the failures in a dead-letter queue
A job that exhausts its retries must not simply vanish. The dead-letter queue is where these go — a separate holding area for work that failed permanently, preserving the job and the reason it died. Without one, your failures are invisible and you find out about them from an angry customer; with one, you have a queryable record of exactly what broke and why, and the ability to fix the cause and replay the jobs once it is resolved. Pair it with metrics on queue depth and job age so a backlog or a wave of failures shows up on a dashboard before it shows up in support tickets.
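The capture-and-replay idea can be sketched in memory as follows. `DeadLetter`, `DeadLetterQueue`, and `process` are hypothetical names; a real system would persist the entries in the broker or a database and would back off between attempts, which is omitted here for brevity.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DeadLetter:
    job: dict
    reason: str
    attempts: int
    failed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class DeadLetterQueue:
    def __init__(self):
        self.entries = []

    def capture(self, job, error, attempts):
        """Keep the job and the reason it died."""
        reason = f"{type(error).__name__}: {error}"
        self.entries.append(DeadLetter(job, reason, attempts))

    def depth(self):
        """Number of dead jobs; a natural value to export as a metric."""
        return len(self.entries)

    def replay(self, handler):
        """Re-run captured jobs after a fix; keep the ones that still fail."""
        still_dead = []
        for entry in self.entries:
            try:
                handler(entry.job)
            except Exception as error:
                still_dead.append(
                    DeadLetter(entry.job, f"{type(error).__name__}: {error}", entry.attempts + 1)
                )
        replayed = len(self.entries) - len(still_dead)
        self.entries = still_dead
        return replayed


def process(job, handler, dlq, max_attempts=3):
    """Try the job up to max_attempts times, then dead-letter it."""
    last_error = None
    for _ in range(max_attempts):
        try:
            handler(job)
            return True
        except Exception as error:
            last_error = error
    dlq.capture(job, last_error, max_attempts)
    return False
```

After the underlying cause is fixed, `dlq.replay(fixed_handler)` returns how many jobs succeeded on replay and leaves only the remaining failures in the queue.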
Stop one slow job from starving the rest
A subtle failure mode is not the job that crashes but the job that succeeds slowly, or the flood of jobs that arrives at once. A single queue worked by a shared pool of workers has a hidden coupling: a surge of slow report-generation jobs can hold every worker, and the password-reset emails behind them simply wait. The user who is locked out of your app is now also not getting their reset email, for reasons that have nothing to do with email. Isolation is the fix.
- Separate queues by priority or workload type, so a backlog of heavy batch work cannot block latency-sensitive jobs like notifications.
- Give long-running jobs their own worker pool, keeping the fast lane clear for quick, user-facing work.
- Set a sensible timeout on every job, so one stuck task is reaped and retried rather than occupying a worker indefinitely.
Reliability is not only about jobs that fail. It is about making sure a slow job in one corner cannot quietly take down the work that users are actually waiting on.
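A small sketch of lane isolation, using two thread pools from Python's standard library in place of real queue workers. The `ROUTES` table, the lane names, and the `IsolatedPools` class are assumptions for illustration only.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table: job type -> lane name.
ROUTES = {"report": "batch", "password_reset": "interactive"}


class IsolatedPools:
    """One worker pool per lane, so lanes cannot starve each other."""

    def __init__(self, workers_per_lane):
        self._pools = {
            lane: ThreadPoolExecutor(max_workers=count)
            for lane, count in workers_per_lane.items()
        }

    def submit(self, job_type, fn, *args):
        return self._pools[ROUTES[job_type]].submit(fn, *args)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)


def demo():
    release = threading.Event()
    pools = IsolatedPools({"batch": 2, "interactive": 2})

    # Saturate the batch lane: four slow "reports" for two workers, each
    # blocking until released (or for 10 seconds at most).
    reports = [pools.submit("report", release.wait, 10) for _ in range(4)]

    # The reset email still runs because its lane has its own workers.
    email = pools.submit("password_reset", lambda: "reset email sent")
    result = email.result(timeout=5)

    any_report_finished = any(f.done() for f in reports)
    release.set()
    pools.shutdown()
    return result, any_report_finished
```

`demo()` returns the email result while every report is still pending, which is the isolation property described above. Per-job timeouts are not shown: Python threads cannot be stopped from outside, so reaping a stuck job generally needs process-based workers or the timeout support of the queue library in use.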
Capacity planning closes the loop: watch how fast jobs arrive against how fast you drain them, and scale workers before the backlog grows faster than you can clear it. A queue that only ever fills up is not absorbing bursts — it is hiding the moment you ran out of capacity.
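The arithmetic behind that comparison is short enough to write down. This is a back-of-the-envelope sketch that assumes steady arrival rates and a uniform average job duration, neither of which holds exactly in practice; the function names and the 70% utilization target are assumptions.

```python
import math


def workers_needed(arrivals_per_sec, avg_job_seconds, target_utilization=0.7):
    """Workers required to keep up while leaving headroom for bursts."""
    return math.ceil(arrivals_per_sec * avg_job_seconds / target_utilization)


def backlog_after(seconds, backlog, arrivals_per_sec, workers, avg_job_seconds):
    """Projected queue depth if arrival and drain rates stay constant."""
    drain_per_sec = workers / avg_job_seconds
    return max(0.0, backlog + (arrivals_per_sec - drain_per_sec) * seconds)
```

For example, at 20 jobs per second averaging 0.5 seconds each, `workers_needed(20, 0.5)` gives 15. With only 8 workers the drain rate is 16 jobs per second, so `backlog_after(60, 0, 20, 8, 0.5)` gives 240.0: the queue grows by 240 jobs every minute and never recovers on its own.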
How BSH can help
BSH Technologies builds background processing that holds onto the work it is given — idempotent jobs, sane retry-and-backoff policies, dead-letter capture, and the visibility to see failures before customers do. We have rescued queue systems that were quietly dropping payments and notifications, and rebuilt them to be boringly reliable. If your async jobs are a source of mystery bugs and lost work, let us help you make them dependable.