Application retry logic is one of those areas where most teams have something, and most of what they have is slightly wrong. The two failure modes are equally bad:
- Retry too much: a non-idempotent operation runs twice, the user is double-charged, the bug report blames the database.
- Retry too little: a transient failure surfaces to the user as a fatal error, even though the database recovered immediately after.
The right pattern is to retry exactly the operations that are safe to retry, with the right backoff and the right limits. Here is the framework.
Errors that should always retry
Three Postgres error classes are unambiguously transient:
1. Connection errors. The connection dropped — network blip, database restart, pool reconfiguration. If the failure happened while connecting or before the query was sent, the operation never reached the database, and retrying on a fresh connection is safe regardless of idempotency. If it dropped mid-flight (worst case: during COMMIT), the outcome is unknown; that is exactly the case the idempotency section below exists for.
Error codes (Postgres SQLSTATE): 08000, 08003, 08006, 08001, 08004.
2. Deadlock detected. Two transactions waited on each other; Postgres killed one. The killed transaction did not commit. Retrying is safe for any idempotent transaction.
Error code: 40P01.
3. Serialization failure. Under SERIALIZABLE or REPEATABLE READ, Postgres detected a conflict and aborted. The transaction did not commit. Retrying is safe for idempotent transactions.
Error code: 40001.
For all three, the retry pattern is the same:
import random
import time

# ConnectionError is a Python builtin; these two stand in for your driver's
# exceptions (e.g. psycopg's errors.DeadlockDetected, errors.SerializationFailure).
class DeadlockError(Exception): pass
class SerializationError(Exception): pass

def execute_with_retry(operation, max_retries=3):
    for attempt in range(max_retries):
        try:
            return operation()
        except (ConnectionError, DeadlockError, SerializationError):
            if attempt == max_retries - 1:
                raise
            # 100ms, 200ms, 400ms, ... plus a little jitter.
            time.sleep(2 ** attempt * 0.1 + random.random() * 0.05)
    raise RuntimeError('unreachable')
Exponential backoff with jitter, capped at a few attempts.
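One design note: the callable should be the whole transaction, not a single statement, so a retry replays everything from the top on a fresh connection. A sketch of a call site, assuming psycopg 3; the DSN, table, and function name are illustrative, and with psycopg you would catch its OperationalError and specific error classes in the except clause instead of the stand-ins above:

import psycopg

def place_order(customer_id, total_cents):
    # A fresh connection per attempt: after a connection error the old one
    # is unusable, and a new transaction starts from a clean slate.
    with psycopg.connect("dbname=shop") as conn:
        with conn.transaction():
            conn.execute(
                "INSERT INTO orders (customer_id, total_cents) VALUES (%s, %s)",
                (customer_id, total_cents),
            )

execute_with_retry(lambda: place_order(42, 1999))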
Errors that should never retry
- Constraint violations (23xxx): unique key, foreign key, check constraint. The data is wrong. Retrying produces the same error.
- Syntax errors (42xxx): the SQL is broken. Retrying will not fix it.
- Permission errors (42501): the connection does not have access. Retrying will not change that.
- Out of memory (53200): the database is in trouble. Hammering it with retries makes things worse.
These should fail fast. The application should treat them as bugs and surface them to monitoring, not retry.
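In code, the categorization can be a small SQLSTATE map. A sketch, assuming a driver that exposes the SQLSTATE (psycopg 3 does, via the exception's sqlstate attribute); the category names are mine:

RETRYABLE = {
    "08000", "08003", "08006", "08001", "08004",  # connection errors
    "40P01",                                      # deadlock_detected
    "40001",                                      # serialization_failure
}

def classify(sqlstate):
    """Map a Postgres SQLSTATE to a retry decision."""
    if sqlstate is None or sqlstate in RETRYABLE:
        return "retry"        # no SQLSTATE usually means a client-side connection failure
    if sqlstate == "57014":   # query_canceled (statement timeout)
        return "surface"      # hand the decision to a higher layer
    return "fail"             # constraint, syntax, permission, OOM, ...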
The middle case: timeout errors
Statement timeouts (SQLSTATE 57014, query_canceled) are ambiguous:
- The query took too long because the database was overloaded → retry might succeed if load drops.
- The query took too long because it is fundamentally slow → retry will time out again.
My default is to NOT retry statement timeouts at the database layer. Surface them and let the application decide. If the application has rate-limiting or queueing, it can choose to retry at a higher level.
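What surfacing looks like in code: translate the driver error into a domain error instead of looping. A sketch with psycopg 3, whose errors.QueryCanceled corresponds to 57014; ReportTimeout and the query are illustrative:

import psycopg
from psycopg import errors

class ReportTimeout(Exception):
    """The report query hit statement_timeout; the caller decides what to do."""

def run_report(conn):
    try:
        return conn.execute("SELECT count(*) FROM orders").fetchall()
    except errors.QueryCanceled as exc:  # SQLSTATE 57014
        # Not retried here: the query may be fundamentally too slow.
        raise ReportTimeout("statement_timeout exceeded") from exc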
Idempotency is the precondition
The rule that trips people up: even "safe" retries (connection errors, deadlocks) are only safe for idempotent operations.
Consider a transaction like:
BEGIN;
INSERT INTO orders (...) VALUES (...);
UPDATE inventory SET stock = stock - 1 WHERE id = X;
COMMIT;
If this fails with a connection error mid-COMMIT, you do not know whether it committed. Retrying could produce two orders. The fix is to make the operation idempotent before allowing retry.
Patterns for idempotency
1. Client-generated UUIDs. The application generates an idempotency key. The database has a UNIQUE constraint on it. A retry produces the same UUID, so the second insert hits the constraint: either it fails with a unique violation that the application treats as success, or, with ON CONFLICT DO NOTHING as below, it becomes a harmless no-op.
INSERT INTO orders (id, customer_id, total_cents)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO NOTHING
RETURNING id;
The ON CONFLICT DO NOTHING makes the insert idempotent; on a retried call the RETURNING clause yields no row, which is how the application detects that the work was already done.
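On the application side that looks like checking for an empty result. A sketch (psycopg 3 again; the helper name is illustrative):

import uuid

def create_order(conn, key: uuid.UUID, customer_id, total_cents):
    row = conn.execute(
        "INSERT INTO orders (id, customer_id, total_cents)"
        " VALUES (%s, %s, %s)"
        " ON CONFLICT (id) DO NOTHING"
        " RETURNING id",
        (key, customer_id, total_cents),
    ).fetchone()
    # row is None when the key already exists: a retry landed after a
    # successful first attempt. That is success, not an error.
    return row[0] if row else key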
2. Conditional updates. Use a version column or condition that ensures the operation only applies once.
UPDATE inventory
SET stock = stock - 1,
    last_order_id = $2
WHERE id = $1 AND last_order_id < $2;
Because the update records the order id as it deducts, a retry of the same order finds last_order_id < $2 false and affects zero rows. (Without the SET last_order_id = $2, the guard would never trip and the retry would deduct again.)
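The application-side signal is the affected-row count, via the standard DB-API rowcount attribute. A sketch:

def deduct_stock(conn, item_id, order_id) -> bool:
    """True if this call performed the deduction, False if an
    earlier attempt already had (idempotent no-op)."""
    cur = conn.execute(
        "UPDATE inventory"
        " SET stock = stock - 1, last_order_id = %s"
        " WHERE id = %s AND last_order_id < %s",
        (order_id, item_id, order_id),
    )
    return cur.rowcount == 1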
3. Idempotency tokens at the application layer. A separate table or cache that records which operations have been attempted. Retries check the token before doing the work.
INSERT INTO idempotency_keys (key, response_body)
VALUES ($1, NULL)
ON CONFLICT (key) DO UPDATE SET key = EXCLUDED.key
RETURNING (xmax = 0) AS inserted;
If inserted is true, the application does the work and writes the response into response_body. If it is false, the key already existed and the application returns the cached response. (The no-op DO UPDATE is what makes Postgres hand back the conflicting row; plain DO NOTHING returns no row on conflict. Testing xmax = 0 to detect a fresh insert is a widely used trick, though it leans on undocumented system-column behavior.)
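The full request flow around that statement, sketched with psycopg 3; do_work is a hypothetical handler. One real-world caveat: a first attempt that crashed after inserting the key but before writing the response leaves response_body NULL, and a production version needs a policy for that window:

def handle(conn, key, do_work):
    inserted, = conn.execute(
        "INSERT INTO idempotency_keys (key, response_body) VALUES (%s, NULL)"
        " ON CONFLICT (key) DO UPDATE SET key = EXCLUDED.key"
        " RETURNING (xmax = 0) AS inserted",
        (key,),
    ).fetchone()
    if inserted:
        body = do_work()
        conn.execute(
            "UPDATE idempotency_keys SET response_body = %s WHERE key = %s",
            (body, key),
        )
        return body
    # Key already present: return the response the first attempt recorded.
    return conn.execute(
        "SELECT response_body FROM idempotency_keys WHERE key = %s", (key,)
    ).fetchone()[0]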
Most retry-safe APIs in production use one of these patterns. Picking one and applying it consistently is more important than the specific choice.
Backoff matters
The retry timing affects behavior under load. Three patterns:
- No backoff: hammer immediately. Recovery is fast when the failure was momentary, catastrophic when the failure is sustained (DDoS yourself).
- Linear backoff: 1s, 2s, 3s. Better than no backoff, still bad under sustained failure.
- Exponential with jitter: 100ms, 200ms ± noise, 400ms ± noise. Standard, well-behaved.
The jitter is critical. Without it, all clients retry at the same instant after a brief outage, producing a thundering herd that can prolong the outage.
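The earlier snippet uses a small additive jitter; "full jitter" (randomize over the whole backoff window, the approach popularized by AWS's analysis of backoff strategies) decorrelates clients even harder. A sketch:

import random

def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: pick uniformly from zero up to the exponential ceiling."""
    return random.uniform(0, min(cap, base * 2 ** attempt))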
Limits and circuit breakers
No retry policy should be unlimited. Two limits to enforce:
- Per-operation max retries: 3-5 attempts. After that, fail and let the calling code decide.
- Time budget: a request that has been retrying for 30 seconds should give up. The user is waiting; surface the failure.
For sustained failures, a circuit breaker pattern (stop retrying entirely for some window) prevents the application from hammering a dead database.
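A minimal breaker sketch; a real implementation (or a library) adds a half-open state that admits one probe at a time rather than reopening the floodgates after the cooldown:

import time

class CircuitBreaker:
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold  # consecutive failures before opening
        self.cooldown = cooldown    # seconds to stay open
        self.failures = 0
        self.opened_at = 0.0

    def allow(self) -> bool:
        if self.failures < self.threshold:
            return True
        return time.monotonic() - self.opened_at > self.cooldown

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()  # (re)open the breaker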
What I see go wrong most often
The most common pattern I have to fix in code reviews:
# WRONG: catches all exceptions, retries forever
while True:
    try:
        result = db.execute(...)
        break
    except Exception:
        time.sleep(1)
This catches things it should not catch (constraint violations, syntax errors), retries with no backoff, has no upper bound, and assumes the operation is idempotent.
The fix is structural — categorize errors, use exponential backoff, cap the attempts, ensure idempotency. The change is usually 20 lines of code that prevent class-action-lawsuit-shaped bugs.
A pragmatic policy
For most application code:
- Retry connection errors, deadlocks, serialization failures.
- Use exponential backoff with jitter, max 3-5 attempts.
- Ensure operations are idempotent (UUID, conditional update, or idempotency table).
- Do not retry constraint violations, syntax errors, permission errors.
- Do not retry statement timeouts at the data layer; surface them.
- Circuit-break under sustained failure.
This covers 95% of real-world cases. The remaining 5% are special: long-running batch jobs, background workers, infrastructure code. They need their own retry policy, but the principles are the same.
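Put together, the policy is one loop. A sketch reusing the classify, backoff_delay, and CircuitBreaker sketches above, with psycopg 3's base exception (psycopg.Error, which carries .sqlstate); none of this is a library API:

import time
import psycopg

def run_with_policy(operation, breaker, max_retries=5, budget=30.0):
    start = time.monotonic()
    for attempt in range(max_retries):
        if not breaker.allow():
            raise RuntimeError("circuit open: backing off the database")
        try:
            result = operation()
            breaker.record(ok=True)
            return result
        except psycopg.Error as exc:
            breaker.record(ok=False)
            out_of_budget = time.monotonic() - start > budget
            if (classify(exc.sqlstate) != "retry"
                    or out_of_budget
                    or attempt == max_retries - 1):
                raise
            time.sleep(backoff_delay(attempt))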
Retry logic is one of those areas where the cost of getting it right is small and the cost of getting it wrong is large. Worth the discipline.