Batch Jobs Against Postgres: Production Traffic With Worse Manners

A batch job that worked fine in dev can saturate production. The fixes are not exotic — chunk size, lock duration, retry budget — but most teams skip them.

A batch job is just production traffic that arrives in a less-friendly shape. Instead of 1,000 small transactions over an hour, it is one big operation hitting the database all at once. The dev environment runs it in 30 seconds and looks fine. Production runs it during peak hours and locks tables for 10 minutes.

The fixes are not complicated. They just require the discipline to apply them.

The four common patterns that fail

1. The single big transaction.

BEGIN;
UPDATE users SET status = 'archived' WHERE last_login < now() - interval '2 years';
COMMIT;

For 100,000 users, this is a single transaction that holds locks on 100,000 rows for the duration. Other queries that touch the users table queue up.

Fix: chunk the work.

-- Requires PostgreSQL 11+: a procedure can COMMIT between chunks; a DO block cannot.
CREATE PROCEDURE archive_inactive_users()
LANGUAGE plpgsql AS $$
DECLARE
  rows_updated INTEGER;
BEGIN
  LOOP
    UPDATE users
    SET status = 'archived'
    WHERE id IN (
      SELECT id FROM users
      WHERE last_login < now() - interval '2 years'
        AND status != 'archived'
      LIMIT 1000
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;  -- release this chunk's locks before starting the next
    PERFORM pg_sleep(0.1);  -- give vacuum time to keep up
  END LOOP;
END $$;

CALL archive_inactive_users();

Each iteration commits as its own small transaction, so locks are held briefly, vacuum can keep up with the bloat, and other traffic flows between chunks. A plain DO block would not help here: it runs as a single transaction and cannot COMMIT mid-loop, which is why the example uses a procedure.
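The same loop can also be driven from application code, which works on any PostgreSQL version and keeps the pacing logic out of the database. A minimal sketch, assuming psycopg2 and the users table from above; the connection string is a placeholder:

import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True  # each execute() commits as its own transaction

with conn.cursor() as cur:
    while True:
        cur.execute("""
            UPDATE users
            SET status = 'archived'
            WHERE id IN (
                SELECT id FROM users
                WHERE last_login < now() - interval '2 years'
                  AND status != 'archived'
                LIMIT 1000
            )
        """)
        if cur.rowcount == 0:
            break  # nothing left to archive
        time.sleep(0.1)  # brief pause between chunks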

2. The unindexed scan.

A batch query that does a full table scan because of an unindexed predicate. On a 50M-row table, that scan dominates the job's runtime.

Fix: index the predicate, or rewrite the query to use existing indexes.

-- Slow: unindexed predicate
UPDATE orders SET status = 'expired' WHERE status = 'pending' AND created_at < now() - interval '30 days';

-- Faster: partial index on the predicate
CREATE INDEX CONCURRENTLY orders_pending_created
  ON orders (created_at)
  WHERE status = 'pending';

The partial index covers exactly the rows the batch needs.
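Before running the batch, it is worth confirming the planner actually picks the index. A quick check, again assuming psycopg2; the table and index names are from the example above:

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    # Plain EXPLAIN plans the statement without executing it
    cur.execute("""
        EXPLAIN
        UPDATE orders SET status = 'expired'
        WHERE status = 'pending' AND created_at < now() - interval '30 days'
    """)
    for (line,) in cur.fetchall():
        print(line)  # expect an index scan on orders_pending_created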

3. The cascading FK update.

Deleting a customer that has 10,000 orders triggers cascading deletes through orders, order_items, payments, etc. A single DELETE turns into thousands of operations under the hood.

Fix: batch the cascading delete manually.

-- Delete order_items first, in chunks. Select the item rows directly,
-- so each pass makes progress even when the customer has more than
-- 1,000 orders.
DELETE FROM order_items
WHERE id IN (
  SELECT oi.id
  FROM order_items oi
  JOIN orders o ON o.id = oi.order_id
  WHERE o.customer_id = $1
  LIMIT 1000
);
-- Repeat until no rows deleted

-- Then orders
DELETE FROM orders WHERE customer_id = $1;

-- Finally the customer
DELETE FROM customers WHERE id = $1;

More code, much less lock pressure on each step.
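The "repeat until no rows deleted" part is easiest to drive from application code. A sketch, assuming psycopg2 and the schema above; the customer id is a placeholder:

import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True

customer_id = 42  # placeholder

with conn.cursor() as cur:
    # Chunked delete of the customer's order_items
    while True:
        cur.execute("""
            DELETE FROM order_items
            WHERE id IN (
                SELECT oi.id
                FROM order_items oi
                JOIN orders o ON o.id = oi.order_id
                WHERE o.customer_id = %s
                LIMIT 1000
            )
        """, (customer_id,))
        if cur.rowcount == 0:
            break
        time.sleep(0.1)  # let other traffic through between chunks

    # Then the orders, then the customer itself
    cur.execute("DELETE FROM orders WHERE customer_id = %s", (customer_id,))
    cur.execute("DELETE FROM customers WHERE id = %s", (customer_id,))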

4. The job that doesn't survive its own failure.

A batch job that processes 100,000 records crashes after 50,000. Restarting begins at record one and reprocesses the first 50,000, either duplicating work or failing on already-processed records.

Fix: idempotent batches with progress tracking.

-- Track what's been processed
UPDATE users SET processed_at = now()
WHERE id IN (
  SELECT id FROM users
  WHERE processed_at IS NULL
  ORDER BY id
  LIMIT 1000
);

The predicate processed_at IS NULL ensures only unprocessed rows are touched. A crash mid-batch resumes from where it left off.
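Driven from application code, the same idea gives a loop that is safe to kill and restart at any point. A sketch, assuming psycopg2 and the processed_at column above:

import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True  # each chunk commits on its own

with conn.cursor() as cur:
    while True:
        # RETURNING makes progress visible and hands rows to downstream steps
        cur.execute("""
            UPDATE users SET processed_at = now()
            WHERE id IN (
                SELECT id FROM users
                WHERE processed_at IS NULL
                ORDER BY id
                LIMIT 1000
            )
            RETURNING id
        """)
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            break
        print(f"processed {len(ids)} rows, up to id {ids[-1]}")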

Lock timing

For batch jobs that take exclusive locks (DDL, schema changes, large updates), the lock_timeout setting prevents the job from waiting forever and blocking everything behind it:

SET lock_timeout = '5s';
ALTER TABLE users ...;

If the lock cannot be acquired in 5 seconds, the job aborts. This is much better than waiting an hour and blocking dozens of queries in the queue.

For batch retries, combine with statement_timeout:

SET lock_timeout = '5s';
SET statement_timeout = '60s';

The statement either runs in a minute or fails fast.
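Wrapping that in a bounded retry loop gives a batch step that fails fast without giving up on the first contended lock. A sketch, assuming psycopg2; the retry count and backoff are illustrative:

import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True

def run_with_timeouts(sql, retries=5):
    with conn.cursor() as cur:
        cur.execute("SET lock_timeout = '5s'")
        cur.execute("SET statement_timeout = '60s'")
        for attempt in range(retries):
            try:
                cur.execute(sql)
                return
            except psycopg2.errors.LockNotAvailable:
                # lock_timeout fired; back off and try again
                time.sleep(2 ** attempt)
    raise RuntimeError(f"gave up after {retries} attempts")

# e.g. run_with_timeouts("ALTER TABLE users ...")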

Throttling

Most batch jobs have soft deadlines. "Archive 100,000 inactive users by tomorrow morning" — that is 24 hours of budget. Spreading the work over 24 hours costs the application much less than running it in 30 minutes.

The pattern:

import time

for batch in chunks(records, 1000):  # chunks() and process_batch() are app-specific
    process_batch(batch)
    time.sleep(2)  # 2-second pause between batches

During the sleep, application traffic gets full database resources. The batch only consumes resources while it is actively processing.

For more sophisticated throttling, look at the database's load (for example, the count of active backends in pg_stat_activity) and adjust the sleep duration accordingly: the batch slows down during peak and speeds up off-peak, as in the sketch below.
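One way to implement that, using the active-backend count as the load signal; the base and ceiling values are illustrative, assuming psycopg2:

import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True

def adaptive_sleep(cur, base=0.5, ceiling=30.0):
    # Count backends actively running queries, excluding ourselves
    cur.execute("""
        SELECT count(*) FROM pg_stat_activity
        WHERE state = 'active' AND pid != pg_backend_pid()
    """)
    active = cur.fetchone()[0]
    # The busier the database, the longer the pause, up to a ceiling
    time.sleep(min(base * (1 + active), ceiling))

# Call adaptive_sleep(cur) between chunks instead of a fixed sleep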

Read replicas for batch reads

If the batch is read-only and the staleness is acceptable, run it on a read replica:

# Connect to replica for read
data = read_replica.execute("SELECT ... heavy aggregate ...")

# Apply results to primary
primary.execute("INSERT INTO results ...", data)

The primary is unaffected by the batch's read load. The replica may fall behind on replay while the heavy read runs, but it catches up once the query finishes.

This is the standard pattern for analytical batch jobs.
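In concrete terms, with one connection per server; the hostnames, queries, and results columns are placeholders, assuming psycopg2:

import psycopg2

replica = psycopg2.connect("host=replica.internal dbname=app")  # placeholder hosts
primary = psycopg2.connect("host=primary.internal dbname=app")

with replica.cursor() as rcur, primary.cursor() as pcur:
    # The heavy aggregate runs against the replica only
    rcur.execute("""
        SELECT customer_id, sum(amount)
        FROM orders
        GROUP BY customer_id
    """)
    rows = rcur.fetchall()

    # The results land on the primary in one small write
    pcur.executemany(
        "INSERT INTO results (customer_id, total) VALUES (%s, %s)",
        rows,
    )
    primary.commit()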

Monitoring

For any batch job in production:

  • Log start, progress, completion, errors. With timestamps.
  • Track rows processed per batch and total.
  • Monitor database metrics during the batch: is the connection count climbing, are locks queuing?
  • Alert if the batch runs longer than expected (timeout).

Without monitoring, batch jobs are black boxes. The first sign of trouble is the database alerting on saturation.
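A skeleton that covers those points, reusing the processed_at loop from earlier; the logger setup and runtime budget are illustrative:

import logging
import time
import psycopg2

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("batch")

MAX_RUNTIME = 4 * 3600  # alert threshold: 4 hours

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True

start = time.monotonic()
total = 0
log.info("batch starting")
with conn.cursor() as cur:
    while True:
        cur.execute("""
            UPDATE users SET processed_at = now()
            WHERE id IN (
                SELECT id FROM users
                WHERE processed_at IS NULL
                ORDER BY id LIMIT 1000
            )
        """)
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        log.info("chunk done: %d rows (%d total)", cur.rowcount, total)
        if time.monotonic() - start > MAX_RUNTIME:
            log.error("exceeded %ss budget, aborting", MAX_RUNTIME)
            break
        time.sleep(0.5)
log.info("batch finished: %d rows in %.0fs", total, time.monotonic() - start)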

A real story

A team I worked with had a daily batch that recalculated customer lifetime value for 5 million customers. The job was a single SQL statement:

UPDATE customers
SET lifetime_value = (
  SELECT sum(amount) FROM orders WHERE customer_id = customers.id
);

This took 4 hours, locked the customers table for the duration, and caused user-facing latency spikes during the run.

Fix:

  1. Materialized view for the aggregate, refreshed in chunks.
  2. Batch update from the materialized view, 10,000 customers per chunk, 1-second pauses.
  3. Run during off-peak window.

New runtime: 90 minutes. No user-facing impact.

The code change was 30 lines. The runtime improvement hardly mattered (the batch had a 24-hour budget); eliminating the user-facing impact was the whole win.
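Roughly what the fix looked like in code. The view name is reconstructed, and the id-range chunking assumes a dense integer primary key, with psycopg2:

import time
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder DSN
conn.autocommit = True

with conn.cursor() as cur:
    # The view precomputes the aggregate off the hot path:
    # CREATE MATERIALIZED VIEW customer_ltv AS
    #   SELECT customer_id, sum(amount) AS lifetime_value
    #   FROM orders GROUP BY customer_id;
    cur.execute("REFRESH MATERIALIZED VIEW customer_ltv")

    cur.execute("SELECT max(id) FROM customers")
    max_id = cur.fetchone()[0] or 0

    # Apply the precomputed values 10,000 customers at a time
    for lo in range(0, max_id + 1, 10000):
        cur.execute("""
            UPDATE customers c
            SET lifetime_value = v.lifetime_value
            FROM customer_ltv v
            WHERE v.customer_id = c.id
              AND c.id >= %s AND c.id < %s
        """, (lo, lo + 10000))
        time.sleep(1)  # 1-second pause between chunks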

What I commit to for any batch job

  1. Chunked into small transactions (typically 1k-10k rows per chunk).
  2. Indexed predicate so the chunk query is fast.
  3. Idempotent — safe to re-run.
  4. Throttled with sleeps between chunks.
  5. Lock and statement timeouts set.
  6. Monitored from start to finish.
  7. Scheduled off-peak unless there is a reason to do otherwise.

This is more disciplined than the typical batch job. It is also the difference between batch jobs that go unnoticed and batch jobs that page someone.