
Postgres as a Job Queue: How SKIP LOCKED Changed the Game

Using a Postgres table as a job queue used to be a recipe for contention. SKIP LOCKED makes it tractable. Here is the pattern that actually scales.

The old advice was that Postgres could not be a real job queue: workers would contend on the same rows, lock waits would pile up, and throughput would suffer. Use SQS or RabbitMQ instead.

That advice is outdated. Postgres 9.5 added SKIP LOCKED, which lets a plain table behave as a competing-consumer queue with surprisingly good throughput. The pattern works in production at meaningful scale. Most applications I have seen do not need a dedicated queue service.

Here is the pattern.

The table

CREATE TABLE jobs (
  id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  attempts INTEGER NOT NULL DEFAULT 0,
  last_error TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  CHECK (status IN ('pending','running','completed','failed'))
);

-- Partial index for the only query that matters at scale
CREATE INDEX jobs_pending_scheduled
  ON jobs (scheduled_at)
  WHERE status = 'pending';

The partial index keeps the dequeue scan cheap: only pending rows are in the index, so completed and failed rows are invisible to the hot-path query.
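Enqueueing is just an INSERT; every column except payload has a default. A minimal example (the payload shape here is illustrative):

-- Run as soon as a worker is free
INSERT INTO jobs (payload)
VALUES ('{"type": "send_email", "user_id": 42}');

-- Run no earlier than one hour from now
INSERT INTO jobs (payload, scheduled_at)
VALUES ('{"type": "send_digest", "user_id": 42}', now() + interval '1 hour');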

The dequeue query

WITH next_job AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
    AND scheduled_at <= now()
  ORDER BY scheduled_at
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
UPDATE jobs
SET status = 'running', attempts = attempts + 1, scheduled_at = now()
FROM next_job
WHERE jobs.id = next_job.id
RETURNING jobs.id, jobs.payload;

The magic is FOR UPDATE SKIP LOCKED. Without it, two workers competing for the same row would queue up — one would wait for the other to commit. With it, the second worker sees "this row is locked" and skips to the next. Each worker grabs a different job; no contention.

The WITH ... AS (SELECT ... FOR UPDATE SKIP LOCKED LIMIT 1) followed by UPDATE claims the job and marks it running in a single atomic statement. Stamping scheduled_at = now() at claim time also records when the job was picked up; the stale-job sweep later in this post relies on that.
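If per-job round trips become the bottleneck, the same statement claims a batch: bump the LIMIT and return every claimed row. A sketch (10 is an arbitrary batch size):

WITH next_jobs AS (
  SELECT id
  FROM jobs
  WHERE status = 'pending'
    AND scheduled_at <= now()
  ORDER BY scheduled_at
  FOR UPDATE SKIP LOCKED
  LIMIT 10
)
UPDATE jobs
SET status = 'running', attempts = attempts + 1, scheduled_at = now()
FROM next_jobs
WHERE jobs.id = next_jobs.id
RETURNING jobs.id, jobs.payload;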

The completion query

UPDATE jobs
SET status = 'completed'
WHERE id = $1;

Worker finishes the job, marks it completed. Simple.

For failures with retry:

UPDATE jobs
SET status = CASE
      WHEN attempts >= 5 THEN 'failed'
      ELSE 'pending'
    END,
  scheduled_at = now() + (interval '1 minute' * power(2, attempts)),
  last_error = $2
WHERE id = $1;

Failed jobs go back to pending with exponential backoff; because attempts is incremented at dequeue time, a job that fails its fifth attempt is marked permanently failed. (The scheduled_at update on the failed branch is harmless: the dequeue query only looks at pending rows.)
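To sanity-check the schedule, you can evaluate the backoff expression directly. The four retries land roughly 2, 4, 8, and 16 minutes after their respective failures:

SELECT n AS attempts,
       interval '1 minute' * power(2, n) AS backoff
FROM generate_series(1, 4) AS n;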

Throughput

On modest hardware (4-core RDS instance), I have measured this pattern at 2,000-5,000 jobs/second across 20 workers. That is enough for most applications. For higher throughput, the bottleneck is usually the application logic doing the actual job, not the queue dequeue rate.

For reference, dedicated queue services (SQS, Kafka) can push much higher throughput, but most applications never need it.

Cleanup is essential

Over time, completed and failed jobs accumulate. The partial index keeps the queue fast, but the table grows. Two options:

Periodic cleanup:

DELETE FROM jobs
WHERE status IN ('completed', 'failed')
  AND created_at < now() - interval '30 days';

Run this nightly. If the table is large and the cleanup is too slow, do it in batches:

DELETE FROM jobs
WHERE id IN (
  SELECT id FROM jobs
  WHERE status IN ('completed', 'failed')
    AND created_at < now() - interval '30 days'
  LIMIT 10000
);

Loop until no rows are deleted.
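One way to drive that loop server-side is a procedure (Postgres 11+), which can commit between batches so each delete stays a short transaction. A sketch; prune_jobs is a name invented here, and the batch size and retention window are tunables:

CREATE OR REPLACE PROCEDURE prune_jobs()
LANGUAGE plpgsql
AS $$
DECLARE
  deleted integer;
BEGIN
  LOOP
    DELETE FROM jobs
    WHERE id IN (
      SELECT id FROM jobs
      WHERE status IN ('completed', 'failed')
        AND created_at < now() - interval '30 days'
      LIMIT 10000
    );
    GET DIAGNOSTICS deleted = ROW_COUNT;
    EXIT WHEN deleted = 0;
    COMMIT;  -- end the batch's transaction so locks and bloat stay bounded
  END LOOP;
END;
$$;

CALL prune_jobs();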

Partitioning by date:

For very high-volume queues, partition by week or month. Old partitions can be dropped instantly with DROP TABLE (or detached with ALTER TABLE ... DETACH PARTITION); no DELETE scan needed.
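A sketch of the partitioned layout, with illustrative week boundaries. Two constraints to know about: the partition key must be part of the primary key, and identity columns on partitioned tables require a recent Postgres, so BIGSERIAL keeps this portable:

CREATE TABLE jobs (
  id BIGSERIAL,
  payload JSONB NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  scheduled_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  attempts INTEGER NOT NULL DEFAULT 0,
  last_error TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  PRIMARY KEY (id, created_at)   -- must include the partition key
) PARTITION BY RANGE (created_at);

CREATE TABLE jobs_2024_w01 PARTITION OF jobs
  FOR VALUES FROM ('2024-01-01') TO ('2024-01-08');

-- Retiring a week of history is instant: no row-by-row DELETE
DROP TABLE jobs_2024_w01;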

The visibility timeout question

Dedicated queues like SQS have a visibility timeout — if a worker grabs a message and crashes, the message becomes visible again after some time. In our pattern, a worker that grabs a job and crashes leaves the job in running state forever.

The fix is a stale-job cleanup:

UPDATE jobs
SET status = 'pending', attempts = attempts + 1
WHERE status = 'running'
  AND scheduled_at < now() - interval '5 minutes';

Run this periodically (every minute is fine). Any job that has been "running" for more than 5 minutes without completing is presumed dead and re-queued. This works because the dequeue query stamps scheduled_at with the claim time.

For variable-duration jobs, the timeout has to cover the longest legitimate runtime. Long-running jobs need a heartbeat instead: while working, they periodically extend their own lease so the stale-job sweep does not reclaim them.
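A minimal heartbeat, issued by the worker every minute or so while it holds the job ($1 is the job id):

-- Extend the lease while still working
UPDATE jobs
SET scheduled_at = now() + interval '5 minutes'
WHERE id = $1
  AND status = 'running';

The status = 'running' guard matters: if the sweep has already re-queued the job, a late heartbeat should not push its retry further into the future.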

Priority and ordering

For priority queues, add a priority INTEGER column and order by it:

-- Higher priority first
ORDER BY priority DESC, scheduled_at
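In full, the priority variant is one new column plus a rebuilt partial index so the dequeue scan stays an index walk. A sketch:

ALTER TABLE jobs ADD COLUMN priority INTEGER NOT NULL DEFAULT 0;

-- The partial index must match the new sort order
DROP INDEX jobs_pending_scheduled;
CREATE INDEX jobs_pending_priority
  ON jobs (priority DESC, scheduled_at)
  WHERE status = 'pending';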

For delayed jobs ("run this in 1 hour"), set scheduled_at to the future time. The dequeue query already filters scheduled_at <= now().

When to reach for a dedicated queue

The Postgres-as-queue pattern is great for most applications. It stops being the right choice when:

  • Throughput exceeds a few thousand jobs per second sustained.
  • The queue logic is more complex than priority + scheduled_at (e.g., topic-based fanout, complex routing).
  • The database is already under pressure and you do not want queue traffic adding to it.
  • Your team genuinely benefits from the operational ergonomics of a dedicated queue (DLQs, message TTLs, web UIs).

For those, SQS, RabbitMQ, or Kafka. For everything else, Postgres + SKIP LOCKED is fine and one fewer system to operate.

What I recommend

For a typical web application that needs background jobs:

  1. Start with Postgres + SKIP LOCKED.
  2. Use the table schema above.
  3. Add stale-job cleanup.
  4. Add periodic deletion of completed/failed jobs.
  5. Monitor the size of the pending-jobs index.
  6. Reach for dedicated queue only when you have a specific reason.

The pattern is well understood and well documented in Postgres-backed job libraries like GoodJob (Rails), Oban (Elixir), and pg-boss (Node), with equivalents in other ecosystems. Adopting one of them adds a dependency, not a new piece of infrastructure; it's worth defaulting to.