Skip to content

[PG 9.5+] Feature Request: Use FOR UPDATE SKIP LOCKED for Job Acquisition #811

@luke-stdev001

Description

@luke-stdev001

Is your feature request related to a problem?

Yes, the current queue_job implementation uses PostgreSQL's LISTEN/NOTIFY mechanism for job coordination, which unfortunately has scaling issues:

  1. Lock Contention at Scale: LISTEN/NOTIFY performs well under low to moderate load, but doesn't scale well with a large number of listeners. The mechanism applies database-wide locks that become a bottleneck with many workers.

When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (or to be pedantic, a global lock on all databases within the postgres instance - https://github.com/postgres/postgres/blob/a749c6f18fbacd05f432cd29f9e7294033bc666f/src/backend/commands/async.c#L956) during the commit phase of the transaction, effectively serializing all commits. Under many concurrent writers, this results in immense load and major downtime.

Reference:
https://www.recall.ai/blog/postgres-listen-notify-does-not-scale

  1. Race Conditions: The current implementation experiences race conditions where "two jobrunners end up in a race condition, competing to handle the same tasks listed in the queue job table" (see odoo.sh deployment issues).

  2. Concurrency Errors: Users report frequent TransactionRollbackError: could not serialize access due to concurrent update errors when multiple workers attempt to process jobs.

  3. Complex Workarounds Required: This scalability issue has necessitated complex workarounds like:

Describe the solution you'd like

Implement job acquisition using PostgreSQL's FOR UPDATE SKIP LOCKED clause, following the pattern successfully used by DBOS, Solid Queue, PG Boss, and other queue systems. The implementation should:

  1. Replace the core job acquisition query with something similar to:
UPDATE queue_job 
SET state = 'started',
    started_at = now(),
    worker_id = %(worker_id)s
WHERE id IN (
    SELECT id 
    FROM queue_job 
    WHERE state = 'pending'
    AND scheduled_at <= now()
    ORDER BY priority, created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
  1. Maintain NOTIFY for job insertion (optional) - NOTIFY can still be used to wake up idle workers when new jobs arrive, but not for coordination.

Describe alternatives you've considered

Hybrid approach: Use SKIP LOCKED for job acquisition but keep LISTEN/NOTIFY for real-time notifications. This provides the best of both worlds.

Additional context

Specific DBOS Implementation Details Providing An Example Of A Highly Scaleable Implementation Built On A Solid Foundation

Based on Qian Li's insights and the DBOS Transact implementation (https://github.com/dbos-inc/dbos-transact-py):

1. FOR UPDATE SKIP LOCKED Pattern

DBOS explains their approach in their blog post "Why Postgres is a Good Choice for Durable Workflow Execution" (https://www.dbos.dev/blog/why-postgres-durable-execution):

"Postgres provides a solution: locking clauses. Selecting rows in this way does two things. First, it locks the rows so that other workers cannot also select them. Second, it skips rows that are already locked, selecting not the N oldest enqueued workflows, but the N oldest enqueued workflows that are not already locked by another worker."

This allows "many workers to concurrently pull new workflows without contention. One worker selects the oldest N workflows and locks them, the second worker selects the next oldest N workflows and locks those, and so on."

2. DBOS Schema Design

According to their release notes, "DBOS merged dbos.workflow_inputs and dbos.workflow_queue tables into a unified dbos.workflow_status table" for better performance. This unified approach reduces joins and improves query performance when using FOR UPDATE SKIP LOCKED.

3. Performance Benefits

DBOS reports that their approach enables "a durable workflow system to process tens of thousands of workflows per second across thousands of workers" by greatly reducing contention.

Performance Evidence

  • QueueClassic benchmarks showed ~2.5x performance improvement when switching to SKIP LOCKED (from 929 jobs/sec to 2352 jobs/sec). Not an apples to apples comparison, but worth noting.
  • SKIP LOCKED allows transactions to claim jobs without ever blocking each other
  • Other battle-tested implementations such as Solid Queue, PG Boss, and DBOS all rely on this pattern

Implementation References

Suggested Migration Path

  1. Add a feature flag to enable SKIP LOCKED mode
  2. Implement the new acquisition query alongside the existing one
  3. Utilise existing exponential backoff with jitter feature for polling intervals
  4. Provide migration documentation for existing deployments

PostgreSQL Version Requirement

SKIP LOCKED is available in PostgreSQL 9.5+, which has been available since 2016. Given that PostgreSQL 9.4 reached EOL in 2020, this should not be a barrier.

Expected Benefits

  • 10-100x better scalability under high worker count. DBOS has run this on the MIT SuperCloud and tested this running across many 10's of thousands of cores: https://people.eecs.berkeley.edu/~matei/papers/2022/cidr_dbos.pdf
  • Elimination of race conditions and concurrent update errors
  • Simplified codebase removing the need for some more complex locking mechanisms
  • Better compatibility with cloud deployments and container orchestration
  • Support for true multi-node deployments without advisory lock complications

This change would bring OCA/queue in line with other battle-tested queue implementations and resolve some scalability bottlenecks.

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions