Is your feature request related to a problem?
Yes. The current queue_job implementation uses PostgreSQL's LISTEN/NOTIFY mechanism for job coordination, which has scaling issues:
- Lock Contention at Scale: LISTEN/NOTIFY performs well under low to moderate load, but does not scale to a large number of listeners. The mechanism applies database-wide locks that become a bottleneck with many workers.
When a NOTIFY is issued during a transaction, PostgreSQL acquires a global lock on the entire database (strictly speaking, a global lock spanning all databases within the Postgres instance; see https://github.com/postgres/postgres/blob/a749c6f18fbacd05f432cd29f9e7294033bc666f/src/backend/commands/async.c#L956) during the commit phase of the transaction, effectively serializing all commits. With many concurrent writers, this causes severe load and can lead to major downtime.
Reference:
https://www.recall.ai/blog/postgres-listen-notify-does-not-scale
- Race Conditions: The current implementation experiences race conditions where "two jobrunners end up in a race condition, competing to handle the same tasks listed in the queue job table" (see odoo.sh deployment issues).
- Concurrency Errors: Users report frequent `TransactionRollbackError: could not serialize access due to concurrent update` errors when multiple workers attempt to process jobs.
- Complex Workarounds Required: This scalability issue has necessitated complex workarounds like:
- Advisory locks for multi-node coordination (PR [13.0] Support multi-nodes with lock on jobrunner #256)
- Alternative implementations like queue_job_cron_jobrunner
- Limitations on multi-worker deployments
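For context, the advisory-lock workaround amounts to something like the following sketch (all names here are hypothetical; PR #256's actual implementation differs):

```python
# Sketch only: single-active-jobrunner election via a session-level
# advisory lock. JOBRUNNER_LOCK_KEY and try_become_leader are made-up
# names, not queue_job's API.
JOBRUNNER_LOCK_KEY = 977919  # arbitrary application-chosen bigint key
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%s)"

def try_become_leader(cursor):
    """Return True if this node acquired the jobrunner advisory lock.

    The lock is held for the lifetime of the session, so a crashed
    node releases it automatically when its connection drops.
    """
    cursor.execute(TRY_LOCK_SQL, (JOBRUNNER_LOCK_KEY,))
    return cursor.fetchone()[0]
```

This gives single-active-runner semantics across nodes, but it is coordination bolted on from the outside rather than contention-free job claiming, which is what SKIP LOCKED provides natively.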
Describe the solution you'd like
Implement job acquisition using PostgreSQL's FOR UPDATE SKIP LOCKED clause, following the pattern successfully used by DBOS, Solid Queue, PG Boss, and other queue systems. The implementation should:
- Replace the core job acquisition query with something similar to:

```sql
UPDATE queue_job
SET state = 'started',
    started_at = now(),
    worker_id = %(worker_id)s
WHERE id IN (
    SELECT id
    FROM queue_job
    WHERE state = 'pending'
      AND scheduled_at <= now()
    ORDER BY priority, created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

- Maintain NOTIFY for job insertion (optional): NOTIFY can still be used to wake up idle workers when new jobs arrive, but not for coordination.
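On the Python side, the acquisition could be wrapped roughly like this (a sketch only; `acquire_jobs` and its parameters are hypothetical names, not part of the current queue_job API, and the cursor would come from the jobrunner's PostgreSQL connection, e.g. psycopg2):

```python
# Hedged sketch of batch job acquisition with FOR UPDATE SKIP LOCKED.
ACQUIRE_SQL = """
UPDATE queue_job
SET state = 'started',
    started_at = now(),
    worker_id = %(worker_id)s
WHERE id IN (
    SELECT id
    FROM queue_job
    WHERE state = 'pending'
      AND scheduled_at <= now()
    ORDER BY priority, created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""

def acquire_jobs(cursor, worker_id, batch_size=10):
    """Claim up to batch_size pending jobs and return their ids.

    Thanks to SKIP LOCKED, concurrent workers running this statement
    never block one another: each claims a disjoint set of rows.
    """
    cursor.execute(ACQUIRE_SQL,
                   {"worker_id": worker_id, "batch_size": batch_size})
    return [row[0] for row in cursor.fetchall()]
```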
Describe alternatives you've considered
Hybrid approach: Use SKIP LOCKED for job acquisition but keep LISTEN/NOTIFY for real-time notifications. This provides the best of both worlds.
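A sketch of what that hybrid loop could look like (names and structure are illustrative, not queue_job's actual API; in a real worker, `wait_for_notify` would select() on the LISTEN connection and `acquire_batch` would run the SKIP LOCKED query):

```python
def run_worker(acquire_batch, process_job, wait_for_notify,
               poll_interval=5.0, stop_after=None):
    """Hybrid worker loop: NOTIFY only wakes idle workers early;
    correctness comes solely from SKIP LOCKED acquisition.

    acquire_batch()          -> list of jobs claimed via SKIP LOCKED
    process_job(job)         -> execute one claimed job
    wait_for_notify(timeout) -> block until NOTIFY or timeout elapses
    stop_after               -> optional job budget (for bounded runs)
    """
    done = 0
    while stop_after is None or done < stop_after:
        jobs = acquire_batch()
        if not jobs:
            # Nothing claimable: wait for a NOTIFY wake-up, but fall
            # back to polling in case a notification is missed.
            wait_for_notify(poll_interval)
            continue
        for job in jobs:
            process_job(job)
            done += 1
    return done

# Tiny in-memory demo standing in for real database claiming.
pending = [1, 2, 3]
processed = []

def take_two():
    batch = pending[:2]
    del pending[:2]
    return batch

total = run_worker(take_two, processed.append,
                   lambda timeout: None, stop_after=3)
```

The key design point is that the NOTIFY path is purely an optimization: if a notification is lost, the poll interval guarantees jobs are still picked up.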
Additional context
Specific DBOS Implementation Details: An Example of a Highly Scalable Implementation Built on a Solid Foundation
Based on Qian Li's insights and the DBOS Transact implementation (https://github.com/dbos-inc/dbos-transact-py):
1. FOR UPDATE SKIP LOCKED Pattern
DBOS explains their approach in their blog post "Why Postgres is a Good Choice for Durable Workflow Execution" (https://www.dbos.dev/blog/why-postgres-durable-execution):
"Postgres provides a solution: locking clauses. Selecting rows in this way does two things. First, it locks the rows so that other workers cannot also select them. Second, it skips rows that are already locked, selecting not the N oldest enqueued workflows, but the N oldest enqueued workflows that are not already locked by another worker."
This allows "many workers to concurrently pull new workflows without contention. One worker selects the oldest N workflows and locks them, the second worker selects the next oldest N workflows and locks those, and so on."
2. DBOS Schema Design
According to their release notes, "DBOS merged dbos.workflow_inputs and dbos.workflow_queue tables into a unified dbos.workflow_status table" for better performance. This unified approach reduces joins and improves query performance when using FOR UPDATE SKIP LOCKED.
3. Performance Benefits
DBOS reports that their approach enables "a durable workflow system to process tens of thousands of workflows per second across thousands of workers" by greatly reducing contention.
Performance Evidence
- QueueClassic benchmarks showed a ~2.5x throughput improvement when switching to SKIP LOCKED (from 929 jobs/sec to 2352 jobs/sec). Not an apples-to-apples comparison, but worth noting.
- SKIP LOCKED allows transactions to claim jobs without ever blocking each other
- Other battle-tested implementations such as Solid Queue, PG Boss, and DBOS all rely on this pattern
Implementation References
- DBOS Transact: https://github.com/dbos-inc/dbos-transact-py - Successfully uses SKIP LOCKED for their durable queue implementation
- DBOS Blog on Postgres Queues: https://www.dbos.dev/blog/why-postgres-durable-execution
- PQ (Python Queue): Uses SKIP LOCKED and achieves ~1000 ops/second with the pattern
- Solid Queue: Rails' modern queue system built on SKIP LOCKED
- Inferable's detailed explanation: https://www.inferable.ai/blog/posts/postgres-skip-locked
Suggested Migration Path
- Add a feature flag to enable SKIP LOCKED mode
- Implement the new acquisition query alongside the existing one
- Utilise existing exponential backoff with jitter feature for polling intervals
- Provide migration documentation for existing deployments
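For illustration, the polling-interval step could reuse a full-jitter backoff along these lines (a generic sketch, not queue_job's actual backoff code):

```python
import random

def next_poll_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff for idle polling.

    attempt is the number of consecutive empty polls (0, 1, 2, ...).
    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    which spreads workers out and avoids synchronized polling bursts.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```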
PostgreSQL Version Requirement
SKIP LOCKED has been available since PostgreSQL 9.5, released in January 2016. Given that PostgreSQL 9.4 reached end of life in February 2020, this requirement should not be a barrier.
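A deployment could guard the new mode behind a simple version check (a sketch; `server_version_num` is a real PostgreSQL setting, while the helper name is made up):

```python
MIN_VERSION_NUM = 90500  # PostgreSQL 9.5.0, first release with SKIP LOCKED

def supports_skip_locked(server_version_num):
    """Check a value obtained via SHOW server_version_num,
    e.g. 90500 for 9.5.0 or 160002 for 16.2."""
    return int(server_version_num) >= MIN_VERSION_NUM
```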
Expected Benefits
- 10-100x better scalability under high worker counts. DBOS has run this on the MIT SuperCloud and tested it across many tens of thousands of cores: https://people.eecs.berkeley.edu/~matei/papers/2022/cidr_dbos.pdf
- Elimination of race conditions and concurrent update errors
- Simplified codebase removing the need for some more complex locking mechanisms
- Better compatibility with cloud deployments and container orchestration
- Support for true multi-node deployments without advisory lock complications
This change would bring OCA/queue in line with other battle-tested queue implementations and resolve some scalability bottlenecks.