Is your feature request related to a problem?
Yes. The current queue_job implementation uses PostgreSQL's LISTEN/NOTIFY mechanism for job coordination, which has scaling issues:
- Lock Contention at Scale: LISTEN/NOTIFY performs well under low to moderate load, but does not scale to a large number of listeners. The mechanism applies database-wide locks that become a bottleneck with many workers.
When a NOTIFY is issued during a transaction, PostgreSQL acquires a global lock on the entire database (strictly speaking, a global lock spanning all databases within the Postgres instance; see https://github.com/postgres/postgres/blob/a749c6f18fbacd05f432cd29f9e7294033bc666f/src/backend/commands/async.c#L956) during the commit phase of the transaction, effectively serializing all commits. With many concurrent writers, this causes severe load and can lead to major downtime.
Reference:
https://www.recall.ai/blog/postgres-listen-notify-does-not-scale
- Race Conditions: The current implementation experiences race conditions where "two jobrunners end up in a race condition, competing to handle the same tasks listed in the queue job table" (see odoo.sh deployment issues).
- Concurrency Errors: Users report frequent `TransactionRollbackError: could not serialize access due to concurrent update` errors when multiple workers attempt to process jobs.
- Complex Workarounds Required: This scalability issue has necessitated complex workarounds like:
- Advisory locks for multi-node coordination (PR [13.0] Support multi-nodes with lock on jobrunner #256)
- Alternative implementations like queue_job_cron_jobrunner
- Limitations on multi-worker deployments
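For context, the advisory-lock workaround amounts to something like the following sketch (all names here are hypothetical; PR #256's actual implementation differs):

```python
# Sketch only: single-active-jobrunner election via a session-level
# advisory lock. JOBRUNNER_LOCK_KEY and try_become_leader are made-up
# names, not queue_job's API.
JOBRUNNER_LOCK_KEY = 977919  # arbitrary application-chosen bigint key
TRY_LOCK_SQL = "SELECT pg_try_advisory_lock(%s)"

def try_become_leader(cursor):
    """Return True if this node acquired the jobrunner advisory lock.

    The lock is held for the lifetime of the session, so a crashed
    node releases it automatically when its connection drops.
    """
    cursor.execute(TRY_LOCK_SQL, (JOBRUNNER_LOCK_KEY,))
    return cursor.fetchone()[0]
```

This gives single-active-runner semantics across nodes, but it is coordination bolted on from the outside rather than contention-free job claiming, which is what SKIP LOCKED provides natively.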
Describe the solution you'd like
Implement job acquisition using PostgreSQL's FOR UPDATE SKIP LOCKED clause, following the pattern successfully used by DBOS, Solid Queue, PG Boss, and other queue systems. The implementation should:
- Replace the core job acquisition query with something similar to:

```sql
UPDATE queue_job
SET state = 'started',
    started_at = now(),
    worker_id = %(worker_id)s
WHERE id IN (
    SELECT id
    FROM queue_job
    WHERE state = 'pending'
      AND scheduled_at <= now()
    ORDER BY priority, created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING *;
```

- Maintain NOTIFY for job insertion (optional): NOTIFY can still be used to wake up idle workers when new jobs arrive, but not for coordination.
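On the Python side, the acquisition could be wrapped roughly like this (a sketch only; `acquire_jobs` and its parameters are hypothetical names, not part of the current queue_job API, and the cursor would come from the jobrunner's PostgreSQL connection, e.g. psycopg2):

```python
# Hedged sketch of batch job acquisition with FOR UPDATE SKIP LOCKED.
ACQUIRE_SQL = """
UPDATE queue_job
SET state = 'started',
    started_at = now(),
    worker_id = %(worker_id)s
WHERE id IN (
    SELECT id
    FROM queue_job
    WHERE state = 'pending'
      AND scheduled_at <= now()
    ORDER BY priority, created_at
    LIMIT %(batch_size)s
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
"""

def acquire_jobs(cursor, worker_id, batch_size=10):
    """Claim up to batch_size pending jobs and return their ids.

    Thanks to SKIP LOCKED, concurrent workers running this statement
    never block one another: each claims a disjoint set of rows.
    """
    cursor.execute(ACQUIRE_SQL,
                   {"worker_id": worker_id, "batch_size": batch_size})
    return [row[0] for row in cursor.fetchall()]
```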
Describe alternatives you've considered
Hybrid approach: Use SKIP LOCKED for job acquisition but keep LISTEN/NOTIFY for real-time notifications. This provides the best of both worlds.
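A sketch of what that hybrid loop could look like (names and structure are illustrative, not queue_job's actual API; in a real worker, `wait_for_notify` would select() on the LISTEN connection and `acquire_batch` would run the SKIP LOCKED query):

```python
def run_worker(acquire_batch, process_job, wait_for_notify,
               poll_interval=5.0, stop_after=None):
    """Hybrid worker loop: NOTIFY only wakes idle workers early;
    correctness comes solely from SKIP LOCKED acquisition.

    acquire_batch()          -> list of jobs claimed via SKIP LOCKED
    process_job(job)         -> execute one claimed job
    wait_for_notify(timeout) -> block until NOTIFY or timeout elapses
    stop_after               -> optional job budget (for bounded runs)
    """
    done = 0
    while stop_after is None or done < stop_after:
        jobs = acquire_batch()
        if not jobs:
            # Nothing claimable: wait for a NOTIFY wake-up, but fall
            # back to polling in case a notification is missed.
            wait_for_notify(poll_interval)
            continue
        for job in jobs:
            process_job(job)
            done += 1
    return done

# Tiny in-memory demo standing in for real database claiming.
pending = [1, 2, 3]
processed = []

def take_two():
    batch = pending[:2]
    del pending[:2]
    return batch

total = run_worker(take_two, processed.append,
                   lambda timeout: None, stop_after=3)
```

The key design point is that the NOTIFY path is purely an optimization: if a notification is lost, the poll interval guarantees jobs are still picked up.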
Additional context
Specific DBOS Implementation Details: An Example of a Highly Scalable Implementation Built on a Solid Foundation
Based on Qian Li's insights and the DBOS Transact implementation (https://github.com/dbos-inc/dbos-transact-py):
1. FOR UPDATE SKIP LOCKED Pattern
DBOS explains their approach in their blog post "Why Postgres is a Good Choice for Durable Workflow Execution" (https://www.dbos.dev/blog/why-postgres-durable-execution):
"Postgres provides a solution: locking clauses. Selecting rows in this way does two things. First, it locks the rows so that other workers cannot also select them. Second, it skips rows that are already locked, selecting not the N oldest enqueued workflows, but the N oldest enqueued workflows that are not already locked by another worker."
This allows "many workers to concurrently pull new workflows without contention. One worker selects the oldest N workflows and locks them, the second worker selects the next oldest N workflows and locks those, and so on."
2. DBOS Schema Design
According to their release notes, "DBOS merged dbos.workflow_inputs and dbos.workflow_queue tables into a unified dbos.workflow_status table" for better performance. This unified approach reduces joins and improves query performance when using FOR UPDATE SKIP LOCKED.
3. Performance Benefits
DBOS reports that their approach enables "a durable workflow system to process tens of thousands of workflows per second across thousands of workers" by greatly reducing contention.
Performance Evidence
- QueueClassic benchmarks showed a ~2.5x throughput improvement when switching to SKIP LOCKED (from 929 jobs/sec to 2352 jobs/sec). Not an apples-to-apples comparison, but worth noting.
- SKIP LOCKED allows transactions to claim jobs without ever blocking each other
- Other battle-tested implementations such as Solid Queue, PG Boss, and DBOS all rely on this pattern
Implementation References
- DBOS Transact: https://github.com/dbos-inc/dbos-transact-py - Successfully uses SKIP LOCKED for their durable queue implementation
- DBOS Blog on Postgres Queues: https://www.dbos.dev/blog/why-postgres-durable-execution
- PQ (Python Queue): Uses SKIP LOCKED and achieves ~1000 ops/second with the pattern
- Solid Queue: Rails' modern queue system built on SKIP LOCKED
- Inferable's detailed explanation: https://www.inferable.ai/blog/posts/postgres-skip-locked
Suggested Migration Path
- Add a feature flag to enable SKIP LOCKED mode
- Implement the new acquisition query alongside the existing one
- Utilise existing exponential backoff with jitter feature for polling intervals
- Provide migration documentation for existing deployments
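For illustration, the polling-interval step could reuse a full-jitter backoff along these lines (a generic sketch, not queue_job's actual backoff code):

```python
import random

def next_poll_delay(attempt, base=0.5, cap=30.0):
    """Full-jitter exponential backoff for idle polling.

    attempt is the number of consecutive empty polls (0, 1, 2, ...).
    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    which spreads workers out and avoids synchronized polling bursts.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```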
PostgreSQL Version Requirement
SKIP LOCKED has been available since PostgreSQL 9.5, released in January 2016. Given that PostgreSQL 9.4 reached end of life in February 2020, this requirement should not be a barrier.
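A deployment could guard the new mode behind a simple version check (a sketch; `server_version_num` is a real PostgreSQL setting, while the helper name is made up):

```python
MIN_VERSION_NUM = 90500  # PostgreSQL 9.5.0, first release with SKIP LOCKED

def supports_skip_locked(server_version_num):
    """Check a value obtained via SHOW server_version_num,
    e.g. 90500 for 9.5.0 or 160002 for 16.2."""
    return int(server_version_num) >= MIN_VERSION_NUM
```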
Expected Benefits
- 10-100x better scalability under high worker counts. DBOS has run this on the MIT SuperCloud and tested it across many tens of thousands of cores: https://people.eecs.berkeley.edu/~matei/papers/2022/cidr_dbos.pdf
- Elimination of race conditions and concurrent update errors
- Simplified codebase removing the need for some more complex locking mechanisms
- Better compatibility with cloud deployments and container orchestration
- Support for true multi-node deployments without advisory lock complications
This change would bring OCA/queue in line with other battle-tested queue implementations and resolve some scalability bottlenecks.