
NotFound(Source) error with S3 metastore and SQS ingestion after red/black deployment #5782


Open
stevehobbsdev opened this issue Jun 2, 2025 · 2 comments


stevehobbsdev commented Jun 2, 2025

Describe the bug
We're encountering a persistent issue where Quickwit fails with NotFound(Source) errors after several hours when using an S3-based metastore and SQS ingestion, especially after a red/black deployment in Kubernetes.

Our red/black deployments are configured in such a way that:

  • When we started seeing the issue, we had 2 instances of the metastore running (1 per cluster). However, we have since tried a configuration that only has 1 metastore instance when the clusters overlap but this did not improve the situation.
  • We only run one instance each of the indexer and janitor services across both clusters.

Unfortunately, we haven't been able to pinpoint what is causing this. After some time (sometimes a few hours), the indexer and control-plane start reporting:

failed to prune shards error=NotFound(Source { index_id: "logs", source_id: "sqs-logs-filesource" })
2025-05-28T22:30:15.186Z ERROR quickwit_actors::spawn_builder: actor-failure cause=source `logs/sqs-logs-filesource` not found exit_status=Failure(source `logs/sqs-logs-filesource` not found)
2025-05-28T22:30:15.186Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=SourceActor-winter-CheT exit_status=failure(cause=source `logs/sqs-logs-filesource` not found)
2025-05-28T22:30:15.186Z ERROR quickwit_actors::actor_context: exit activating-kill-switch actor=SourceActor-winter-CheT exit_status=Failure(source `logs/sqs-logs-filesource` not found)
2025-05-28T22:30:15.186Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=quickwit_indexing::actors::doc_processor::DocProcessor-empty-P75y exit_status=killed
2025-05-28T22:30:15.186Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Indexer-ancient-wcY5 exit_status=killed
  • All services (indexer, searcher, control-plane) complain that the source cannot be found.
  • As a result, Quickwit does not ingest the logs that are waiting in the SQS queue and S3 buckets.
  • However, the metastore file in S3 correctly lists both the index and the source.
  • Running quickwit CLI inside the pod can retrieve the source as expected.
  • Restarting pods does not resolve the issue.
  • Occasionally, re-uploading the S3 file and restarting services does fix the issue temporarily.

The issue appears to coincide with shard rebalancing:

indexer `quickwit-indexer-0` joined the cluster: rebalancing shards and rebuilding indexing plan
  • Only occurs with SQS ingestion. Kafka ingestion and ingest API (for the same index) work without issue.
  • Using a single S3 metastore, but red/black deployments may result in two metastores existing temporarily.
  • No S3 locking is observed to be in play.
  • Attempts to set polling_interval had no effect.
  • The issue seems tied to shard/source distribution handled by get_shards_for_source_mut, which differs in implementation between file-backed and Postgres metastores.

We've noticed that:

  • Manually downloading, modifying, and re-uploading the S3 metastore file sometimes resolves the issue temporarily.
  • Using the quickwit CLI inside the pod to check index/source visibility confirms that the metastore contents are intact.

Steps to reproduce (if applicable)

Unfortunately we don't know how to reproduce this reliably. Sometimes the issue fixes itself after we redeploy the cluster and then breaks again on its own within hours, or it breaks right after we do a red/black deployment.

Expected behavior
We expect the SQS configuration to continue working after deployments.

Configuration:
Please provide:

  1. Output of quickwit --version: Quickwit 0.8.0 (x86_64-unknown-linux-gnu 2025-04-23T14:04:34Z 3a070c8)
  2. The index_config.yaml:
doc_mapping:
  field_mappings:
  - fast: true
    fast_precision: nanoseconds
    input_formats:
    - iso8601
    - unix_timestamp
    name: timestamp
    output_format: unix_timestamp_secs
    type: datetime
  mode: dynamic
  timestamp_field: timestamp
index_id: logs
index_uri: s3://l0-<redacted>-use1-quickwit-indexes/indexes
indexing_settings:
  commit_timeout_secs: 10
  merge_policy:
    max_merge_factor: 12
    max_merge_ops: 3
    merge_factor: 10
    type: limit_merge
  resources:
    max_merge_write_throughput: 80mb
retention:
  period: 7 days
  schedule: '@daily'
version: 0.8

For reference we also raised the issue in the Quickwit community discord: https://discord.com/channels/908281611840282624/1377414921490272487

stevehobbsdev added the bug label Jun 2, 2025

poovamraj commented Jun 3, 2025

Ok found the issue on this!

Basically, we are removing the source's entry from the shards table in the metastore if its shard list is empty. This means that the next time the metastore comes back up, it is in an inconsistent state.

We have to fix this line to return an empty array instead of removing the source itself: https://github.com/quickwit-oss/quickwit/blob/main/quickwit/quickwit-metastore/src/metastore/file_backed/file_backed_index/serialize.rs#L91

Can anyone confirm whether this analysis is correct? We are able to reproduce the issue and fix it by adding the source back with an empty shard array.
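
To make the failure mode concrete, here is a minimal, self-contained Rust sketch. The type and function names are illustrative stand-ins rather than the actual code in serialize.rs; it only contrasts dropping sources with empty shard lists during serialization against keeping them as empty arrays:

use std::collections::HashMap;

// Illustrative stand-ins for the real Quickwit types; not the actual code.
type SourceId = String;
#[derive(Clone)]
struct Shard;

// Behavior as described above: sources whose shard list is empty are
// dropped from the serialized view, so they are missing after a reload.
fn serialize_shards_lossy(
    per_source_shards: &HashMap<SourceId, Vec<Shard>>,
) -> HashMap<SourceId, Vec<Shard>> {
    per_source_shards
        .iter()
        .filter(|(_, shards)| !shards.is_empty())
        .map(|(source_id, shards)| (source_id.clone(), shards.clone()))
        .collect()
}

// Proposed fix: keep every known source and serialize an empty array,
// so the reloaded metastore stays consistent with the index config.
fn serialize_shards_lossless(
    per_source_shards: &HashMap<SourceId, Vec<Shard>>,
) -> HashMap<SourceId, Vec<Shard>> {
    per_source_shards
        .iter()
        .map(|(source_id, shards)| (source_id.clone(), shards.clone()))
        .collect()
}

fn main() {
    let mut shards: HashMap<SourceId, Vec<Shard>> = HashMap::new();
    shards.insert("sqs-logs-filesource".to_string(), Vec::new());

    // Round-tripping through the lossy version loses the source entirely,
    // which matches the NotFound(Source) errors seen after a reload.
    assert!(!serialize_shards_lossy(&shards).contains_key("sqs-logs-filesource"));
    assert!(serialize_shards_lossless(&shards).contains_key("sqs-logs-filesource"));
}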

Also, to ensure recovery when we are loading shards, wouldn't it be better to add the source if it is not found?

@guilload @trinity-1686a @rdettai can anyone confirm if this is right analysis?


guilload commented Jun 3, 2025

that would be a question for @rdettai. Thanks for investigating.
