Currently, ingesters may end up accepting persist requests when their disk is full. As long as the writes still fit in the OS buffer, they may not return any error at all.

We need to poll-check disk usage and change Quickwit's behavior when it goes above a threshold.

The exact behavior is yet to be decided. The closest existing mechanism is probably decommissioning: close all shards and refuse to create new ones. In addition, it might not be possible to run indexing/merge pipelines, which could make the control plane's task considerably harder.
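A minimal sketch of what such a poll-check could look like, assuming a hypothetical `disk_usage_ratio` helper and `IngesterHandle` type (neither name exists in Quickwit today; a real implementation would query the filesystem, e.g. via `statvfs(2)`, and hook into the actual ingester state):

```rust
use std::path::{Path, PathBuf};
use std::time::Duration;

const DISK_USAGE_THRESHOLD: f64 = 0.90; // e.g. 90% full
const POLL_INTERVAL: Duration = Duration::from_secs(10);

/// Hypothetical helper: fraction of the filesystem holding `wal_dir` that is used.
/// In practice this would be backed by `statvfs(2)` or an equivalent syscall.
fn disk_usage_ratio(wal_dir: &Path) -> std::io::Result<f64> {
    let _ = wal_dir;
    Ok(0.0) // placeholder
}

/// Hypothetical handle onto the ingester, for illustration only.
struct IngesterHandle {
    wal_dir: PathBuf,
}

impl IngesterHandle {
    /// Decommission-like behavior: close all shards and refuse to open new ones.
    fn close_all_shards_and_reject_new_ones(&self) {}
    /// Resume normal operation once disk usage drops back under the threshold.
    fn resume_normal_operation(&self) {}
}

/// Background task: periodically check WAL disk usage and flip the ingester
/// into a decommissioned-like state when usage crosses the threshold.
async fn disk_watchdog(ingester: IngesterHandle) {
    loop {
        match disk_usage_ratio(&ingester.wal_dir) {
            Ok(ratio) if ratio >= DISK_USAGE_THRESHOLD => {
                ingester.close_all_shards_and_reject_new_ones();
            }
            Ok(_) => ingester.resume_normal_operation(),
            Err(err) => eprintln!("failed to check disk usage: {err}"),
        }
        tokio::time::sleep(POLL_INTERVAL).await;
    }
}
```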
@fulmicoton How did you identify that the main problem comes from records accumulating in the OS buffer? I thought the OS buffer would usually be quite small (a few MBs).

It seems to me that the problem might also come from the persist policy configured on mrecordlog. A full disk is only detected after the persist delay (5s), and when that happens, the error is bubbled up and converted to a persist failure here. The problem is that a transient error is returned to the user, but meanwhile the shard is closed, a new one is opened, and records are accepted again during the mrecordlog persist delay. I haven't managed to reproduce it yet, but does this seem like a plausible explanation to you? I sketched the suspected timing below.
EDIT: I tried to mimic the WAL disk being full using a small loop device mounted on wal/
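To make the suspected timing concrete, here is a minimal sketch (not mrecordlog's actual code) of a WAL that acknowledges appends into an in-memory buffer and only writes to disk on a timer; a full disk only surfaces as an error at the next flush, seconds after the records were acknowledged:

```rust
use std::fs::File;
use std::io::{self, Write};
use std::path::Path;
use std::time::{Duration, Instant};

struct BufferedWal {
    file: File,
    buffer: Vec<u8>,
    last_flush: Instant,
    flush_interval: Duration, // e.g. the 5s persist delay mentioned above
}

impl BufferedWal {
    fn create(path: &Path, flush_interval: Duration) -> io::Result<Self> {
        Ok(Self {
            file: File::create(path)?,
            buffer: Vec::new(),
            last_flush: Instant::now(),
            flush_interval,
        })
    }

    /// Appends a record. This usually returns Ok(()) immediately: the record
    /// only lives in `self.buffer` until the next flush, so a full disk is not
    /// detected here.
    fn append(&mut self, record: &[u8]) -> io::Result<()> {
        self.buffer.extend_from_slice(record);
        if self.last_flush.elapsed() >= self.flush_interval {
            self.flush()?;
        }
        Ok(())
    }

    /// Writes the buffered records to disk. This is where ENOSPC finally shows
    /// up, and where the error would bubble up as a persist failure.
    fn flush(&mut self) -> io::Result<()> {
        self.file.write_all(&self.buffer)?;
        self.file.sync_data()?;
        self.buffer.clear();
        self.last_flush = Instant::now();
        Ok(())
    }
}
```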
> How did you identify that the main problem comes from records accumulating in the OS buffer? I thought the OS buffer would usually be quite small (a few MBs).

Just a hypothesis to explain how we could accept messages and eventually lose them.

> The problem is that a transient error is returned to the user, but meanwhile the shard is closed, a new one is opened, and records are accepted again during the mrecordlog persist delay. I haven't managed to reproduce it yet, but does this seem like a plausible explanation to you?

Plausible, yes, but we still need to pin down the mechanism by which we sometimes end up accepting writes. Mezmo mentions they lost data.