ref(aci): use enqueue time to query Snuba in delayed workflow #93882


Open · wants to merge 3 commits into master

Conversation

cathteng (Member) commented:

For a slow alert condition (e.g. "> 10 events in the past 1 minute"), we always use the time the query is made as the point to start looking backwards from. This is not optimal when:

  • A task needs to be rerun
  • There is a backlog in the delayed workflow queue
  • Because we flush the buffer every minute, there is an edge case when the time period is 1 minute: events come in at 11:00:02am, are dispatched in a task at ~11:01:00am, and the query runs at 11:01:03am or later, so those events fall outside the window and aren't picked up in the result

A better solution would be to use the time the event is enqueued into the delayed workflow buffer!

However, we also batch queries such that all alerts that end up making the same queries are grouped together, and we run a single query for multiple groups. For example:

  • Two events for two different groups come in at different times within the same minute, and both are processed by alert A
  • We will make a single Snuba query for alert A covering both groups

In the above case, we need to decide which enqueue time to use. We should use the latest enqueue time across the list of groups, since it covers every group in the batched Snuba query.
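A minimal sketch of that selection rule, assuming a hypothetical EnqueuedGroup record; the names here are illustrative, not the PR's actual types:

from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class EnqueuedGroup:
    group_id: int
    enqueued_at: datetime | None  # None for buffer entries that predate this change

def query_end_time(groups: list[EnqueuedGroup]) -> datetime:
    # Use the latest enqueue time so the query window covers every group
    # in the batch; fall back to "now" when no timestamp is recorded.
    timestamps = [g.enqueued_at for g in groups if g.enqueued_at is not None]
    return max(timestamps, default=datetime.now(timezone.utc))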


sentry-io bot commented Jun 18, 2025

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: src/sentry/workflow_engine/processors/workflow.py

Function: evaluate_workflows_action_filters
Unhandled Issue: SoftTimeLimitExceeded: SoftTimeLimitExceeded() se...
Event Count: 3


github-actions bot added the Scope: Backend label Jun 18, 2025
cathteng requested a review from kcons June 18, 2025 22:49
cathteng marked this pull request as ready for review June 18, 2025 22:49
cathteng requested a review from a team as a code owner June 18, 2025 22:50
Use the latest timestamp for a set of group IDs with the same Snuba query.
We will query backwards in time from this point.
"""
if self.timestamp is None or (timestamp is not None and timestamp > self.timestamp):
Member commented:

if timestamp is not None:
    self.timestamp = timestamp if self.timestamp is None else max(timestamp, self.timestamp)

perhaps.



@dataclass
class TimeAndGroups:
Member commented:

Is BulkQueryParameters a more accurate name? Or perhaps GroupQueryParameters? There's a distinction between the unique queries and what we're associating them with that I'm not sure I'm capturing accurately, but TimeAndGroups strikes me as a bit too literal.

def dcg_to_timestamp(self) -> dict[int, datetime | None]:
    """
    Uses the latest timestamp each DataConditionGroup was enqueued with.
    All groups enqueued for a DataConditionGroup will have the same query, hence the same max timestamp.
Member commented:

I feel like I should know what this means, but I don't.
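For readers puzzling over the same docstring, here is one plausible reading of it as code, a sketch only; the (dcg_id, timestamp) input shape is assumed, not taken from the PR:

from datetime import datetime

def dcg_to_timestamp(items: list[tuple[int, datetime | None]]) -> dict[int, datetime | None]:
    # Map each DataConditionGroup id to the latest enqueue timestamp seen for
    # it; all groups enqueued under one DCG share a single query, so a single
    # max timestamp per DCG is enough.
    result: dict[int, datetime | None] = {}
    for dcg_id, ts in items:
        current = result.get(dcg_id)
        if current is None or (ts is not None and ts > current):
            result[dcg_id] = ts
    return result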

handler = unique_condition.handler()
group_ids = time_and_groups.group_ids
Member commented:

another option is to make groups be a dict[GroupId, datetime | None] and
do time = max((ts for ts in groups.values() if ts), default=current_time).
Stores a bit more data, but lets the summarizing happen where it is being forced, which has a certain appeal.
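A hedged sketch of that alternative shape, with the GroupId alias and current_time fallback assumed for illustration:

from datetime import datetime, timezone

GroupId = int  # assumed alias

def summarize_time(groups: dict[GroupId, datetime | None]) -> datetime:
    current_time = datetime.now(timezone.utc)
    # Keep a timestamp per group and summarize only at the point where a
    # single query time is actually needed.
    return max((ts for ts in groups.values() if ts), default=current_time)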

@@ -75,6 +77,7 @@ class DelayedWorkflowItem:
    delayed_conditions: list[DataCondition]
    event: GroupEvent
    source: WorkflowDataConditionGroupType
    timestamp: datetime
Member commented:
worth explaining what this timestamp is and what it should correspond to. Or, rename the field to make the comment pointless.

{
"event_id": self.event.event_id,
"occurrence_id": self.event.occurrence_id,
"timestamp": self.timestamp,
Member commented:

TIL we can dumps a datetime.
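For reference, stdlib json can't serialize a datetime on its own; a minimal sketch of what makes this work (sentry's json wrapper presumably supplies a default along these lines, which is the "TIL"):

import json
from datetime import datetime, timezone

payload = {"timestamp": datetime(2025, 6, 18, 22, 49, tzinfo=timezone.utc)}
# json.dumps raises TypeError on a datetime unless a default is provided;
# isoformat() strings round-trip cleanly through pydantic's datetime parsing.
print(json.dumps(payload, default=lambda o: o.isoformat()))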

@@ -79,6 +79,7 @@
class EventInstance(BaseModel):
    event_id: str
    occurrence_id: str | None = None
    timestamp: datetime | None = None
Member commented:
You should probably add a test that requires us to parse this value correctly from the expected format. I strongly suspect we don't.
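A sketch of such a test, assuming pydantic-style parsing of an ISO-8601 string; the test name and values are illustrative:

from datetime import datetime, timezone
from pydantic import BaseModel

class EventInstance(BaseModel):
    event_id: str
    occurrence_id: str | None = None
    timestamp: datetime | None = None

def test_event_instance_parses_timestamp() -> None:
    # pydantic parses ISO-8601 strings into timezone-aware datetimes.
    instance = EventInstance.parse_obj(
        {"event_id": "abc123", "timestamp": "2025-06-18T22:49:00+00:00"}
    )
    assert instance.timestamp == datetime(2025, 6, 18, 22, 49, tzinfo=timezone.utc)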

kcons (Member) left a comment:

Need test to verify we can parse it and pydantic won't freak out.


codecov bot commented Jun 18, 2025

Codecov Report

Attention: Patch coverage is 86.11111% with 10 lines in your changes missing coverage. Please review.

✅ All tests successful. No failed tests found.

Files with missing lines                                 Patch %   Lines
...try/workflow_engine/processors/delayed_workflow.py    65.51%    10 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #93882    +/-   ##
========================================
  Coverage   88.03%   88.04%            
========================================
  Files       10337    10337            
  Lines      596922   597052   +130     
  Branches    23196    23196            
========================================
+ Hits       525525   525647   +122     
- Misses      70938    70946     +8     
  Partials      459      459            

Labels: Scope: Backend (Automatically applied to PRs that change backend components)