-
Notifications
You must be signed in to change notification settings - Fork 515
[o365] Surface credential or subscription errors in Fleet status #14771
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[o365] Surface credential or subscription errors in Fleet status #14771
Conversation
|
Pinging @elastic/security-service-integrations (Team:Security-Service Integrations) |
andrewkroh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming you have a way to locally test the error case, perhaps we should commit the supporting test files (mocked API setup for elastic/stream and the _dev/test/system/test-bad-creds-config.yml file). However, we should skip the test using skip.link until there is a way to allow the test to run and assert that the Agent goes into a degraded state and produces an error document.
There's a risk that these files could become outdated since they aren't exercised automatically, but at least we capture how it was tested and have a way to replicate the service credential errors locally.
🚀 Benchmarks reportTo see the full report comment with |
Good idea. I added it. |
| }, | ||
| } | ||
| events_list.collate("events_per_content_type").as(events, | ||
| events.filter(e, e.?error.message.hasValue()).as(error_events, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| events.filter(e, e.?error.message.hasValue()).as(error_events, | |
| events.filter(e, has(e.?error.message)).as(error_events, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the advantage of the suggested change?
I went with the original because then its just using the optional functionality, not mixing with the different field presence macro.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The .? allocates the optional value, the has macro does not need to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. To avoid the optional value, it'd be has(e.error) && has(e.error.message), right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this kind of optimization is worth it, but I put it in an additional commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it can be has(e.?error.message).
5f7630d to
3a33fd4
Compare
andrewkroh
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I ran the failing test case. It surfaces the error.
Event
{
"@timestamp": "2025-08-04T10:36:59.276Z",
"agent": {
"ephemeral_id": "74451a3d-aa31-408d-b3e6-7213488a53d3",
"id": "2ae3f72c-87b7-4dbd-82ef-1a67b3fd161d",
"name": "elastic-agent-22243",
"type": "filebeat",
"version": "9.2.0"
},
"data_stream": {
"dataset": "o365.audit",
"namespace": "65280",
"type": "logs"
},
"ecs": {
"version": "8.11.0"
},
"elastic_agent": {
"id": "2ae3f72c-87b7-4dbd-82ef-1a67b3fd161d",
"snapshot": true,
"version": "9.2.0"
},
"error": {
"message": "Audit.TypeRequiringAdditionalPermissions: AF10001 The permission set (...) sent in the request does not include the expected permission."
},
"event": {
"agent_id_status": "verified",
"category": [
"web"
],
"dataset": "o365.audit",
"ingested": "2025-08-04T10:37:02Z",
"kind": "pipeline_error",
"original": "null",
"outcome": "success",
"type": [
"info"
]
},
"input": {
"type": "cel"
},
"tags": [
"preserve_original_event",
"forwarded",
"o365-cel"
]
}Log Message
{
"log.level": "error",
"@timestamp": "2025-08-04T10:36:59.276Z",
"message": "single event object returned by evaluation",
"component": {
"binary": "filebeat",
"dataset": "elastic_agent.filebeat",
"id": "cel-default",
"type": "cel"
},
"log": {
"source": "cel-default"
},
"log.origin": {
"file.line": 433,
"file.name": "cel/input.go",
"function": "github.com/elastic/beats/v7/x-pack/filebeat/input/cel.input.run.func1"
},
"id": "cel-o365.audit-f8938c7f-bf6b-42a1-a582-7ad971dc0f98",
"log.logger": "input.cel",
"service.name": "filebeat",
"input_source": "http://svc-o365-cel:8080",
"input_url": "http://svc-o365-cel:8080",
"error": {
"message": "Audit.TypeRequiringAdditionalPermissions: AF10001 The permission set (...) sent in the request does not include the expected permission."
},
"ecs.version": "1.6.0"
}| { | ||
| "events_per_content_type": [], | ||
| "events_per_content_type": [{ | ||
| "error": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the HTTP status code and status to the error.code and error.idfields, respectively. I've been finding myself using those fields lot when they are available to filter or group the single event object returned by evaluation log messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Done in a new commit.
If there's a permission problem with just the DLP.All content type (which requires an extra permission) then there will be one response involved, but otherwise it'll likely be one per content type. The messages were already combined. The new code de-duplicates the error.code and error.id values. If there's one value it'll be a scalar (likely), otherwise an array.
The output looks like this for the test:
{
...
"error": {
"code": "401",
"id": "401 Unauthorized",
"message": "Audit.TypeRequiringAdditionalPermissions: AF10001 The permission set (...) sent in the request does not include the expected permission."
},
...
}3a33fd4 to
811a328
Compare
7ea774f to
471b02e
Compare
Skipped until negative system tests are possible.
471b02e to
24df591
Compare
💚 Build Succeeded
History
|
|
|
Package o365 - 2.21.0 containing this change is available at https://epr.elastic.co/package/o365/2.21.0/ |
…stic#14771) Instead of ignoring errors and returning successes, if there are any errors while creating/checking subscriptions, return an error showing the problem, wait the interval time, and repeat everything next time. This is intended to avoid silencing errors such as AF10001 "The permission set () sent in the request does not include the expected permission.".




Proposed commit message
Notes
The
changelog.ymlentry assumes #14730 will be merged first.Checklist
changelog.ymlfile.How to test this PR locally
In the system test
data_stream/audit/_dev/test/system/test-cel-bad-creds-config.yml, comment out the skip, run it with--defer-cleanup 10mand inspect the error document produced.Related issues