Skip to content

Conversation

@chrisberkhout
Copy link
Contributor

@chrisberkhout chrisberkhout commented Aug 1, 2025

Proposed commit message

[o365] Surface credential or subscription errors in Fleet status

Instead of ignoring errors and returning successes, if there are any
errors while creating/checking subscriptions, return an error showing
the problem, wait the interval time, and repeat everything next time.

This is intended to avoid silencing errors such as AF10001 "The
permission set () sent in the request does not include the expected
permission.".

Notes

The changelog.yml entry assumes #14730 will be merged first.

Checklist

  • I have reviewed tips for building integrations and this pull request is aligned with them.
  • I have verified that all data streams collect metrics or logs.
  • I have added an entry to my package's changelog.yml file.
  • I have verified that Kibana version constraints are current according to guidelines.
  • I have verified that any added dashboard complies with Kibana's Dashboard good practices

How to test this PR locally

In the system test data_stream/audit/_dev/test/system/test-cel-bad-creds-config.yml, comment out the skip, run it with --defer-cleanup 10m and inspect the error document produced.

Related issues

@chrisberkhout chrisberkhout self-assigned this Aug 1, 2025
@chrisberkhout chrisberkhout requested a review from a team as a code owner August 1, 2025 14:48
@chrisberkhout chrisberkhout added enhancement New feature or request Integration:o365 Microsoft Office 365 Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations] labels Aug 1, 2025
@elasticmachine
Copy link

Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you have a way to locally test the error case, perhaps we should commit the supporting test files (mocked API setup for elastic/stream and the _dev/test/system/test-bad-creds-config.yml file). However, we should skip the test using skip.link until there is a way to allow the test to run and assert that the Agent goes into a degraded state and produces an error document.

There's a risk that these files could become outdated since they aren't exercised automatically, but at least we capture how it was tested and have a way to replicate the service credential errors locally.

@andrewkroh andrewkroh requested a review from a team August 1, 2025 15:10
@elastic-vault-github-plugin-prod

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

@chrisberkhout
Copy link
Contributor Author

Assuming you have a way to locally test the error case, perhaps we should commit the supporting test files (mocked API setup for elastic/stream and the _dev/test/system/test-bad-creds-config.yml file). However, we should skip the test using skip.link until there is a way to allow the test to run and assert that the Agent goes into a degraded state and produces an error document.

There's a risk that these files could become outdated since they aren't exercised automatically, but at least we capture how it was tested and have a way to replicate the service credential errors locally.

Good idea. I added it.

},
}
events_list.collate("events_per_content_type").as(events,
events.filter(e, e.?error.message.hasValue()).as(error_events,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
events.filter(e, e.?error.message.hasValue()).as(error_events,
events.filter(e, has(e.?error.message)).as(error_events,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the advantage of the suggested change?

I went with the original because then its just using the optional functionality, not mixing with the different field presence macro.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The .? allocates the optional value, the has macro does not need to do that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. To avoid the optional value, it'd be has(e.error) && has(e.error.message), right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this kind of optimization is worth it, but I put it in an additional commit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it can be has(e.?error.message).

@chrisberkhout chrisberkhout force-pushed the o365-error-fleet-status branch from 5f7630d to 3a33fd4 Compare August 4, 2025 07:05
Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I ran the failing test case. It surfaces the error.

Screenshot 2025-08-04 at 06 37 17
Event
{
  "@timestamp": "2025-08-04T10:36:59.276Z",
  "agent": {
    "ephemeral_id": "74451a3d-aa31-408d-b3e6-7213488a53d3",
    "id": "2ae3f72c-87b7-4dbd-82ef-1a67b3fd161d",
    "name": "elastic-agent-22243",
    "type": "filebeat",
    "version": "9.2.0"
  },
  "data_stream": {
    "dataset": "o365.audit",
    "namespace": "65280",
    "type": "logs"
  },
  "ecs": {
    "version": "8.11.0"
  },
  "elastic_agent": {
    "id": "2ae3f72c-87b7-4dbd-82ef-1a67b3fd161d",
    "snapshot": true,
    "version": "9.2.0"
  },
  "error": {
    "message": "Audit.TypeRequiringAdditionalPermissions: AF10001 The permission set (...) sent in the request does not include the expected permission."
  },
  "event": {
    "agent_id_status": "verified",
    "category": [
      "web"
    ],
    "dataset": "o365.audit",
    "ingested": "2025-08-04T10:37:02Z",
    "kind": "pipeline_error",
    "original": "null",
    "outcome": "success",
    "type": [
      "info"
    ]
  },
  "input": {
    "type": "cel"
  },
  "tags": [
    "preserve_original_event",
    "forwarded",
    "o365-cel"
  ]
}
Log Message
{
  "log.level": "error",
  "@timestamp": "2025-08-04T10:36:59.276Z",
  "message": "single event object returned by evaluation",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "cel-default",
    "type": "cel"
  },
  "log": {
    "source": "cel-default"
  },
  "log.origin": {
    "file.line": 433,
    "file.name": "cel/input.go",
    "function": "github.com/elastic/beats/v7/x-pack/filebeat/input/cel.input.run.func1"
  },
  "id": "cel-o365.audit-f8938c7f-bf6b-42a1-a582-7ad971dc0f98",
  "log.logger": "input.cel",
  "service.name": "filebeat",
  "input_source": "http://svc-o365-cel:8080",
  "input_url": "http://svc-o365-cel:8080",
  "error": {
    "message": "Audit.TypeRequiringAdditionalPermissions: AF10001 The permission set (...) sent in the request does not include the expected permission."
  },
  "ecs.version": "1.6.0"
}

{
"events_per_content_type": [],
"events_per_content_type": [{
"error": {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the HTTP status code and status to the error.code and error.idfields, respectively. I've been finding myself using those fields lot when they are available to filter or group the single event object returned by evaluation log messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Done in a new commit.

If there's a permission problem with just the DLP.All content type (which requires an extra permission) then there will be one response involved, but otherwise it'll likely be one per content type. The messages were already combined. The new code de-duplicates the error.code and error.id values. If there's one value it'll be a scalar (likely), otherwise an array.

The output looks like this for the test:

{
  ...
  "error": {
    "code": "401",
    "id": "401 Unauthorized",
    "message": "Audit.TypeRequiringAdditionalPermissions: AF10001 The permission set (...) sent in the request does not include the expected permission."
  },
  ...
}

@chrisberkhout chrisberkhout force-pushed the o365-error-fleet-status branch from 3a33fd4 to 811a328 Compare August 4, 2025 14:29
@chrisberkhout chrisberkhout force-pushed the o365-error-fleet-status branch from 7ea774f to 471b02e Compare August 6, 2025 09:16
@chrisberkhout chrisberkhout enabled auto-merge (squash) August 6, 2025 09:16
@chrisberkhout chrisberkhout force-pushed the o365-error-fleet-status branch from 471b02e to 24df591 Compare August 6, 2025 09:42
@chrisberkhout chrisberkhout merged commit 50b6b4f into elastic:main Aug 6, 2025
7 checks passed
@elasticmachine
Copy link

💚 Build Succeeded

History

cc @chrisberkhout

@elastic-sonarqube
Copy link

@elastic-vault-github-plugin-prod

Package o365 - 2.21.0 containing this change is available at https://epr.elastic.co/package/o365/2.21.0/

robester0403 pushed a commit to robester0403/integrations that referenced this pull request Aug 14, 2025
…stic#14771)

Instead of ignoring errors and returning successes, if there are any
errors while creating/checking subscriptions, return an error showing
the problem, wait the interval time, and repeat everything next time.

This is intended to avoid silencing errors such as AF10001 "The
permission set () sent in the request does not include the expected
permission.".
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request Integration:o365 Microsoft Office 365 Team:Security-Service Integrations Security Service Integrations team [elastic/security-service-integrations]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants