engine: expose internal logging call counts as internal metrics #10326

Merged
edsiper merged 6 commits into fluent:master from alec/2025-05-08-expose-log-counts
May 28, 2025

Conversation

alecholmes
Contributor

This PR adds a new v2 runtime metric that exposes the number of logger calls by message type. A fluent-bit process consistently logging errors can be indicative of significant configuration or infrastructure problems. A common pattern for observing failures across many instances of software is to expose failures as metric counters that can then be observed and alerted on.

The implementation piggybacks on the src/flb_log.c logging library already extracting a worker context from the current thread.

Here is example output from curling a Fluent Bit instance with the service http_server enabled:

> curl localhost:5432/api/v2/metrics/prometheus 2>&1 | grep logger

fluentbit_logger_logs_total{severity="error"} 2
fluentbit_logger_logs_total{severity="warn"} 0
fluentbit_logger_logs_total{severity="info"} 10
fluentbit_logger_logs_total{severity="debug"} 0
fluentbit_logger_logs_total{severity="trace"} 0
fluentbit_logger_logs_total{severity="help"} 0
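
To illustrate the counting pattern described above, here is a minimal sketch in plain C. It is not the code from this PR and uses neither cmetrics nor any Fluent Bit API: `sketch_log_counters`, `sketch_count_log`, and `sketch_render` are hypothetical names, and the severity numbering (contiguous, starting at 1 for error) is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical severity numbering for this sketch: contiguous, starting at 1. */
enum {
    SKETCH_ERROR = 1, SKETCH_WARN, SKETCH_INFO,
    SKETCH_DEBUG, SKETCH_TRACE, SKETCH_HELP
};
#define SKETCH_SEVERITY_COUNT 6

static const char *const sketch_names[SKETCH_SEVERITY_COUNT] = {
    "error", "warn", "info", "debug", "trace", "help"
};

struct sketch_log_counters {
    uint64_t total[SKETCH_SEVERITY_COUNT];
};

/* Increment the counter for one severity; returns -1 for an unknown severity. */
static int sketch_count_log(struct sketch_log_counters *c, int severity)
{
    if (c == NULL || severity < SKETCH_ERROR || severity > SKETCH_HELP) {
        return -1;
    }
    c->total[severity - 1]++;
    return 0;
}

/*
 * Render all counters in the Prometheus text style shown above.
 * Returns the number of bytes written, or -1 if the buffer is too small.
 */
static int sketch_render(const struct sketch_log_counters *c,
                         char *buf, size_t size)
{
    size_t off = 0;
    int i;
    int n;

    assert(buf != NULL && size > 0);

    for (i = 0; i < SKETCH_SEVERITY_COUNT; i++) {
        n = snprintf(buf + off, size - off,
                     "fluentbit_logger_logs_total{severity=\"%s\"} %llu\n",
                     sketch_names[i], (unsigned long long) c->total[i]);
        if (n < 0 || (size_t) n >= size - off) {
            return -1;
        }
        off += (size_t) n;
    }
    return (int) off;
}
```

A caller would zero the struct once, call `sketch_count_log` from its logging path, and call `sketch_render` when the endpoint is scraped. The sketch is single-threaded; counters shared across worker threads would need atomics or a lock.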

Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • [N/A] Example configuration file for the change
  • Debug log output from testing the change (see example from curling above)
  • Attached Valgrind output that shows no leaks or memory corruption was found
> valgrind -s bin/flb-rt-core_internal_logger

SUCCESS: All unit tests have passed.
==118424==
==118424== HEAP SUMMARY:
==118424==     in use at exit: 0 bytes in 0 blocks
==118424==   total heap usage: 2,098 allocs, 2,098 frees, 720,751 bytes allocated
==118424==
==118424== All heap blocks were freed -- no leaks are possible
==118424==
==118424== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • [N/A] Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature (I will create a docs PR to update the metric name table once this PR is approved)

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

return NULL;
}

ret_ctx->u = flb_upstream_create(ret_ctx->config, "127.0.0.1", 2020, 0, NULL);
Contributor Author

Is there any pattern or prior art for picking random free ports in tests?

Collaborator

There isn't, but we can talk about it if you're interested in making that improvement; I'd appreciate it.
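
One common approach to the question above, sketched here as an assumption rather than an existing Fluent Bit helper, is to bind a socket to port 0 and ask the kernel which port it assigned. `sketch_pick_free_port` is a hypothetical name; the calls are standard POSIX sockets.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical test helper: ask the kernel for an unused TCP port on the
 * loopback interface. Returns the port in host byte order, or -1 on failure.
 */
static int sketch_pick_free_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;
    int ret;
    int port;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* port 0 lets the OS choose a free port */

    ret = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
    if (ret == -1) {
        close(fd);
        return -1;
    }

    /* Read back the address to learn which port was assigned. */
    ret = getsockname(fd, (struct sockaddr *) &addr, &len);
    if (ret == -1) {
        close(fd);
        return -1;
    }
    assert(addr.sin_family == AF_INET);

    port = ntohs(addr.sin_port);
    close(fd);
    return port;
}
```

The returned port could replace the hard-coded 2020 in a test. There is a small race: another process may claim the port between `close` and the test's own bind, so a more robust variant would keep the socket open and hand it to the server under test.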

@alecholmes alecholmes force-pushed the alec/2025-05-08-expose-log-counts branch from d83dbeb to 68ede1c Compare May 13, 2025 15:52
@alecholmes alecholmes requested a review from niedbalski as a code owner May 13, 2025 15:52
@alecholmes alecholmes force-pushed the alec/2025-05-08-expose-log-counts branch from 68ede1c to e61c2c0 Compare May 13, 2025 17:11
@alecholmes
Contributor Author

I'm not convinced this PR introduced the fuzzer failures since I'm able to repro them on master.

The signv4_fuzzer failure, for example, seems to have been introduced at some point between 352bb31 and 6899dc1 -- it's hard to bisect because the commits in the middle of that range do not compile.

@@ -164,6 +164,7 @@ struct flb_config {
/* Logging */
char *log_file;
struct flb_log *log;
struct flb_log_metrics *log_metrics_ctx; /* Global metrics for logging calls */
Collaborator

Isn't this metrics context supposed to be limited to the logger instance? If that's the case then it should be declared inside of flb_log instead of flb_config.

Contributor Author

I've moved the metrics to flb_log, and their lifecycle is now managed by flb_log_create and flb_log_destroy.

@@ -232,6 +242,8 @@ static inline int flb_log_suppress_check(int log_suppress_interval, const char *
int flb_log_worker_init(struct flb_worker *worker);
int flb_log_worker_destroy(struct flb_worker *worker);
int flb_errno_print(int errnum, const char *file, int line);
struct flb_log_metrics *flb_log_metrics_create();
Collaborator

These two shouldn't be public as they should be only invoked by flb_log_create and flb_log_destroy.

Contributor Author

Removed from the header file.

src/flb_config.c Outdated
@@ -391,6 +391,14 @@ struct flb_config *flb_config_init()
flb_regex_init();
#endif

/* Create internal logger metrics */
Collaborator

This should be moved to flb_log_create

Contributor Author

Done.

src/flb_log.c Outdated
@@ -573,39 +597,32 @@ int flb_log_construct(struct log_message *msg, int *ret_len,
int total;
time_t now;
const char *header_color = NULL;
-const char *header_title = NULL;
+const char *header_title = flb_log_message_type_str(type);
Collaborator

Please don't make function calls in the middle of the declarations, initialize it to NULL and move the function call to line 606.

Contributor Author

Fixed.

src/flb_log.c Outdated
@@ -564,6 +566,28 @@ struct flb_log *flb_log_create(struct flb_config *config, int type,
return log;
}

static inline char *flb_log_message_type_str(int type)
Collaborator

Please change the return type to const char * to match the current usage.

Contributor Author

Done.
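
The two review points on this helper (return `const char *`, and keep function calls out of the declaration block) might look roughly like the sketch below. The names `sketch_message_type_str` and `sketch_format_header`, the numeric type values, and the `[title]` output format are all assumptions for illustration, not the merged code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Returns a pointer to a string literal, so the return type is const. */
static const char *sketch_message_type_str(int type)
{
    switch (type) {
    case 1:
        return "error";
    case 2:
        return "warn";
    case 3:
        return "info";
    default:
        return NULL;
    }
}

/* Writes "[title]" into out; returns its length, or -1 on failure. */
static int sketch_format_header(int type, char *out, size_t size)
{
    /* Declarations first, initialized to neutral values only. */
    const char *header_title = NULL;
    int len = 0;

    assert(out != NULL && size > 0);

    /* Function calls happen after the declaration block. */
    header_title = sketch_message_type_str(type);
    if (header_title == NULL) {
        return -1;
    }

    len = snprintf(out, size, "[%s]", header_title);
    if (len < 0 || (size_t) len >= size) {
        return -1;
    }
    return len;
}
```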

src/flb_log.c Outdated
return lm;

error:
if (lm && lm->logs_total_counter) {
Collaborator

If this code was to stay this would be a much cleaner structure which I'd ask you to follow:

    if (lm != NULL) {
        if (lm->logs_total_counter != NULL) {
            cmt_counter_destroy(lm->logs_total_counter);
        }

        if (lm->cmt != NULL) {
            cmt_destroy(lm->cmt);
        }

        flb_free(lm);
    }

Contributor Author

I reverted to not using goto. The pattern of using goto to jump to a cleanup section that effectively unwinds the allocations was something Phil and I had talked a bit about ahead of the PR, but I can bring that up as a discussion separately.
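
For readers unfamiliar with the goto pattern mentioned above, here is a generic sketch of it, assuming a struct with two heap members. It is not code from this PR; `sketch_pair`, `sketch_pair_create`, and the `fail_at` test hook are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_pair {
    char *first;
    char *second;
};

/*
 * fail_at simulates an allocation failure for demonstration:
 * 0 = no failure, 1 = first member fails, 2 = second member fails.
 */
static struct sketch_pair *sketch_pair_create(int fail_at)
{
    struct sketch_pair *p = NULL;

    assert(fail_at >= 0 && fail_at <= 2);

    p = calloc(1, sizeof(*p));
    if (p == NULL) {
        goto error;
    }

    p->first = (fail_at == 1) ? NULL : malloc(16);
    if (p->first == NULL) {
        goto error;
    }

    p->second = (fail_at == 2) ? NULL : malloc(16);
    if (p->second == NULL) {
        goto error;
    }

    return p;

error:
    /* Single exit point that unwinds whatever was allocated so far. */
    if (p != NULL) {
        free(p->second);
        free(p->first);
        free(p);
    }
    return NULL;
}

static void sketch_pair_destroy(struct sketch_pair *p)
{
    if (p == NULL) {
        return;
    }
    free(p->second);
    free(p->first);
    free(p);
}
```

The trade-off under discussion is that every failure path funnels through one cleanup label, at the cost of non-linear control flow.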

src/flb_log.c Outdated
break;
}

if (cmt_counter_set(lm->logs_total_counter, ts, 0, 1, (char *[]) {message_type_str}) == -1) {
Collaborator

Please don't put function calls inside conditions. It's sometimes OK for simple functions such as strlen, but that is not the case here: declare a variable, store the result value there, and then compare it.

The compiler will optimize it anyway and the result code will remain in a register in 99% of the cases, especially in the modern fastcall-ish conventions (amd64 & arm).

src/flb_log.c Outdated
* Initialize counters for all log message types to 0.
* This assumes types are contiguous starting at 1 (FLB_LOG_ERROR).
*/
i = 1;
Collaborator

i is already initialized by the for loop; remove this.
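
Taken together with the earlier comment about keeping function calls out of conditions, the initialization loop might be shaped like the sketch below. It uses a stand-in counter rather than cmetrics; `sketch_counter`, `sketch_counter_set`, and the type range 1 to 6 are assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_TYPE_FIRST 1 /* assumed to correspond to FLB_LOG_ERROR */
#define SKETCH_TYPE_LAST  6 /* assumed last contiguous message type */

/* Stand-in for a labeled counter; index 0 is unused. */
struct sketch_counter {
    uint64_t values[SKETCH_TYPE_LAST + 1];
};

/* Sets one series; returns -1 for an out-of-range message type. */
static int sketch_counter_set(struct sketch_counter *c, int type,
                              uint64_t value)
{
    if (c == NULL || type < SKETCH_TYPE_FIRST || type > SKETCH_TYPE_LAST) {
        return -1;
    }
    c->values[type] = value;
    return 0;
}

/* Initializes every message type's series to 0. */
static int sketch_counters_init(struct sketch_counter *c)
{
    int i;
    int ret;

    assert(c != NULL);

    /* The for statement initializes i; no separate assignment is needed. */
    for (i = SKETCH_TYPE_FIRST; i <= SKETCH_TYPE_LAST; i++) {
        /* Store the result first, then compare it. */
        ret = sketch_counter_set(c, i, 0);
        if (ret == -1) {
            return -1;
        }
    }
    return 0;
}
```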

src/flb_log.c Outdated
/* Frees the metrics instance and its associated resources. */
void flb_log_metrics_destroy(struct flb_log_metrics *metrics)
{
if (metrics != NULL && metrics->cmt != NULL) {
Member

metrics != NULL should be checked only once at the beginning

Contributor Author

I had gotten earlier feedback to have this function clean up partially initialized flb_log_metrics* instances. Here's what I just pushed to do the safe minimum number of null checks:

void flb_log_metrics_destroy(struct flb_log_metrics *metrics)
{
    if (metrics == NULL) {
        return;
    }
    if (metrics->cmt != NULL) {
        cmt_destroy(metrics->cmt);
    }
    flb_free(metrics);
}
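
A destroy function shaped like the one above lets a constructor clean up after a partial failure without goto. The sketch below shows that pairing with stand-in types and plain `calloc`/`free`. `sketch_cmt`, `sketch_log_metrics`, and the `fail_cmt` test hook are hypothetical and do not reflect the real `flb_log_metrics` layout.

```c
#include <assert.h>
#include <stdlib.h>

struct sketch_cmt {
    int placeholder; /* stand-in for a metrics context */
};

struct sketch_log_metrics {
    struct sketch_cmt *cmt;
};

/* Safe on NULL and on partially initialized instances. */
static void sketch_metrics_destroy(struct sketch_log_metrics *metrics)
{
    if (metrics == NULL) {
        return;
    }
    if (metrics->cmt != NULL) {
        free(metrics->cmt);
    }
    free(metrics);
}

/* fail_cmt != 0 simulates the inner allocation failing. */
static struct sketch_log_metrics *sketch_metrics_create(int fail_cmt)
{
    struct sketch_log_metrics *metrics;

    metrics = calloc(1, sizeof(*metrics));
    if (metrics == NULL) {
        return NULL;
    }

    if (!fail_cmt) {
        metrics->cmt = calloc(1, sizeof(*metrics->cmt));
    }
    if (metrics->cmt == NULL) {
        /* cmt is still NULL here, which destroy tolerates. */
        sketch_metrics_destroy(metrics);
        return NULL;
    }

    return metrics;
}
```

Because the outer struct is zeroed by `calloc`, every member pointer starts as NULL, which is what makes the per-member checks in the destroy function reliable.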

src/flb_sds.c Outdated
@@ -33,7 +33,7 @@
#include <stdarg.h>
#include <ctype.h>

-static flb_sds_t sds_alloc(size_t size)
+static flb_sds_t flb_sds_alloc_internal(size_t size)
Member

since CFL was already updated, it seems these changes in flb_sds are not necessary

Contributor Author

👍 Just rebased and reverted this rename.

Internal logger calls increment a new v2 metric exposed by the HTTP server
Prometheus scrape endpoint. There is one time series per log message type.

Signed-off-by: Alec Holmes <[email protected]>
@alecholmes
Contributor Author

@edsiper Thanks for taking a look. Pushed changes addressing your feedback in a new temporary commit.

@edsiper edsiper added this to the Fluent bit v4.0.3 milestone May 28, 2025
@edsiper edsiper merged commit 967c48f into fluent:master May 28, 2025
49 checks passed
@edsiper
Member

edsiper commented May 28, 2025

by mistake I merged before squashing commits.. fixing that up
