-
Notifications
You must be signed in to change notification settings - Fork 389
add multiline log support for s3 logs #111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks for putting this together @ericmustin.
I'll let @DataDog/logs-intake do the final review before merging.
@@ -277,8 +281,14 @@ def s3_handler(event, context, metadata): | |||
) | |||
yield structured_line | |||
else: | |||
# Check if using multiline log regex pattern | |||
if DD_MULTILINE_LOG_REGEX_PATTERN: | |||
split_data = re.compile("(?<!^)\s+(?=%s)(?!.\s)" % (DD_MULTILINE_LOG_REGEX_PATTERN,)).split(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens when there is no match ? i.e. there is two different kind of inputs, a pattern separated one and a line separated one ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point, in the case where there is both types of s3 logs and the format varies, the line separated kind would not get split out into separate log lines. It assumes each log line, single or multi, begins with the same pattern at the start of the line ie (?<!^)\s+(?=<insert_pattern>)
Open to suggestions on how to approach this, I pushed up a somewhat hacky attempt to solve this (i think it is a solution at least ) which checks, when a multiline_regex env var is supplied, if there are any matches for that pattern ,and if there aren't then assumes it is line separated and calls .splitlines()
as usual. My understanding is that every event
passed into s3_handler()
would contain only 1 type, either pattern or line separated, but if that is not the case please let me know.
Also open to scrapping this we feel like it opens up a huge rabbit hole
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good but I just have concern, see https://github.com/DataDog/datadog-serverless-functions/pull/111/files#r281624640
# handle case where lambda is passed both line and pattern separated logs | ||
# if there is only a single log and that log does not start with multiline regex pattern | ||
# assume that these are line separated logs, not pattern separated | ||
if len(split_data) <= 1 and not multiline_regex_start_pattern.match(data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would probably remove this check and change this check https://github.com/DataDog/datadog-serverless-functions/pull/111/files#diff-618731ddf1c446ef45d09613cb4b189bR287 to:
if DD_MULTILINE_LOG_REGEX_PATTERN and multiline_regex_start_pattern.match(data):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also I think that we should change a bit multiline_regex_start_pattern
to make sure it starts from the beginning i.e.:
multiline_regex = re.compile("^{}".format(DD_MULTILINE_LOG_REGEX_PATTERN))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, updated this to clean up the nested if statements. Fwiw, I think .match
implicitly ensures the pattern starts from the beginning of the string, but i agree it's good to explicitly put this in the regular expression, updated to reflect that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did not test this functionally but if this is working fine, let's 🚢 it.
What does this PR do?
Adds multiline log support for s3 logs that are multiline
Motivation
S3 log buckets that have multiline logs can now be supported similar to how log_processing_rules work in other portions of the Datadog ecosystem. I have seen a few gists/hacks floating around on modifying the Lambda to handle multiline logs, however given the KMS support this lambda offers and that it is regularly updated+maintained, I figured it would make more sense to add an optional configuration for this lambda rather than support an entirely different multiline log lambda
Additional Notes
I don't believe it would be possible to handle multiline logs from cloudwatch with this approach, but if I am mistaken, i would be happy to try to generalize this to other sources (such as cloudwatch) if there was consensus that was achievable