Skip to content

add multiline log support for s3 logs #111

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 13, 2019

Conversation

ericmustin
Copy link
Contributor

What does this PR do?

Adds multiline log support for s3 logs that are multiline

Motivation

S3 log buckets that have multiline logs can now be supported similar to how log_processing_rules work in other portions of the Datadog ecosystem. I have seen a few gists/hacks floating around on modifying the Lambda to handle multiline logs, however given the KMS support this lambda offers and that it is regularly updated+maintained, I figured it would make more sense to add an optional configuration for this lambda rather than support an entirely different multiline log lambda

Additional Notes

I don't believe it would be possible to handle multiline logs from cloudwatch with this approach, but if I am mistaken, i would be happy to try to generalize this to other sources (such as cloudwatch) if there was consensus that was achievable

@NBParis NBParis requested a review from a team May 7, 2019 12:04
Copy link

@NBParis NBParis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for putting this together @ericmustin.

I'll let @DataDog/logs-intake do the final review before merging.

@@ -277,8 +281,14 @@ def s3_handler(event, context, metadata):
)
yield structured_line
else:
# Check if using multiline log regex pattern
if DD_MULTILINE_LOG_REGEX_PATTERN:
split_data = re.compile("(?<!^)\s+(?=%s)(?!.\s)" % (DD_MULTILINE_LOG_REGEX_PATTERN,)).split(data)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens when there is no match ? i.e. there is two different kind of inputs, a pattern separated one and a line separated one ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point, in the case where there is both types of s3 logs and the format varies, the line separated kind would not get split out into separate log lines. It assumes each log line, single or multi, begins with the same pattern at the start of the line ie (?<!^)\s+(?=<insert_pattern>)

Open to suggestions on how to approach this, I pushed up a somewhat hacky attempt to solve this (i think it is a solution at least ) which checks, when a multiline_regex env var is supplied, if there are any matches for that pattern ,and if there aren't then assumes it is line separated and calls .splitlines() as usual. My understanding is that every event
passed into s3_handler() would contain only 1 type, either pattern or line separated, but if that is not the case please let me know.

Also open to scrapping this we feel like it opens up a huge rabbit hole

Copy link
Contributor

@ajacquemot ajacquemot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

# handle case where lambda is passed both line and pattern separated logs
# if there is only a single log and that log does not start with multiline regex pattern
# assume that these are line separated logs, not pattern separated
if len(split_data) <= 1 and not multiline_regex_start_pattern.match(data):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably remove this check and change this check https://github.com/DataDog/datadog-serverless-functions/pull/111/files#diff-618731ddf1c446ef45d09613cb4b189bR287 to:

if DD_MULTILINE_LOG_REGEX_PATTERN and multiline_regex_start_pattern.match(data):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I think that we should change a bit multiline_regex_start_pattern to make sure it starts from the beginning i.e.:

multiline_regex = re.compile("^{}".format(DD_MULTILINE_LOG_REGEX_PATTERN))

Copy link
Contributor Author

@ericmustin ericmustin May 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, updated this to clean up the nested if statements. Fwiw, I think .match implicitly ensures the pattern starts from the beginning of the string, but i agree it's good to explicitly put this in the regular expression, updated to reflect that.

Copy link
Contributor

@ajacquemot ajacquemot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did not test this functionally but if this is working fine, let's 🚢 it.

@NBParis NBParis merged commit 35448ec into DataDog:master May 13, 2019
@ericmustin ericmustin deleted the add_multiline_log_support branch May 13, 2019 17:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants