-
Notifications
You must be signed in to change notification settings - Fork 389
[Logs] Improve retry logic to catch exceptions in the socket #24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
As we're sometimes seeing connections resets, or SSLErrors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments about exceptions handling and the socket life cycle.
Log/lambda_function.py
Outdated
for log in logs: | ||
s = safe_submit_log(s, log) | ||
except Exception as e: | ||
print('Uncaught exception: {}'.format(str(e))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This exception is caught, you should change Uncaught
to Unexpected
. I would also print event
for debugging purpose.
Log/lambda_function.py
Outdated
except Exception as e: | ||
print('Uncaught exception: {}'.format(str(e))) | ||
finally: | ||
s.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should close this socket all the time because it's in the lambda_handler
scope.
Log/lambda_function.py
Outdated
for log in logs: | ||
send_entry(s, log) | ||
|
||
logs = awslogs_handler(event) | ||
except Exception as e: | ||
# Logs through the socket the error | ||
err_message = 'Error parsing the object. Exception: {}'.format(str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, I would print the event too.
Log/lambda_function.py
Outdated
try: | ||
send_entry(s, log) | ||
except Exception as e: | ||
err_message = 'Error sending the log line. Exception: {}'.format(str(e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would print the log here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're actually sending it to datadog!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant doing something like:
err_message = 'Exception: {} occured for log line: {}'.format(str(e), log)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feedback from customer: we should not log when an error happened but we were able to retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this will work reliably only if we can guarantee a single node can be unavailable at any given time.
When restarting instances, this should be the case if max_concurent_restart_percent
is set to 0, which fortunately is its current value for intake nodes. It could be worth adding a note there to never change it because of the single retry in the lambda.
There are other scenarios where this might fail (eg. trying on node almost up, retrying on a node just down) but this should alleviate issues for a while.
It could be worth bumping retries to 2 to be really bulletproof ?
@mnshdw would you mind explaining this?
|
@tmichelet What I meant is that if your retry happens on a node that is unavailable when you re-send the log (eg. it is being restarted), you'll face the same issue as before. This is unlikely but still possible to drop logs. |
We now have CW logs when we're unable to submit data, so if this happens people will be able to catch it. I'd rather merge it as is, knowing that we might not use TCP anymore in the near future |
as otherwise we see errors in the logs, and this is confusing If we successfully retry, then it is not an error
As we're sometimes seeing connections resets, or SSLErrors