
[Logs] Improve retry logic to catch exceptions in the socket #24


Merged · 3 commits merged into master · Jan 24, 2018

Conversation

tmichelet (Contributor):

As we're sometimes seeing connection resets or SSLErrors
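The failure mode described here (connection resets and SSL errors surfacing from the intake socket) is the motivation for the retry. A minimal sketch of the pattern under discussion: `safe_submit_log` appears in the diff below, but the endpoint, the `connect()` helper, and the SSL wrapping are assumptions for illustration, not the actual implementation.

```python
import socket
import ssl

# Hypothetical intake endpoint; the real host/port are not shown in this PR.
HOST, PORT = 'intake.example.com', 10516


def connect():
    # Open a new SSL-wrapped TCP connection to the intake endpoint.
    raw = socket.create_connection((HOST, PORT))
    return ssl.create_default_context().wrap_socket(raw, server_hostname=HOST)


def safe_submit_log(s, log):
    # Try to send; on a reset or SSL error, reconnect once and retry.
    # A second failure propagates to the caller.
    try:
        s.sendall(log.encode('utf-8'))
    except (ssl.SSLError, OSError):
        s.close()
        s = connect()
        s.sendall(log.encode('utf-8'))
    return s
```

Returning the (possibly new) socket lets the caller keep reusing one connection across log lines.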
@tmichelet tmichelet requested review from NBParis and ajacquemot and removed request for ajacquemot January 18, 2018 10:39

@ajacquemot (Contributor) left a comment


I left some comments about exception handling and the socket life cycle.

for log in logs:
    s = safe_submit_log(s, log)
except Exception as e:
    print('Uncaught exception: {}'.format(str(e)))
Contributor:

This exception is caught, so you should change 'Uncaught' to 'Unexpected'. I would also print event for debugging purposes.

except Exception as e:
    print('Uncaught exception: {}'.format(str(e)))
finally:
    s.close()
Contributor:

I think you should close this socket all the time because it's in the lambda_handler scope.

for log in logs:
    send_entry(s, log)

logs = awslogs_handler(event)
except Exception as e:
    # Log the error through the socket
    err_message = 'Error parsing the object. Exception: {}'.format(str(e))
Contributor:

Same here, I would print the event too.

try:
    send_entry(s, log)
except Exception as e:
    err_message = 'Error sending the log line. Exception: {}'.format(str(e))
Contributor:

I would print the log here.

Contributor Author:

We're actually sending it to Datadog!

Contributor:

I meant doing something like:

err_message = 'Exception: {} occurred for log line: {}'.format(str(e), log)

Contributor Author:

Feedback from a customer: we should not log an error when one happened but the retry succeeded

@mnshdw left a comment:

OK, this will work reliably only if we can guarantee that at most a single node is unavailable at any given time.

When restarting instances, this should be the case if max_concurent_restart_percent is set to 0, which fortunately is its current value for intake nodes. It could be worth adding a note there to never change it because of the single retry in the lambda.

There are other scenarios where this might fail (e.g. trying on a node that is almost up, then retrying on a node that just went down), but this should alleviate issues for a while.

It could be worth bumping retries to 2 to be really bulletproof?

@tmichelet (Contributor Author):

@mnshdw would you mind explaining this?

Ok, this will work reliably only if we can guarantee a single node can be unavailable at any given time.


mnshdw commented Jan 18, 2018

@tmichelet What I meant is that if your retry happens on a node that is unavailable when you re-send the log (e.g. it is being restarted), you'll face the same issue as before. It is unlikely, but dropping logs is still possible.

@tmichelet (Contributor Author):

We now have CloudWatch logs when we're unable to submit data, so if this happens people will be able to catch it. I'd rather merge it as is, knowing that we might not use TCP anymore in the near future.

as otherwise we see errors in the logs, and this is confusing
If we successfully retry, then it is not an error
@tmichelet tmichelet merged commit d9f4726 into master Jan 24, 2018
@tmichelet tmichelet deleted the tristan/fix-logs-lambda-retry branch January 24, 2018 10:05