[Logs] Improve retry logic to catch exceptions in the socket #24

tmichelet · 2018-01-18T10:38:33Z

As we're sometimes seeing connection resets or SSLErrors.

ajacquemot

I left some comments about exception handling and the socket life cycle.

ajacquemot · 2018-01-18T10:46:27Z

Log/lambda_function.py

+        for log in logs:
+            s = safe_submit_log(s, log)
+    except Exception as e:
+        print('Uncaught exception: {}'.format(str(e)))


This exception is caught, so you should change "Uncaught" to "Unexpected". I would also print the event for debugging purposes.
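To illustrate the suggestion, here is a minimal sketch of a message that says "Unexpected" and includes the event. The helper name `format_unexpected` and the use of `json.dumps` to render the event are assumptions made for this example, not code from the PR.

```python
import json


def format_unexpected(e, event):
    """Build a debug message for an exception caught in the handler.

    Hypothetical helper: the PR itself only shows a print() call.
    """
    # default=str keeps json.dumps from failing on values that are
    # not JSON-serializable (datetimes, bytes, ...).
    return 'Unexpected exception: {} for event {}'.format(
        str(e), json.dumps(event, default=str))


def lambda_handler(event, context):
    try:
        # Stand-in for the real parsing and submission work.
        raise ValueError('boom')
    except Exception as e:
        print(format_unexpected(e, event))
```

Printing the whole event can be verbose for large payloads, so truncating it may be worth considering.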

ajacquemot · 2018-01-18T10:52:02Z

Log/lambda_function.py

+    except Exception as e:
+        print('Uncaught exception: {}'.format(str(e)))
+    finally:
+        s.close()


I think you should close this socket all the time because it's in the lambda_handler scope.
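One way to read this comment is that the socket should be closed on every path out of the handler. A minimal sketch of that life cycle follows; `run_with_socket`, `connect` and `submit` are hypothetical names introduced here so the example is self-contained.

```python
def run_with_socket(connect, logs, submit):
    """Open a socket, submit every log, and always close the socket.

    connect: callable returning a socket-like object (assumption).
    submit:  callable (socket, log) -> socket, mirroring the
             `s = safe_submit_log(s, log)` pattern in the diff.
    """
    s = connect()
    try:
        for log in logs:
            # submit may hand back a fresh socket after a reconnect
            s = submit(s, log)
    except Exception as e:
        print('Unexpected exception: {}'.format(str(e)))
    finally:
        # Runs whether the loop finished or an exception was caught
        s.close()
    return s
```

Because `s` is rebound inside the loop, the `finally` block closes whichever socket is current when the handler exits.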

ajacquemot · 2018-01-18T10:52:40Z

Log/lambda_function.py

-        for log in logs:
-            send_entry(s, log)
-
+            logs = awslogs_handler(event)
    except Exception as e:
        # Logs through the socket the error
        err_message = 'Error parsing the object. Exception: {}'.format(str(e))


Same here, I would print the event too.

ajacquemot · 2018-01-18T10:53:08Z

Log/lambda_function.py

+    try:
+        send_entry(s, log)
+    except Exception as e:
+        err_message = 'Error sending the log line. Exception: {}'.format(str(e))


I would print the log here.

tmichelet · 2018-01-18

We're actually sending it to Datadog!

ajacquemot · 2018-01-18

I meant doing something like:

err_message = 'Exception: {} occurred for log line: {}'.format(str(e), log)

tmichelet · 2018-01-22

Feedback from customer: we should not log an error when a submission failed but the retry succeeded.
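The outcome of this thread can be sketched roughly as follows: retry once on a fresh socket, and only report an error if the retry also fails. This is not the merged implementation. The extra `connect` and `send_entry` parameters are assumptions added to keep the example self-contained, and the error is printed here rather than sent through the socket.

```python
def safe_submit_log(s, log, connect, send_entry):
    """Send one log line, retrying once on a new socket.

    Returns the socket to use for the next log line.
    """
    try:
        send_entry(s, log)
    except Exception:
        # First failure: say nothing yet, a successful retry is not an error.
        try:
            s.close()
        except Exception:
            pass  # the old socket may already be broken
        # If reconnecting fails, the exception propagates to the caller.
        s = connect()
        try:
            send_entry(s, log)
        except Exception as e:
            # The retry failed too, so now it is worth reporting.
            print('Exception: {} occurred for log line: {}'.format(str(e), log))
    return s
```

The caller keeps using the returned socket, which matches the `s = safe_submit_log(s, log)` pattern in the diff.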

mnshdw

OK, this will work reliably only if we can guarantee that at most a single node is unavailable at any given time.

When restarting instances, this should be the case if max_concurent_restart_percent is set to 0, which fortunately is its current value for intake nodes. It could be worth adding a note there to never change it, because the lambda retries only once.

There are other scenarios where this might fail (e.g. trying on a node that is almost up, then retrying on a node that just went down), but this should alleviate issues for a while.

It could be worth bumping retries to 2 to be really bulletproof?
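For illustration, a configurable retry count could look like the sketch below. The function name, the `max_retries` parameter and the `(socket, error)` return value are all assumptions made for this example; the PR as merged retries once.

```python
def submit_with_retries(s, log, connect, send_entry, max_retries=2):
    """Try to send a log line, reconnecting between attempts.

    Makes up to max_retries + 1 attempts in total and returns
    (socket, error), where error is None on success.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            send_entry(s, log)
            return s, None
        except Exception as e:
            last_error = e
            if attempt < max_retries:
                try:
                    s.close()
                except Exception:
                    pass  # ignore errors on an already broken socket
                # A new connection may land on a different node
                s = connect()
    return s, last_error
```

Each extra retry only helps if the new connection reaches a healthy node, so this reduces the chance of dropping a log rather than eliminating it.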

tmichelet · 2018-01-18T11:23:35Z

@mnshdw would you mind explaining this?

> Ok, this will work reliably only if we can guarantee a single node can be unavailable at any given time.

mnshdw · 2018-01-18T15:03:06Z

@tmichelet What I meant is that if your retry lands on a node that is unavailable when you re-send the log (e.g. it is being restarted), you'll face the same issue as before. It is unlikely, but it is still possible to drop logs.

tmichelet · 2018-01-22T11:12:40Z

We now have CloudWatch logs when we're unable to submit data, so if this happens people will be able to catch it. I'd rather merge it as is, knowing that we might not use TCP anymore in the near future.

Otherwise we see errors in the logs, which is confusing. If we successfully retry, then it is not an error.

[Logs] Improve retry logic to catch exceptions in the socket

886bfc2

As we're sometimes seeing connection resets or SSLErrors.

tmichelet requested review from NBParis and ajacquemot and removed request for ajacquemot January 18, 2018 10:39

ajacquemot reviewed Jan 18, 2018

mnshdw reviewed Jan 18, 2018

[Logs] Print more information when catching an exception

75768b7

[Logs] Avoid logging an error when we successfully retry submission

19fd398

Otherwise we see errors in the logs, which is confusing. If we successfully retry, then it is not an error.

NBParis approved these changes Jan 24, 2018

tmichelet merged commit d9f4726 into master Jan 24, 2018

tmichelet deleted the tristan/fix-logs-lambda-retry branch January 24, 2018 10:05
