-
Notifications
You must be signed in to change notification settings - Fork 472
[DRAFT] Retry storage credential create/update to allow for IAM propagation #1454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[DRAFT] Retry storage credential create/update to allow for IAM propagation #1454
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1454 +/- ##
==========================================
- Coverage 90.12% 89.71% -0.41%
==========================================
Files 126 129 +3
Lines 10218 10446 +228
==========================================
+ Hits 9209 9372 +163
- Misses 642 695 +53
- Partials 367 379 +12
|
| return resource.RetryContext(ctx, IAMPropagationTimeout, | ||
| func() *resource.RetryError { | ||
| cerr := f() | ||
| if e, ok := cerr.(common.APIError); ok && e.StatusCode == 403 && strings.Contains(cerr.Error(), "IAM role") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make error message matching a bit more concrete
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now matching on a larger portion of the error message, maybe still not ideal, but hopefully a bit better than before
| } | ||
|
|
||
| func waitIAMPropagation(ctx context.Context, c *common.DatabricksClient, f func() error) error { | ||
| return resource.RetryContext(ctx, IAMPropagationTimeout, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There’s a concept of “resource timeouts” that are configurable on instance levels. Take a look at sql endpoints, library, cluster, workspace resources to see how to add them. You’ll have to change method signature to accept that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now using the resource timeouts which I've set to 2 minutes. That should be fine for storage credential, I'm just wondering whether in general the IAM propagation timeout might potentially sometimes have to be more conservative than the general resource timeouts. If the IAM role takes more than 2 minutes than maybe something's really wrong, and the role isn't there or not properly configured. Whereas some resources may want to have longer timeouts for creation and update.
| }.ToResource() | ||
| } | ||
|
|
||
| func waitIAMPropagation(ctx context.Context, c *common.DatabricksClient, f func() error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename method to include retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed :)
|
I have moved the retryOnError method into http.go in the common package in order to reuse it for the MWS credential. I'm not sure if that's the best location. I'm also wondering whether it might be better to keep the value of the timeout duration for IAM propagation separate from the resource timeouts for creation and update, given that one might not want to retry on IAM error until the overall resource creation/update timeout limit is reached. It would have the additional benefit of not having to pass down the resource timeout. |
|
@neinkeinkaffee I’m not sure if common package is the right place for it. Cost of refactoring/supporting it in the common package is higher than copy paste of approximately 10 lines. So far i see it used only two times (for IAM roles). This is really retry-and-wait for error that might be a valid error after one or two minutes. Common package does another kind of error retries, which are immediate. More than that, common package is being refactored into standalone SDK as we speak, so I don’t want more things added over there. bottom line: these retries are great, but I think they belong to mws and catalog packages. |
|
Makes sense! I've moved the retry code back into the catalog and mws. |
|
any updates on this? |
Retry the create/update operation for storage credentials if the error indicates that we may have to allow more time for IAM propagation. The timeout of 2 minutes could be further decreased, the current number is taken from corresponding timeouts for the IAM propagation case in the AWS Terraform provider.