Skip to content

Conversation

@neinkeinkaffee
Copy link
Contributor

Retry the create/update operation for storage credentials if the error indicates that we may have to allow more time for IAM propagation. The timeout of 2 minutes could be further decreased, the current number is taken from corresponding timeouts for the IAM propagation case in the AWS Terraform provider.

@codecov-commenter
Copy link

codecov-commenter commented Jul 12, 2022

Codecov Report

Merging #1454 (f08833a) into master (73d5b17) will decrease coverage by 0.40%.
The diff coverage is 98.07%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1454      +/-   ##
==========================================
- Coverage   90.12%   89.71%   -0.41%     
==========================================
  Files         126      129       +3     
  Lines       10218    10446     +228     
==========================================
+ Hits         9209     9372     +163     
- Misses        642      695      +53     
- Partials      367      379      +12     
Impacted Files Coverage Δ
common/http.go 98.01% <ø> (-0.01%) ⬇️
mws/resource_mws_credentials.go 86.95% <96.29%> (+2.95%) ⬆️
catalog/resource_storage_credential.go 100.00% <100.00%> (ø)
exporter/importables.go 85.31% <0.00%> (-5.92%) ⬇️
exporter/util.go 79.55% <0.00%> (-1.13%) ⬇️
pipelines/resource_pipeline.go 92.17% <0.00%> (-0.97%) ⬇️
sql/resource_sql_widget.go 87.50% <0.00%> (-0.70%) ⬇️
sql/resource_sql_endpoint.go 97.43% <0.00%> (-0.13%) ⬇️
mws/resource_mws_workspaces.go 90.15% <0.00%> (-0.12%) ⬇️
... and 14 more

return resource.RetryContext(ctx, IAMPropagationTimeout,
func() *resource.RetryError {
cerr := f()
if e, ok := cerr.(common.APIError); ok && e.StatusCode == 403 && strings.Contains(cerr.Error(), "IAM role") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make error message matching a bit more concrete

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm now matching on a larger portion of the error message, maybe still not ideal, but hopefully a bit better than before

}

func waitIAMPropagation(ctx context.Context, c *common.DatabricksClient, f func() error) error {
return resource.RetryContext(ctx, IAMPropagationTimeout,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s a concept of “resource timeouts” that are configurable on instance levels. Take a look at sql endpoints, library, cluster, workspace resources to see how to add them. You’ll have to change method signature to accept that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now using the resource timeouts which I've set to 2 minutes. That should be fine for storage credential, I'm just wondering whether in general the IAM propagation timeout might potentially sometimes have to be more conservative than the general resource timeouts. If the IAM role takes more than 2 minutes than maybe something's really wrong, and the role isn't there or not properly configured. Whereas some resources may want to have longer timeouts for creation and update.

}.ToResource()
}

func waitIAMPropagation(ctx context.Context, c *common.DatabricksClient, f func() error) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename method to include retry

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed :)

@neinkeinkaffee
Copy link
Contributor Author

I have moved the retryOnError method into http.go in the common package in order to reuse it for the MWS credential. I'm not sure if that's the best location. I'm also wondering whether it might be better to keep the value of the timeout duration for IAM propagation separate from the resource timeouts for creation and update, given that one might not want to retry on IAM error until the overall resource creation/update timeout limit is reached. It would have the additional benefit of not having to pass down the resource timeout.

@nfx
Copy link
Contributor

nfx commented Jul 23, 2022

@neinkeinkaffee I’m not sure if common package is the right place for it. Cost of refactoring/supporting it in the common package is higher than copy paste of approximately 10 lines. So far i see it used only two times (for IAM roles). This is really retry-and-wait for error that might be a valid error after one or two minutes. Common package does another kind of error retries, which are immediate. More than that, common package is being refactored into standalone SDK as we speak, so I don’t want more things added over there.

bottom line: these retries are great, but I think they belong to mws and catalog packages.

@neinkeinkaffee
Copy link
Contributor Author

Makes sense! I've moved the retry code back into the catalog and mws.

@oleksandr-gubchenko
Copy link

any updates on this?

@pietern pietern requested review from a team as code owners June 5, 2024 08:35
@pietern pietern requested review from hectorcast-db and removed request for a team June 5, 2024 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ISSUE] Cannot create databricks_mws_credentials because aws_iam not ready

4 participants