fix(s3-deployment): optimize memory usage for large files #34020
Conversation
The S3 deployment Lambda handler was reading entire files into memory to check whether they were JSON, causing timeouts and memory issues with large files (10MB+). This change improves memory efficiency by:

1. Using the file extension to determine whether a file is JSON instead of reading its content
2. Only loading the entire file for JSON processing when necessary
3. Maintaining special handling for JSON files with double quotes in markers

Performance testing shows:
- Standard operations stay under 32MB of memory usage
- Even complex JSON files with double quotes stay under 256MB
- Processing time is comparable to the previous implementation
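The line-by-line processing described above can be sketched as follows. This is a minimal illustration, not the actual handler code; the function name and the shape of the markers dict are assumptions:

```python
import os

def replace_markers_streaming(file_path, markers):
    """Replace marker tokens line by line so only one line is held in memory.

    Sketch of the streaming approach: non-JSON files never need to be
    loaded whole. Each line is rewritten to a temporary file and the
    original is swapped out atomically at the end.
    """
    tmp_path = file_path + ".tmp"
    with open(file_path, "r", encoding="utf-8") as src, \
         open(tmp_path, "w", encoding="utf-8") as dst:
        for line in src:
            for token, replacement in markers.items():
                line = line.replace(token, replacement)
            dst.write(line)
    os.replace(tmp_path, file_path)
```

Peak memory with this shape is bounded by the longest line rather than the file size, which is what keeps standard operations under the 32MB figure quoted above.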
Codecov Report: all modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@           Coverage Diff           @@
##             main   #34020   +/-  ##
=======================================
  Coverage   83.98%   83.98%
=======================================
  Files         120      120
  Lines        6976     6976
  Branches     1178     1178
=======================================
  Hits         5859     5859
  Misses       1005     1005
  Partials      112      112
=======================================
Exemption Request: The issue involves memory usage with large files (10MB+), and adding an integ test would require adding such files to the repository. Instead, I've implemented comprehensive local testing with Docker-based isolated test runs that measure memory usage across various file types and sizes. This approach provides more targeted performance metrics than a standard integration test could, as it includes specific memory instrumentation that captures the exact behavior being fixed. The detailed performance metrics included in this PR demonstrate the effectiveness of the solution without needing to commit large test files to the repository.
-$DOCKER_CMD build .
+$DOCKER_CMD build --no-cache -t s3-deployment-test .
+$DOCKER_CMD run --rm s3-deployment-test
How and where/when does this test get run?
They are manually run by the developer when testing their changes on the custom resource's code.
I also ran into the memory issue and we don't use any markers. I almost opened a PR until I saw this one. I would like to suggest moving the check for no markers up, so that replacement is skipped before any file is touched:

# extract archive and replace markers in output files
def extract_and_replace_markers(archive, contents_dir, markers):
    with ZipFile(archive, "r") as zip:
        zip.extractall(contents_dir)
        # replace markers for this source if there are any markers
        if markers:
            for file in zip.namelist():
                file_path = os.path.join(contents_dir, file)
                if os.path.isdir(file_path): continue
                replace_markers(file_path, markers)
…memory usage

This change addresses issue aws#34002, where the S3 deployment Lambda function experiences memory issues with large JSON files. The fix:
- Adds a new `jsonAwareSourceProcessing` boolean property to BucketDeploymentProps
- Implements efficient marker detection using line-by-line processing
- Optimizes memory usage by only loading entire files when necessary
- Updates tests to verify both processing modes
- Updates documentation with usage examples and memory considerations

By default, the property is set to false for backward compatibility.
✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.
That's a valid proposition. However, I want to be able to test the changes without having to create a zip file. So I will keep the test at the beginning of
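The early-exit pattern being discussed here could look roughly like this. It is a sketch only; the routine name and body are illustrative, not the handler's actual code:

```python
def replace_markers(file_path, markers):
    """Sketch of keeping the no-markers check at the start of the routine.

    Placing the early exit here (rather than only in the zip-extraction
    loop) means it can be exercised by a unit test without having to
    build a zip archive first.
    """
    if not markers:
        return False  # nothing to do; the file is never opened
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    for token, replacement in markers.items():
        content = content.replace(token, replacement)
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)
    return True
```

Note the illustrative body loads the whole file for brevity; the PR's actual handler avoids that for large files.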
chunk += ` "name": "Item ${lineNum}",\n`;
chunk += ` "hash": "${crypto.createHash('sha256').update(lineNum.toString()).digest('hex')}",\n`;
chunk += ' "properties": {\n';
chunk += ` "description": "${lineContent.replace(/"/g, '\\"')}",\n`;
Check failure: Code scanning / CodeQL: Incomplete string escaping or encoding (High, test)
chunk += ` }${bytesWritten + chunk.length + 6 < totalBytes ? ',\n' : '\n'}`;
} else {
// Simple items for the rest
chunk += ` { "id": "${lineNum}", "value": "${lineContent.replace(/"/g, '\\"')}" }${bytesWritten + chunk.length + 6 < totalBytes ? ',\n' : '\n'}`;
Check failure: Code scanning / CodeQL: Incomplete string escaping or encoding (High, test)
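The CodeQL finding flags the JavaScript test-data generator's `replace(/"/g, '\\"')`, which escapes double quotes but not pre-existing backslashes. The same pitfall, and the usual fix of delegating to the JSON serializer, can be illustrated in Python:

```python
import json

def naive_escape(s):
    # Mirrors the flagged pattern: escapes double quotes but leaves
    # backslashes alone, so input that already contains a backslash
    # produces a corrupted (or even invalid) JSON string.
    return s.replace('"', '\\"')

def safe_escape(s):
    # Delegate all escaping to the JSON encoder, then strip the
    # surrounding quotes to keep the inner, fully escaped string.
    return json.dumps(s)[1:-1]

value = 'path\\to "file"'  # contains a literal backslash
assert json.loads('"%s"' % safe_escape(value)) == value
# The naive version leaves the backslash + 't' as a JSON \t escape,
# so the round-trip silently yields a tab character instead:
assert json.loads('"%s"' % naive_escape(value)) != value
```

In JavaScript the analogous fix would be `JSON.stringify(lineContent).slice(1, -1)`, though the generator here only produces test data.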
We are now using a safe JSON value for marker replacement and avoid loading the complete JSON file in memory. This change also adds a way to trigger the new processing when creating the source data. This now limits the processing to the asset level instead of the deployment level.
…roduction rollback
...s/@aws-cdk/custom-resource-handlers/lib/aws-s3-deployment/bucket-deployment-handler/index.py
...ramework-integ/test/aws-codepipeline-actions/test/integ.cross-account-pipeline-cfn-action.ts
@@ -197,7 +197,7 @@
       }
     },
     "flattenResponse": "false",
-    "salt": "1743042983138"
+    "salt": "1744306931931"
Why is this changing?
This is set by the assertions framework:

salt: Date.now().toString(),
@@ -1,28 +1,29 @@
 {
-  "version": "39.0.0",
+  "version": "41.0.0",
Is this changing as expected? If so, why?
I want to make sure this PR is not unintentionally updating some dependencies
I believe this version is updated every time we update an integ test snapshot, but I cannot tell where its value is coming from.
@@ -41,7 +41,7 @@ new CfnOutput(stack, 'SecretValue', { value: tokenizedValue });
 // Add new file with secret value that needs proper escaping
 const file6 = Source.jsonData('my-json/secret-config.json', {
   secret_value: tokenizedValue, // Using the tokenized value explicitly
-});
+}, true);
Two points:

1. I think we should take a configuration/options object instead of a boolean, like so:

const file6 = Source.jsonData('my-json/secret-config.json', {
  secret_value: tokenizedValue, // Using the tokenized value explicitly
}, { escape: true }); // <-- probably need a better name than just `escape`?

This is so that, in case we have more options or flags in the future, we do not end up with a long parameter list that may be confusing.

2. We need a test for the default behaviour as well (which I believe is no escaping).
// Log file size for verification
const stats = fs.statSync(filePath);
process.stderr.write(`Generated ${filePath} with size: ${(stats.size / (1024 * 1024)).toFixed(2)}MB\n`);
Why do we write to stderr? The same comment applies to the similar lines at the bottom of this file.
It was for testing purposes; I will remove it since it doesn't show by default when running the integ test (I mean without the -vvv option).
});

// Verify the content is valid JSON and properly escaped
assertionProvider.expect(ExpectedResult.objectLike({
Is this the same test as the one in packages/@aws-cdk-testing/framework-integ/test/aws-s3-deployment/test/integ.bucket-deployment-data.ts (higher up at this line)?
Yes
content = file.read().rstrip()
jsonObject = json.loads(content)  # valid json
self.assertEqual(content, '{"root": {"key_with_quote": "prefixvalue\\"with\\"quotessuffix", "key_without_quote": "prefixboom-marker2-replacedsuffix"}}')
for json_aware_source_processing in [True, False]:
nit: I would recommend breaking this test into two so it is easier to understand.
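The suggested split, one test method per processing mode instead of a `for` loop over both, could look like this. The assertion body is a hypothetical stand-in; the real tests exercise the handler:

```python
import unittest

class TestMarkerReplacement(unittest.TestCase):
    # Sketch of the suggested split: instead of looping over
    # [True, False] inside one test, each mode gets its own test
    # method so a failure immediately identifies which mode broke.

    def _run_replacement(self, json_aware_source_processing):
        # Hypothetical stand-in for invoking the handler under test.
        return {"json_aware": json_aware_source_processing}

    def test_json_aware_source_processing_enabled(self):
        result = self._run_replacement(json_aware_source_processing=True)
        self.assertTrue(result["json_aware"])

    def test_json_aware_source_processing_disabled(self):
        result = self._run_replacement(json_aware_source_processing=False)
        self.assertFalse(result["json_aware"])
```

Separate methods also show up as two entries in the test report, which makes the default (no escaping) behaviour explicitly visible.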
This change enhances the Source.jsonData() method by replacing the boolean 'escape' parameter with a structured JsonProcessingOptions interface. This provides a more maintainable and extensible API that can accommodate additional JSON processing options in the future. The change:
- Adds a new JsonProcessingOptions interface with an 'escape' property
- Updates the jsonData method signature to accept this interface
- Updates all references in tests and documentation to use the new interface
Approving assuming the automatic checks will pass.
AWS CodeBuild CI Report
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).
Comments on closed issues and PRs are hard for our team to see.
The pull request linter fails with the following errors:

❌ Breaking changes in stable modules [s3-deployment] are disallowed.

If you believe this pull request should receive an exemption, please comment and provide a justification. A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.

✅ An exemption request has been made. Please wait for a maintainer's review.
Issue # (if applicable)
Closes #34002.
Reason for this change
The original fix for issue #22661 (PR #33698) introduced a regression where the S3 deployment Lambda would read entire files into memory to check if they're JSON. This approach works fine for small files but causes Lambda timeouts and memory issues with large files (10MB+). This is particularly problematic for customers deploying large assets to S3 buckets.
Description of changes
The S3 deployment Lambda handler was reading entire files into memory to check if they're JSON, causing timeouts and memory issues with large files (10MB+).
This change optimizes the S3 deployment Lambda handler to process files more efficiently by:
- Adding an escape parameter to the Source.jsonData method in order to control JSON escaping
- Adding a jsonEscape flag in MarkersConfig to control when special JSON escaping is needed

The implementation is backward compatible with the experience before PR #33698 was merged: it maintains the existing behavior by default but provides an opt-in mechanism for JSON escaping when needed. The opt-in mechanism is required for users who were benefitting from the escaping mechanism introduced by PR #33698.
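A minimal sketch of the opt-in escaping described above, assuming a dict-shaped markers config carrying the jsonEscape flag. The names here are illustrative, not necessarily the handler's actual identifiers:

```python
import json

def prepare_marker_value(value, markers_config):
    # When the source opted in via the jsonEscape flag, serialize the
    # value with the JSON encoder and strip the surrounding quotes so
    # it can be spliced into a JSON document safely. Otherwise the raw
    # value is substituted as before (the backward-compatible default).
    if markers_config.get("jsonEscape", False):
        return json.dumps(value)[1:-1]
    return value

# Opted in: quotes in the value are escaped for safe JSON embedding.
assert prepare_marker_value('he said "hi"', {"jsonEscape": True}) == 'he said \\"hi\\"'
# Default: the value passes through unchanged, as before PR #33698.
assert prepare_marker_value('he said "hi"', {}) == 'he said "hi"'
```

This is why the flag is per-source: only sources that embed untrusted strings into JSON documents pay the escaping cost.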
Describe any new or updated permissions being added
No new or updated IAM permissions are required for this change.
Description of how you validated changes
A new integration test (integ.bucket-deployment-large-file.ts) reproduces the issue with large files. The integration test specifically validates that large files (10MB+) can be successfully deployed without memory issues, ensuring the fix works in real-world scenarios.
Checklist
BREAKING CHANGE: The automatic JSON escaping behavior introduced in PR #33698 is now opt-in via the new escape parameter in Source.jsonData(). Users who were relying on the automatic JSON escaping for handling special characters in JSON files will need to explicitly enable this behavior by passing { escape: true } as the third parameter.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license