fix(s3-deployment): optimize memory usage for large files #34020

Merged
merged 18 commits into aws:main
Apr 15, 2025

Conversation

scorbiere
Contributor

@scorbiere scorbiere commented Apr 2, 2025

Issue # (if applicable)

Closes #34002.

Reason for this change

The original fix for issue #22661 (PR #33698) introduced a regression where the S3 deployment Lambda would read entire files into memory to check if they're JSON. This approach works fine for small files but causes Lambda timeouts and memory issues with large files (10MB+). This is particularly problematic for customers deploying large assets to S3 buckets.

Description of changes

The S3 deployment Lambda handler was reading entire files into memory to check if they're JSON, causing timeouts and memory issues with large files (10MB+).

This change optimizes the S3 deployment Lambda handler to process files more efficiently by:

  1. Adding an early return when there are no markers to replace
  2. Processing all files line by line, which is much more memory-efficient than loading the full JSON in memory
  3. Adding an optional escape parameter to the Source.jsonData method in order to control JSON escaping
  4. Using the jsonEscape flag in MarkersConfig to control when special JSON escaping is needed

These changes ensure that:

  • Files without markers are processed instantly
  • Files with markers are processed line by line, minimizing memory usage
  • Special JSON escaping is only applied when explicitly requested
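
As an illustration of points 1 and 2 above, a minimal line-by-line marker replacement might look like the following. This is a simplified sketch, not the handler's actual code; the function signature, the `json_escape` flag, and the temporary-file strategy are assumptions:

```python
import json
import os
import re

def replace_markers(filename, markers, json_escape=False):
    # Early return: files without markers are left untouched.
    if not markers:
        return

    pattern = re.compile("|".join(re.escape(m) for m in markers))

    def substitute(match):
        value = markers[match.group(0)]
        if json_escape:
            # json.dumps yields a quoted JSON string; strip the outer
            # quotes so the escaped value can be spliced into the line.
            value = json.dumps(value)[1:-1]
        return value

    # Stream the file one line at a time instead of loading it whole.
    outfile = filename + ".new"
    with open(filename, "r") as src, open(outfile, "w") as dst:
        for line in src:
            dst.write(pattern.sub(substitute, line))
    os.replace(outfile, filename)
```

With `json_escape=False` (the default) marker values are substituted verbatim, matching the pre-#33698 behavior; `json_escape=True` opts in to the escaping.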

The implementation is backward compatible with the behavior that existed before PR #33698 was merged: it maintains that pre-existing behavior by default and provides an opt-in mechanism for JSON escaping when needed. The opt-in is required for users who were benefiting from the escaping mechanism introduced by PR #33698.

Describe any new or updated permissions being added

No new or updated IAM permissions are required for this change.

Description of how you validated changes

  • Created an integration test (integ.bucket-deployment-large-file.ts) that reproduces the issue with large files
  • Implemented local testing to verify the fix with both small and large files
  • Added memory limit assertions to ensure memory usage stays within acceptable bounds
  • Conducted performance testing with isolated test runs to measure memory usage across various file types and sizes

The integration test specifically validates that large files (10MB+) can be successfully deployed without memory issues, ensuring the fix works in real-world scenarios.

Checklist

BREAKING CHANGE: The automatic JSON escaping behavior introduced in PR #33698 is now opt-in via the new escape parameter in Source.jsonData(). Users who were relying on the automatic JSON escaping for handling special characters in JSON files will need to explicitly enable this behavior by passing { escape: true } as the third parameter.


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

The S3 deployment Lambda handler was reading entire files into memory to check
if they're JSON, causing timeouts and memory issues with large files (10MB+).

This change improves memory efficiency by:
1. Using file extension to determine if a file is JSON instead of reading its content
2. Only loading the entire file for JSON processing when necessary
3. Maintaining special handling for JSON files with double quotes in markers

Performance testing shows:
- Standard operations stay under 32MB memory usage
- Even complex JSON files with double quotes stay under 256MB
- Processing time is comparable to the previous implementation
@aws-cdk-automation aws-cdk-automation requested a review from a team April 2, 2025 17:26
@github-actions github-actions bot added bug This issue is a bug. p0 labels Apr 2, 2025
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Apr 2, 2025
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment


(This review is outdated)


codecov bot commented Apr 2, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.98%. Comparing base (74cbe27) to head (c34c6b9).
Report is 35 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #34020   +/-   ##
=======================================
  Coverage   83.98%   83.98%           
=======================================
  Files         120      120           
  Lines        6976     6976           
  Branches     1178     1178           
=======================================
  Hits         5859     5859           
  Misses       1005     1005           
  Partials      112      112           
Flag Coverage Δ
suite.unit 83.98% <ø> (ø)


Components Coverage Δ
packages/aws-cdk ∅ <ø> (∅)
packages/aws-cdk-lib/core 83.98% <ø> (ø)

@scorbiere
Contributor Author

Exemption Request: The issue involves memory usage with large files (10MB+), and adding an integration test would require adding such files to the repository. Instead, I've implemented comprehensive local testing with Docker-based isolated test runs that measure memory usage across various file types and sizes. This approach provides more targeted performance metrics than a standard integration test could, as it includes specific memory instrumentation that captures the exact behavior being fixed. The detailed performance metrics included in this PR demonstrate the effectiveness of the solution without needing to commit large test files to the repository.

@aws-cdk-automation aws-cdk-automation added the pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. label Apr 2, 2025
$DOCKER_CMD build .
$DOCKER_CMD build --no-cache -t s3-deployment-test .

$DOCKER_CMD run --rm s3-deployment-test
Contributor


How and where/when does this test get run?

Contributor Author


They are run manually by the developer when testing changes to the custom resource's code.

@rehos

rehos commented Apr 3, 2025

I also ran into the memory issue and we don't use any markers. I almost did a PR until I saw this one.

I would like to suggest moving the check for no markers from replace_markers to extract_and_replace_markers.

# extract archive and replace markers in output files
def extract_and_replace_markers(archive, contents_dir, markers):
    with ZipFile(archive, "r") as zip:
        zip.extractall(contents_dir)

        # replace markers for this source if there are any markers
        if markers:
            for file in zip.namelist():
                file_path = os.path.join(contents_dir, file)
                if os.path.isdir(file_path): continue
                replace_markers(file_path, markers)
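
A self-contained way to see the effect of this suggestion is the sketch below, with a stub `replace_markers` standing in for the real routine and illustrative file names; when there are no markers, the per-file loop is skipped entirely:

```python
import os
import tempfile
from zipfile import ZipFile

calls = []

def replace_markers(file_path, markers):
    # Stub standing in for the real marker-replacement routine.
    calls.append(file_path)

# The suggested restructuring: check for markers once, before iterating.
def extract_and_replace_markers(archive, contents_dir, markers):
    with ZipFile(archive, "r") as zip:
        zip.extractall(contents_dir)
        if markers:
            for file in zip.namelist():
                file_path = os.path.join(contents_dir, file)
                if os.path.isdir(file_path):
                    continue
                replace_markers(file_path, markers)

# Build a small archive and extract it with no markers.
workdir = tempfile.mkdtemp()
archive = os.path.join(workdir, "asset.zip")
with ZipFile(archive, "w") as zf:
    zf.writestr("config.json", '{"key": "value"}')

extract_and_replace_markers(archive, os.path.join(workdir, "out"), {})
print(calls)  # → []
```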

…memory usage

This change addresses issue aws#34002 where the S3 deployment Lambda function experiences memory issues with large JSON files. The fix:

- Adds a new `jsonAwareSourceProcessing` boolean property to BucketDeploymentProps
- Implements efficient marker detection using line-by-line processing
- Optimizes memory usage by only loading entire files when necessary
- Updates tests to verify both processing modes
- Updates documentation with usage examples and memory considerations

By default, the property is set to false for backward compatibility.
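
The "efficient marker detection using line-by-line processing" mentioned above can be sketched as a streaming scan that never loads the whole file. The function name and the marker format below are assumptions for illustration:

```python
import os
import tempfile

def file_contains_marker(path, markers):
    # Stream the file; return at the first line containing any marker,
    # so files without markers are never fully loaded into memory.
    with open(path, "r") as f:
        for line in f:
            if any(marker in line for marker in markers):
                return True
    return False

# Demo with small temporary files.
workdir = tempfile.mkdtemp()
plain = os.path.join(workdir, "plain.json")
marked = os.path.join(workdir, "marked.json")
with open(plain, "w") as f:
    f.write('{"key": "value"}\n')
with open(marked, "w") as f:
    f.write('{"key": "<<marker:0xbaba:0>>"}\n')

print(file_contains_marker(plain, ["<<marker:0xbaba:0>>"]))   # → False
print(file_contains_marker(marked, ["<<marker:0xbaba:0>>"]))  # → True
```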
@aws-cdk-automation aws-cdk-automation dismissed their stale review April 7, 2025 19:31

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

@scorbiere
Contributor Author

@rehos

I would like to suggest moving the check for no markers from replace_markers to extract_and_replace_markers

That's a valid suggestion. However, I want to be able to test the changes without having to create a zip file, so I will keep the check at the beginning of replace_markers.

chunk += ` "name": "Item ${lineNum}",\n`;
chunk += ` "hash": "${crypto.createHash('sha256').update(lineNum.toString()).digest('hex')}",\n`;
chunk += ' "properties": {\n';
chunk += ` "description": "${lineContent.replace(/"/g, '\\"')}",\n`;

Check failure

Code scanning / CodeQL

Incomplete string escaping or encoding High test

This does not escape backslash characters in the input.
chunk += ` }${bytesWritten + chunk.length + 6 < totalBytes ? ',\n' : '\n'}`;
} else {
// Simple items for the rest
chunk += ` { "id": "${lineNum}", "value": "${lineContent.replace(/"/g, '\\"')}" }${bytesWritten + chunk.length + 6 < totalBytes ? ',\n' : '\n'}`;

Check failure

Code scanning / CodeQL

Incomplete string escaping or encoding High test

This does not escape backslash characters in the input.
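
The flagged pattern escapes double quotes but leaves backslashes alone, which is exactly what CodeQL is warning about. The principle is easy to demonstrate in Python (a sketch, unrelated to the PR's own code): a real JSON serializer escapes both characters, while the hand-rolled replacement can produce text that no longer parses:

```python
import json

def naive_escape(s):
    # Mirrors the flagged pattern: escapes double quotes but not backslashes.
    return s.replace('"', '\\"')

def parses(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# A value containing both backslashes and double quotes.
tricky = 'path\\to\\"file"'

# Hand-rolled quoting leaves the lone backslashes behind, so the
# surrounding document can contain invalid or reinterpreted escapes.
naive = '{"description": "%s"}' % naive_escape(tricky)

# json.dumps escapes backslashes and quotes together.
safe = '{"description": %s}' % json.dumps(tricky)

print(parses(naive), parses(safe))  # → False True
print(json.loads(safe)["description"] == tricky)  # → True
```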
scorbiere and others added 7 commits April 10, 2025 09:24
We now use a safe JSON value for marker replacement and avoid loading the complete JSON file into memory.

This change also adds a way to trigger the new processing when creating the source data. This limits the processing to the asset level instead of the deployment level.
@@ -197,7 +197,7 @@
}
},
"flattenResponse": "false",
"salt": "1743042983138"
"salt": "1744306931931"
Contributor


Why is this changing?

Contributor Author


This is set by the assertions framework:

@@ -1,28 +1,29 @@
{
"version": "39.0.0",
"version": "41.0.0",
Contributor


Is this changing as expected? If so, why?

I want to make sure this PR is not unintentionally updating some dependencies

Contributor Author


I believe this version is updated every time we update an integ test snapshot, but I cannot tell where its value is coming from.

@@ -41,7 +41,7 @@ new CfnOutput(stack, 'SecretValue', { value: tokenizedValue });
// Add new file with secret value that needs proper escaping
const file6 = Source.jsonData('my-json/secret-config.json', {
secret_value: tokenizedValue, // Using the tokenized value explicitly
});
}, true);
Contributor


2 points:

  1. I think we should take a configuration/options object instead of a boolean, like so:

    const file6 = Source.jsonData('my-json/secret-config.json', {
      secret_value: tokenizedValue, // Using the tokenized value explicitly
    }, {escape: true }) // <-- probably need a better name than just `escape`?

    This is so that in case we have more options or flags in the future, we do not have a long parameter list that may be confusing.

  2. We need a test for the default behaviour as well (which I believe is no escaping).


// Log file size for verification
const stats = fs.statSync(filePath);
process.stderr.write(`Generated ${filePath} with size: ${(stats.size / (1024 * 1024)).toFixed(2)}MB\n`);
Contributor


Why do we write to stderr? Same comment to the similar lines at the bottom of this file

Contributor Author


It was for testing purposes. I will remove it since it doesn't show by default when running the integ test (that is, without the -vvv option).

});

// Verify the content is valid JSON and properly escaped
assertionProvider.expect(ExpectedResult.objectLike({
Contributor


This is the same test as the one in packages/@aws-cdk-testing/framework-integ/test/aws-s3-deployment/test/integ.bucket-deployment-data.ts (higher up in this file)?

Contributor Author


Yes

content = file.read().rstrip()
jsonObject = json.loads(content) # valid json
self.assertEqual(content, '{"root": {"key_with_quote": "prefixvalue\\"with\\"quotessuffix", "key_without_quote": "prefixboom-marker2-replacedsuffix"}}')
for json_aware_source_processing in [True, False]:
Contributor


nit: I would recommend breaking this test into two so it is easier to understand.

@samson-keung samson-keung added pr-linter/no-exemption The requested exemption will not be granted to the PR linter result and removed pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. labels Apr 14, 2025
This change enhances the Source.jsonData() method by replacing the boolean 'escape' parameter with a structured JsonProcessingOptions interface. This provides a more maintainable and extensible API that can accommodate additional JSON processing options in the future.

The change:
- Adds a new JsonProcessingOptions interface with an 'escape' property
- Updates the jsonData method signature to accept this interface
- Updates all references in tests and documentation to use the new interface
@scorbiere scorbiere marked this pull request as ready for review April 14, 2025 22:31
Contributor

@samson-keung samson-keung left a comment


Approving assuming the automatic checks will pass.

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: c36a6e1
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Contributor

mergify bot commented Apr 15, 2025

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify mergify bot merged commit 7d82072 into aws:main Apr 15, 2025
13 of 15 checks passed
Contributor

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 15, 2025
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment


The pull request linter fails with the following errors:

❌ Breaking changes in stable modules [s3-deployment] are disallowed.

If you believe this pull request should receive an exemption, please comment and provide a justification. A comment requesting an exemption should contain the text Exemption Request. Additionally, if clarification is needed, add Clarification Request to a comment.

✅ An exemption has been requested. Please wait for a maintainer's review.

Labels
bug This issue is a bug. contribution/core This is a PR that came from AWS. p0 pr-linter/no-exemption The requested exemption will not be granted to the PR linter result

Successfully merging this pull request may close these issues.

(s3-deployment): times out with large files
5 participants