
Conversation

vinayvija

When using S3 as the Hadoop file system we encounter a "resource changed on src filesystem" error. The reason, I think, is that in the S3 implementation file timestamps change, which does not happen in HDFS. Timestamp doc
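For illustration: S3A implements rename as copy-plus-delete, so a renamed object's modification time becomes the time of the copy, whereas HDFS preserves it across a rename. A minimal sketch to observe the difference, using a hypothetical path passed on the command line (run it once against an hdfs:// path and once against an s3a:// path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MtimeAfterRename {
  public static void main(String[] args) throws Exception {
    // Hypothetical path, e.g. s3a://some-bucket/tmp/file or hdfs:///tmp/file
    Path src = new Path(args[0]);
    Path dst = new Path(args[0] + ".moved");
    FileSystem fs = src.getFileSystem(new Configuration());
    long before = fs.getFileStatus(src).getModificationTime();
    fs.rename(src, dst);
    long after = fs.getFileStatus(dst).getModificationTime();
    // On HDFS, before == after; on S3A the rename (copy + delete) resets it.
    System.out.println("before=" + before + ", after=" + after);
  }
}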

Tested this by deploying this branch in QA NA. The same jobs that failed with this error passed after this change.

Example error log

Application application_1722669525346_0001 failed 5 times due to AM Container for appattempt_1722669525346_0001_000005 exited with exitCode: -1000
Failing this attempt.Diagnostics: [2024-08-05 14:41:43.796]Resource s3a://hubspot-hadoop-fs-backfill-s3-h3-eu1-qa/mapred/staging/backfill-s3-h3-qa/Wkd60q0QYAePuklPUXszbRFkLCjyB03P/.staging/job_1722669525346_0001/libjars changed on src filesystem - expected: "2024-08-05T14:39:39.188+0000", was: "2024-08-05T14:41:43.775+0000", current time: "2024-08-05T14:41:43.775+0000"
java.io.IOException: Resource s3a://hubspot-hadoop-fs-backfill-s3-h3-eu1-qa/mapred/staging/backfill-s3-h3-qa/Wkd60q0QYAePuklPUXszbRFkLCjyB03P/.staging/job_1722669525346_0001/libjars changed on src filesystem - expected: "2024-08-05T14:39:39.188+0000", was: "2024-08-05T14:41:43.775+0000", current time: "2024-08-05T14:41:43.775+0000"
at org.apache.hadoop.yarn.util.FSDownload.verifyAndCopy(FSDownload.java:282)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:72)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:425)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:422)

@vinayvija vinayvija self-assigned this Aug 5, 2024
// Context: the check in org.apache.hadoop.yarn.util.FSDownload#verifyAndCopy
FileSystem sourceFs = sCopy.getFileSystem(conf);
FileStatus sStat = sourceFs.getFileStatus(sCopy);
if (sStat.getModificationTime() != resource.getTimestamp()) {
  throw new IOException("Resource " + sCopy + " changed on src filesystem" +
      " - expected: \"" + Times.formatISO8601(resource.getTimestamp()) + "\"" +
      ", was: \"" + Times.formatISO8601(sStat.getModificationTime()) + "\"" +
      ", current time: \"" + Times.formatISO8601(Time.now()) + "\"");
}


We talked about this briefly, but outright disabling this check makes me uneasy. It's there for a reason, and we might start introducing unknown behavior by allowing this. S3 does work differently here, but we should make the logic reflect that fact or find a way to make modification time work as expected.


I think the reason it's there is that the timestamp is used as a mechanism for checking that the right jar is in place to run the job and that nothing else has overwritten it.

So if we got rid of the timestamp check and instead did a checksum comparison, it would work better. A sketch of that idea follows.
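A rough sketch, for discussion only: it assumes an expectedChecksum recorded at submit time, which the LocalResource API does not actually carry (it only records a timestamp and a size), and FileSystem.getFileChecksum can return null on stores like S3A unless etag checksums are enabled.

// Hypothetical: compare content checksums instead of modification times.
// `expectedChecksum` is an invented field; LocalResource today records only
// a timestamp and a size, so it would have to be captured at submit time.
FileSystem sourceFs = sCopy.getFileSystem(conf);
FileChecksum actual = sourceFs.getFileChecksum(sCopy);
if (actual == null) {
  // Stores like S3A may not expose a checksum; fall back to the recorded size.
  if (sourceFs.getFileStatus(sCopy).getLen() != resource.getSize()) {
    throw new IOException("Resource " + sCopy + " changed on src filesystem");
  }
} else if (!actual.equals(expectedChecksum)) {
  throw new IOException("Resource " + sCopy + " changed on src filesystem");
}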

@vinayvija vinayvija (Author) Aug 15, 2024


I added a check to skip this validation for S3 file systems. Looking at some Stack Overflow threads, this seems safe.
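The diff itself isn't inlined in this thread, but the kind of guard being described might look roughly like this (a sketch, assuming the check lives where the FSDownload snippet above does):

// Hypothetical sketch: skip the modification-time comparison for S3 schemes,
// where object timestamps are not stable the way HDFS mtimes are.
String scheme = sCopy.toUri().getScheme();
boolean isS3 = "s3".equals(scheme) || "s3a".equals(scheme) || "s3n".equals(scheme);
if (!isS3 && sStat.getModificationTime() != resource.getTimestamp()) {
  throw new IOException("Resource " + sCopy + " changed on src filesystem");
}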

@vinayvija vinayvija (Author)

Tested this in the NA QA backfill-s3 cluster; the job ran fine.

@ddelong ddelong changed the title from Removetscheck to Remove Timestamp Check for S3 Aug 16, 2024

@ddelong ddelong left a comment


This is safer and we can go with this.

@vinayvija vinayvija merged commit f3312a2 into hubspot-3.3.6 Aug 19, 2024