Remove Timestamp Check for S3 #56
Conversation
FileSystem sourceFs = sCopy.getFileSystem(conf);
FileStatus sStat = sourceFs.getFileStatus(sCopy);
if (sStat.getModificationTime() != resource.getTimestamp()) {
  throw new IOException("Resource " + sCopy + " changed on src filesystem" +
We talked about this briefly, but outright disabling this check makes me uneasy. It's there for a reason, and we might introduce unknown behavior by removing it. S3 does work differently here, but we should make the logic reflect that fact or find a way to make modification time work as expected.
I think the reason it's there is that it uses the timestamp as a mechanism for checking that the right jar is in place to run the job and that something else hasn't overwritten it.
So if we got rid of the timestamp check and did a checksum comparison instead, it would work better.
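The suggestion above can be sketched with plain JDK APIs. This is a minimal illustration of the idea, not the PR's code: Hadoop itself would go through `FileSystem#getFileChecksum` (which may return null on filesystems that don't support checksums), and the `ChecksumCheck` class name and `sha256` helper here are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ChecksumCheck {
    // Compute a SHA-256 digest of the file's contents. Two files with the
    // same bytes produce the same digest even if their timestamps differ.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) {
                md.update(buf, 0, n);
            }
        }
        return HexFormat.of().formatHex(md.digest());
    }

    public static void main(String[] args) throws Exception {
        Path a = Files.createTempFile("resource", ".jar");
        Files.write(a, "same bytes".getBytes());
        Path b = Files.createTempFile("resource", ".jar");
        Files.write(b, "same bytes".getBytes());
        // Identical contents pass the check regardless of modification time.
        System.out.println(sha256(a).equals(sha256(b))); // prints "true"
    }
}
```

A content-based comparison is robust to object stores rewriting timestamps, at the cost of reading the resource once to hash it.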
I added a check to skip this for S3 file systems. Looking at some Stack Overflow threads, this seems safe.
Tested this in the NA QA backfill-s3 cluster; the job ran fine.
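A scheme-based skip like the one described could look roughly like this. This is a hedged sketch, not the PR's actual patch: the `SchemeSkip` class, the `shouldSkipTimestampCheck` helper, and the exact set of S3 schemes (`s3`, `s3a`, `s3n`) are assumptions for illustration.

```java
import java.net.URI;
import java.util.Set;

public class SchemeSkip {
    // Hypothetical: URI schemes for which the modification-time check is skipped.
    static final Set<String> SKIP_SCHEMES = Set.of("s3", "s3a", "s3n");

    // Returns true when the resource lives on an S3-style filesystem,
    // where object-store semantics can alter reported timestamps.
    static boolean shouldSkipTimestampCheck(URI resource) {
        String scheme = resource.getScheme();
        return scheme != null && SKIP_SCHEMES.contains(scheme.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(shouldSkipTimestampCheck(URI.create("s3a://bucket/app.jar"))); // true
        System.out.println(shouldSkipTimestampCheck(URI.create("hdfs://nn/app.jar")));    // false
    }
}
```

Keying the skip off the URI scheme keeps the existing HDFS behavior intact while relaxing the check only where S3's timestamp semantics make it unreliable.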
This is safer and we can go with this.
When using S3 as the Hadoop file system, we encounter a "resource changed on src filesystem" error. The reason, I think, is that in the S3 implementation file timestamps change in ways that do not happen in HDFS. Timestamp doc
Tested this by deploying this branch in QA NA. The same jobs that failed with this error passed after this change.
Example error log