Add backreference to S3A to prevent premature finalize #49
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
We converted our YARN clusters to use S3 for logs. In general, this worked, but when our users had large enough logs, two negative behaviors developed:
yarn logs
command, it would similarly fail.At first, we assumed there was something wrong with the logs in S3, but in all cases, they were there, so we were able to directly access and work around this, but that wasn't a long term solution. Logs in the job history server indicated that the
S3AFileSystem
object was closed during active operation. Incorrect ordering of these operations was suspected, but when diagnosed what was triggering the close, it was thefinalize()
method of theS3A
object.This was triggered from the
LogReader
class, which constructs aS3A
object in its constructor, but does not retain a reference outside it, keeping instead a stream object that uses a producedS3AFileSystem
object. This turns usage of thatLogReader
object into a race against the garbage collector as it decides to finalize theS3A
object, which in turn closes theS3AFileSystem
object. This doesn't happen forLogWriter
because it retains theFileContext
containing theS3A
object as a reference for the lifecycle of that object.Solution
Similar to apache#5780, I'm going to try add a back reference to address the issue of the
S3A
object being garbage collected while theLogReader
is still processing TFiles. I considered changing the logic of theLogReader
to hold a reference to this object, but that's fragile when the real problem is the assumption that the S3A object won't get reaped while created S3AFileSystem objects spawned from it are still in use.The only downside here is that if the
S3AFileSystem
objects are not closed properly, thefinalize()
will never happen for S3A to trigger cleanup. However, that's a diagnosable logic failure whereas the bug being fixed is unpredictable behavior and failure.Testing
We deployed an updated version of this to our JHS and RM to test if both the log browsing and
yarn logs
CLI worked successfully with sufficiently large logs. It did. We haven't done long term memory impact assessment yet.