Skip to content

Conversation

ddelong
Copy link

@ddelong ddelong commented Apr 18, 2024

Problem

We converted our YARN clusters to use S3 for logs. In general, this worked, but when our users had large enough logs, two negative behaviors developed:

  1. Some specific node manager logs wouldn't load, erroring out with no logs available for that container.
  2. When we tried to use the yarn logs command, it would similarly fail.

At first, we assumed there was something wrong with the logs in S3, but in all cases, they were there, so we were able to directly access and work around this, but that wasn't a long term solution. Logs in the job history server indicated that the S3AFileSystem object was closed during active operation. Incorrect ordering of these operations was suspected, but when diagnosed what was triggering the close, it was the finalize() method of the S3A object.

This was triggered from the LogReader class, which constructs a S3A object in its constructor, but does not retain a reference outside it, keeping instead a stream object that uses a produced S3AFileSystem object. This turns usage of that LogReader object into a race against the garbage collector as it decides to finalize the S3A object, which in turn closes the S3AFileSystem object. This doesn't happen for LogWriter because it retains the FileContext containing the S3A object as a reference for the lifecycle of that object.

Solution

Similar to apache#5780, I'm going to try add a back reference to address the issue of the S3A object being garbage collected while the LogReader is still processing TFiles. I considered changing the logic of the LogReader to hold a reference to this object, but that's fragile when the real problem is the assumption that the S3A object won't get reaped while created S3AFileSystem objects spawned from it are still in use.

The only downside here is that if the S3AFileSystem objects are not closed properly, the finalize() will never happen for S3A to trigger cleanup. However, that's a diagnosable logic failure whereas the bug being fixed is unpredictable behavior and failure.

Testing

We deployed an updated version of this to our JHS and RM to test if both the log browsing and yarn logs CLI worked successfully with sufficiently large logs. It did. We haven't done long term memory impact assessment yet.

@ddelong ddelong changed the title Diagnose s3a and try back reference Add backreference to S3A to prevent premature finalize Apr 19, 2024
@ddelong ddelong marked this pull request as ready for review April 19, 2024 13:03
@ddelong ddelong requested a review from charlesconnell April 19, 2024 13:03
@ddelong ddelong self-assigned this Apr 19, 2024
Copy link

@cathturner cathturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting! 👍

@vinayvija
Copy link

LGTM

@ddelong ddelong merged commit 2224a32 into hubspot-3.3.6 Apr 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants