Tracking cache requests #1566


Merged: 10 commits merged into fsspec:master on Apr 11, 2024

Conversation

@betolink (Contributor) commented Apr 4, 2024

As discussed in #1527, this PR adds a few extra variables to the cache implementations to keep track of bytes requested vs. cache hits. The numbers can be accessed directly or via the logs in debug mode.

This is also useful to benchmark the I/O behavior of the different cache implementations.
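The idea can be sketched as a minimal counter-instrumented cache. This is a simplified stand-in, not fsspec's actual block-based implementation; `total_requested_bytes` follows the PR's naming, while `hit_count`/`miss_count` and the class itself are illustrative assumptions:

```python
import logging

logger = logging.getLogger("sketch.cache")


class CountingCache:
    """Simplified read cache that tracks requested bytes and hit/miss counts.

    Illustrative only: fsspec's real caches are block-based and more involved.
    """

    def __init__(self, fetcher, size):
        self.fetcher = fetcher          # callable (start, end) -> bytes
        self.size = size
        self.total_requested_bytes = 0  # bytes callers asked this cache for
        self.hit_count = 0              # reads served entirely from cached data
        self.miss_count = 0             # reads that required a backend fetch
        self.start = self.end = None
        self.cache = b""

    def _fetch(self, start, end):
        end = min(end, self.size)
        self.total_requested_bytes += end - start
        if self.start is not None and start >= self.start and end <= self.end:
            self.hit_count += 1
        else:
            self.miss_count += 1
            self.cache = self.fetcher(start, end)
            self.start, self.end = start, end
            # lazy %-style logging, as requested in the review below
            logger.debug("fetched %i-%i (%i bytes)", start, end, end - start)
        return self.cache[start - self.start : end - self.start]
```

The counters can then be read directly for benchmarking or dumped via the debug log, which is the access pattern the PR description mentions.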

@betolink (Contributor, Author) commented Apr 9, 2024

Hi @martindurant! I think this PR is ready for a review in case you have time this week.

@martindurant (Member) left a comment:

Thank you, this looks good on the whole, particularly the set of tests.

I have some comments for you to consider.

 sstart = i * self.blocksize
 send = min(sstart + self.blocksize, self.size)
-logger.debug(f"MMap get block #{i} ({sstart}-{send}")
+self.total_requested_bytes += send - sstart
+logger.debug(f"MMap get block #{i} ({sstart}-{send})")
@martindurant (Member):

There are a few of these (not your code, but they show up in the diff due to small corrections). Debug calls of the form logger.debug("test %s", arg) are preferred, so that the string doesn't need to be built when debug logging is not enabled.
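The reviewer's point is that with %-style arguments the formatting cost is only paid when a DEBUG-level record would actually be emitted. A minimal illustration (logger name and values are made up):

```python
import logging

logger = logging.getLogger("fsspec.caching.example")
logger.setLevel(logging.INFO)  # DEBUG records are discarded

i, sstart, send = 3, 12288, 16384

# Eager: the f-string is built every time, even though the
# record is thrown away at INFO level.
logger.debug(f"MMap get block #{i} ({sstart}-{send})")

# Lazy (preferred): the message template and arguments are passed
# separately, and interpolation only happens if the record is emitted.
logger.debug("MMap get block #%i (%i-%i)", i, sstart, send)
```

Both calls are no-ops here, but the first still pays for string construction on every invocation, which matters on hot read paths like a cache fetch.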

part = self.cache[start:end]
if end > self.blocksize:
    self.total_requested_bytes += end - self.blocksize
@martindurant (Member):

Same comment: isn't the number of bytes returned by the fetcher the more important number? If we request 100 bytes, we might only get 10 back; the calling code will probably then request more. On the other hand, the various implementations of the request code might keep making new requests until enough bytes arrive.

@betolink (Contributor, Author):

Same as above, perhaps we are missing another variable to keep track of the total_bytes_returned?
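The suggestion could be realized as a thin wrapper around the fetcher that counts what actually comes back. Note that `total_bytes_returned` is the hypothetical name floated in this thread, not an attribute fsspec actually exposes, and the wrapper class is an illustrative sketch:

```python
class ByteCountingFetcher:
    """Wrap a (start, end) -> bytes fetcher and count what it really returns.

    Sketch only: 'total_bytes_returned' is the name proposed in the
    discussion, not an existing fsspec attribute.
    """

    def __init__(self, fetcher):
        self.fetcher = fetcher
        self.total_requested_bytes = 0  # what callers asked for
        self.total_bytes_returned = 0   # what the backend actually delivered

    def __call__(self, start, end):
        self.total_requested_bytes += end - start
        data = self.fetcher(start, end)
        # A short read is possible: the backend may return fewer bytes
        # than requested, which is exactly the reviewer's concern above.
        self.total_bytes_returned += len(data)
        return data
```

Comparing the two counters after a benchmark run would show how often the backend short-reads relative to what the cache asked for.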

@@ -15,26 +21,124 @@ def test_cache_getitem(Cache_imp):


def test_block_cache_lru():
    cache = BlockCache(4, letters_fetcher, len(string.ascii_letters), maxblocks=2)
    """
@martindurant (Member):

This is a comment, not a docstring
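The distinction the reviewer is drawing: a string literal is only a docstring when it is the first statement in a function body; placed after other statements it is just a discarded expression, invisible to help() and __doc__. A small illustration (function names are made up for the example):

```python
def with_docstring():
    """I am a docstring."""
    x = 1
    return x


def with_stray_string():
    x = 1
    """I am just a discarded string expression; use a # comment instead."""
    return x


assert with_docstring.__doc__ == "I am a docstring."
assert with_stray_string.__doc__ is None
```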

@martindurant (Member):

I can't fix your linter failure, since the PR is from your master branch. You can run pre-commit locally, e.g.,

$ pre-commit install
$ pre-commit run -a

@betolink (Contributor, Author):

Thanks @martindurant! I'll work on some minor changes now and will run linting locally before the next push. About modifying code on external forks: it's possible! I think you could pull/edit the PR directly, as long as the PR author checks the "Allow edits and access to secrets by maintainers" box.

@betolink (Contributor, Author):

OK, now it should pass, and I also addressed the comments on log formatting for the cache implementations that do nothing, @martindurant.

@martindurant martindurant merged commit 05e7d80 into fsspec:master Apr 11, 2024
11 checks passed