@falko17 falko17 commented Dec 11, 2025

This adds a _hash_cache to repos that stores computed hashes and metadata in memory while a command runs. This way, the hash does not need to be recomputed in every step and _build can be skipped (at least for dependencies; outputs obviously still need to update the cache).

It also adds some tests for the basic cases (a cached hash can be reused, a hash must be overwritten, and the hash must be reset when the repo is reset).
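To make the idea concrete, here is a minimal sketch of such a per-command cache, not the code from this PR. `SketchRepo`, `get_hash`, and the `compute` callback are hypothetical names chosen for illustration; only the `_hash_cache` attribute name comes from the description above. It assumes dependencies may read from the cache, outputs always recompute and overwrite, and a repo reset clears everything.

```python
from typing import Callable, Dict


class SketchRepo:
    """Hypothetical stand-in for a repo object; not DVC's actual Repo class."""

    def __init__(self) -> None:
        # Maps a path to the hash computed for it during the current command.
        self._hash_cache: Dict[str, str] = {}

    def reset(self) -> None:
        # Anything cached before a reset may be stale, so drop it all.
        self._hash_cache.clear()


def get_hash(
    repo: SketchRepo,
    path: str,
    compute: Callable[[str], str],
    is_output: bool = False,
) -> str:
    """Return the hash for `path`, reusing the cached value for dependencies."""
    if not is_output:
        cached = repo._hash_cache.get(path)
        if cached is not None:
            return cached
    # Outputs (and cache misses) recompute and overwrite the cached entry.
    value = compute(path)
    repo._hash_cache[path] = value
    return value
```

The three test cases listed above map onto this sketch directly: a second dependency lookup skips `compute`, an output lookup overwrites the entry, and `reset()` forces the next lookup to recompute.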

@github-project-automation github-project-automation bot moved this to Backlog in DVC Dec 11, 2025
@CLAassistant
CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ falko17
❌ MeganKW

codecov bot commented Dec 11, 2025

Codecov Report

❌ Patch coverage is 95.00000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 91.01%. Comparing base (2431ec6) to head (b2e55e6).
⚠️ Report is 164 commits behind head on main.

Files with missing lines    Patch %    Lines
dvc/output.py               84.61%     0 Missing and 2 partials ⚠️
dvc/repo/__init__.py        75.00%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10927      +/-   ##
==========================================
+ Coverage   90.68%   91.01%   +0.33%     
==========================================
  Files         504      505       +1     
  Lines       39795    41027    +1232     
  Branches     3141     3248     +107     
==========================================
+ Hits        36087    37340    +1253     
- Misses       3042     3047       +5     
+ Partials      666      640      -26     

skshetry (Collaborator) commented Dec 12, 2025

What are you trying to optimize for? Are you trying to reduce multiple build() calls within a single stage run, or across a complete repro session?

If the former, I think we should reorganize/rearchitect so that we don't make duplicate calls, which would avoid the need for caching.

If you are optimizing for the whole session, it may not be safe, because the data could have been modified in the meantime, maybe by the user or by a different dvc process running separately.
(I don't think your code even works for that scenario, because we reset the repo after each run.)

DVC does not re-hash files, so successive build() calls will only stat the directory/files unless their mtime/inode/size changes, which should be fast enough on modern machines. If stat is slow, then everything will be slower for you.
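As an illustration of that point, the sketch below reuses a stored digest as long as a file's mtime, inode, and size are unchanged, so a repeat call costs one `os.stat`. This is not DVC's implementation: `StatKeyedHasher` and `rehash_count` are hypothetical names, the state is kept in memory for brevity, and SHA-256 is an arbitrary choice of hash for the example.

```python
import hashlib
import os
from typing import Dict, Tuple

# (mtime in nanoseconds, inode number, size in bytes)
StatKey = Tuple[int, int, int]


class StatKeyedHasher:
    """Hypothetical helper: re-reads a file only when its stat key changes."""

    def __init__(self) -> None:
        self._state: Dict[str, Tuple[StatKey, str]] = {}
        self.rehash_count = 0  # how many times file contents were actually read

    def hash_file(self, path: str) -> str:
        st = os.stat(path)
        key: StatKey = (st.st_mtime_ns, st.st_ino, st.st_size)
        entry = self._state.get(path)
        if entry is not None and entry[0] == key:
            # Nothing observable changed: reuse the digest after a single stat.
            return entry[1]
        with open(path, "rb") as fobj:
            digest = hashlib.sha256(fobj.read()).hexdigest()
        self.rehash_count += 1
        self._state[path] = (key, digest)
        return digest
```

With a check like this in place, an extra in-memory cache on top would save only the stat call, not the hashing.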

If you can share perf numbers (before and after), that would be great. But I'd avoid adding caching as much as possible.
