Bugfix & Enhancement GpuInfo and CpuInfo to threaded implementation
#820
base: master
Conversation
- bugfix of the current `GpuInfo`, which only logs GPU downtime utilization at `Events.ITERATION_COMPLETED`
- adding CPU info logging
@DrStoop AFAIK, concerning GPU utilization, yes, it would probably be tricky to catch it as a metric. Maybe there could be an option to request a mean value or something...
Thanks for the PR, I think we need to compare both implementations to see the diffs. Do we care about CPU usage?
Depends. Most OSs have several live CPU loggers integrated anyway that are good enough for debugging, so it's not the most important feature. Generally the CPU becomes a bottleneck e.g. for data pre-processing or data handling. If you're working in a pipeline with "live" pre-processing, this may become the relevant bottleneck, or data may get shoveled around in an (unwanted/unrecognized) inefficient way... The combination of both GPU & CPU info can be helpful to understand why the GPU is not used to capacity while the CPU is running single-cored at 101%. If you're running batchwise pre-processing of data, this would probably be a necessary feature, but then one would first have to think about a ... Nevertheless, the ... Conclusion: maybe nice to have, no real necessity. (Note: in the framework, it defaults to ...)
Yes, correct. Also I/O can be a bottleneck, such that GPU and CPU activity can both be ~0.
I see your point. Let me take a look at your code and I'll comment...
DrStoop
left a comment
Sorry, the first bugs came to my mind... the pytest wasn't running for other reasons.
tb_logger.attach(trainer,
                 log_handler=OutputHandler(tag="training", metric_names='all'),
                 event_name=Events.ITERATION_COMPLETED)

class GpuPynvmlLogger(Thread):
Already discovered the first bugs... the pytest wasn't running for other reasons:
GpuPynvmlLogger -> GpuInfo
def attach(self, engine, name="gpu", event_name=Events.ITERATION_COMPLETED):
    engine.add_event_handler(event_name, self.completed, name)

def __init__(self, logger_directory, logger_name='GPULogger', log_interval_seconds=1, unit='GB'):
    super(GpuPynvmlLogger, self).__init__(name=logger_name, daemon=True)
super(GpuPynvmlLogger, self) -> super()
# Close tensorboard logger
self._tb_logger.close()
# Join thread
self.join()
Forgot the attach method, e.g.:
def attach(self, engine, name="gpu", event_name_started=Events.STARTED, event_name_completed=Events.COMPLETED):
    engine.add_event_handler(event_name_started, self.start, name)
    engine.add_event_handler(event_name_completed, self.close, name)
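For context, a hypothetical usage sketch with such an attach method; the class name follows the rename suggested above, the constructor arguments follow this PR's `__init__`, and `trainer` is assumed to be an existing `Engine`:

```python
from ignite.engine import Events

# Hypothetical wiring, assuming the threaded GpuInfo from this PR and the
# attach() method suggested above; `trainer` is an existing ignite Engine.
gpu_logger = GpuInfo(logger_directory="./logs", log_interval_seconds=1, unit="GB")

# Start the logger thread when training starts, close/join it when it completes.
gpu_logger.attach(trainer,
                  event_name_started=Events.STARTED,
                  event_name_completed=Events.COMPLETED)
```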
DrStoop
left a comment
Same issue as with gpu_info.py...
# Close tensorboard logger
self._tb_logger.close()
# Join thread
self.join()
Forgot the attach method, e.g.:
def attach(self, engine, name="gpu", event_name_started=Events.STARTED, event_name_completed=Events.COMPLETED):
    engine.add_event_handler(event_name_started, self.start, name)
    engine.add_event_handler(event_name_completed, self.close, name)
As far as I understand, the current `GpuInfo` is a `Metric` attached to `Events.ITERATION_COMPLETED`, meaning it logs the current GPU utilization during its downtime, when the model is neither being updated nor inferring because `Engine().process_function()` has completed. Therefore the measurement is not representative of the actual GPU usage during training. Or did I miss anything?
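For illustration, a rough sketch of how the current `GpuInfo` metric on master is typically used (assuming an existing `trainer` engine; the metric names shown are approximate), which is why the reading happens between iterations:

```python
# Rough sketch of current usage, to illustrate the timing issue: the metric is
# computed at Events.ITERATION_COMPLETED, i.e. after process_function() has
# returned, so the GPU may already be idle when utilization is queried.
from ignite.contrib.metrics import GpuInfo
from ignite.engine import Events

GpuInfo().attach(trainer, name='gpu')  # exposes e.g. 'gpu:0 mem(MB)', 'gpu:0 util(%)'

@trainer.on(Events.ITERATION_COMPLETED)
def print_gpu_metrics(engine):
    # Values here reflect the gap between iterations, not the training load.
    print(engine.state.metrics)
```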
An alternative quick fix to the suggestion below for the current `GpuInfo`, at least for the memory logging, would be to replace `self.nvsmi.DeviceQuery("memory.used")` with `torch.cuda.max_memory_allocated()`, which would return the maximum GPU memory used in each iteration (don't forget to also reset it after logging).
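A minimal sketch of that quick fix as a plain event handler (a sketch only: it assumes a single CUDA device, an existing `trainer` engine, a PyTorch version that provides `torch.cuda.reset_max_memory_allocated()`, and a made-up metric name):

```python
import torch
from ignite.engine import Events

@trainer.on(Events.ITERATION_COMPLETED)
def log_peak_gpu_memory(engine):
    # Peak memory allocated by tensors since the last reset, converted to MB.
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    engine.state.metrics['gpu:0 max mem(MB)'] = peak_mb
    # Reset the peak counter so the next iteration is measured on its own.
    torch.cuda.reset_max_memory_allocated()
```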
Description:
The suggested `GpuInfo` and `CpuInfo` both run on independent threads, logging the hardware at a user-defined time interval. This should basically lead to random sampling of GPU/CPU uptime and downtime usage, which represents the actual use much better.
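Not the PR's actual implementation, but a rough sketch of the threaded sampling idea for the CPU side, using `psutil` on a daemon thread with a fixed interval and start/close wired to engine events:

```python
import time
import threading

import psutil
from ignite.engine import Events


class CpuInfoThread(threading.Thread):
    """Samples system-wide CPU utilization in the background at a fixed interval."""

    def __init__(self, log_interval_seconds=1.0):
        super().__init__(name='CPULogger', daemon=True)
        self._interval = log_interval_seconds
        self._stop_event = threading.Event()
        self.samples = []  # replace with TensorBoard/file logging as needed

    def run(self):
        while not self._stop_event.is_set():
            # CPU utilization in percent since the previous call (non-blocking).
            self.samples.append(psutil.cpu_percent(interval=None))
            time.sleep(self._interval)

    def close(self):
        # Signal the sampling loop to stop, then wait for the thread to finish.
        self._stop_event.set()
        self.join()

    def attach(self, engine,
               event_name_started=Events.STARTED,
               event_name_completed=Events.COMPLETED):
        # Start sampling when the run starts, stop when it completes.
        engine.add_event_handler(event_name_started, lambda _: self.start())
        engine.add_event_handler(event_name_completed, lambda _: self.close())
```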