
Conversation

@DrStoop DrStoop commented Mar 2, 2020

As far as I understand, the current GpuInfo is a Metric attached to Events.ITERATION_COMPLETED, meaning it logs the current GPU utilization during its downtime, when the model is neither being updated nor inferring, because Engine().process_function() has already completed:

class Engine:
    ...
    def _run_once_on_dataset(self):
        ...
        self._fire_event(Events.ITERATION_STARTED)
        self.state.output = self._process_function(self, self.state.batch)
        self._fire_event(Events.ITERATION_COMPLETED)
        ...

Therefore the measurement is not representative of the actual GPU usage during training. Or did I miss anything?

An alternative quick-fix to the suggestion below for the current GpuInfo - at least for the memory logging - would be to replace self.nvsmi.DeviceQuery("memory.used") with torch.cuda.max_memory_allocated(), which returns the maximum GPU memory used during each iteration (don't forget to also reset the peak stats after logging).
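
A minimal sketch of that quick-fix, assuming an existing ignite trainer Engine; the handler and metric names are made up here, and on older PyTorch versions the reset call is torch.cuda.reset_max_memory_allocated():

    import torch
    from ignite.engine import Events

    @trainer.on(Events.ITERATION_COMPLETED)
    def log_peak_memory(engine):
        # peak memory held by the PyTorch caching allocator since the last reset (bytes)
        peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
        engine.state.metrics["gpu:0 max_mem(MB)"] = peak_mb
        # reset the peak counter so the next iteration starts from zero
        torch.cuda.reset_peak_memory_stats()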

Description:

The suggested GpuInfo and CpuInfo each run on an independent thread that logs the hardware at a user-defined time interval. This effectively samples GPU/CPU usage at random points across both busy and idle phases, which represents the actual utilization much better.
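
A minimal sketch of that sampling approach, assuming the raw pynvml bindings are available (class and attribute names here are illustrative, not necessarily those used in this PR):

    import time
    from threading import Thread

    import pynvml

    class GpuUtilSampler(Thread):
        """Samples GPU utilization on a daemon thread at a fixed interval."""

        def __init__(self, interval_seconds=1.0, device_index=0):
            super().__init__(daemon=True)
            self.interval_seconds = interval_seconds
            self.device_index = device_index
            self.samples = []          # list of (timestamp, utilization in percent)
            self._running = True

        def run(self):
            pynvml.nvmlInit()
            handle = pynvml.nvmlDeviceGetHandleByIndex(self.device_index)
            while self._running:
                util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                self.samples.append((time.time(), util.gpu))
                time.sleep(self.interval_seconds)
            pynvml.nvmlShutdown()

        def stop(self):
            self._running = False

Started at Events.STARTED and stopped at Events.COMPLETED, such a sampler catches the GPU both while it is busy inside process_function and while it idles, independently of the engine loop.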

DrStoop and others added 2 commits March 2, 2020 12:14
- bugfix of current `GpuInfo` which currently only logs GPU down time utilizations at `Events.ITERATION_COMPLETED`
- adding CPU info logging

vfdev-5 commented Mar 2, 2020

@DrStoop AFAIK torch.cuda.max_memory_allocated() does not show the same thing as nvidia-smi, so using pynvml is IMO better if we need to match what we see with nvidia-smi.

Concerning GPU utilization, yes, it would probably be tricky to catch it as a metric. Maybe there could be an option to request a mean value or something...
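
For reference, a small comparison of the two memory numbers (a sketch only, assuming a CUDA device, a recent PyTorch and the pynvml package; the gap comes from the caching allocator, the CUDA context and any other processes on the device):

    import torch
    import pynvml

    # allocate ~1 GB through the PyTorch caching allocator
    x = torch.empty(256, 1024, 1024, device="cuda")

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

    print("allocator, allocated  :", torch.cuda.memory_allocated() / 1024 ** 2, "MB")
    print("allocator, reserved   :", torch.cuda.memory_reserved() / 1024 ** 2, "MB")
    # whole-device figure as reported by nvidia-smi, including context and other processes
    print("nvidia-smi style used :", mem.used / 1024 ** 2, "MB")
    pynvml.nvmlShutdown()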


vfdev-5 commented Mar 2, 2020

The suggested GpuInfo and CpuInfo each run on an independent thread that logs the hardware at a user-defined time interval. This effectively samples GPU/CPU usage at random points across both busy and idle phases, which represents the actual utilization much better.

Thanks for the PR, I think we need to compare both implementations to see the diffs.

Do we care about CPU usage?


DrStoop commented Mar 2, 2020

Do we care about CPU usage?

It depends; most OSs have live CPU monitors integrated anyway that are good enough for debugging, so it's not the most important feature.

Generally the CPU becomes a bottleneck e.g. for data pre-processing or data handling. If you're working in a pipeline with "live" pre-processing, this may become the relevant bottleneck, or data may be shoveled around in an (unwanted/unrecognized) inefficient way... The combination of GPU & CPU info can be helpful to understand why the GPU is not used to capacity while the CPU is running single-cored at 101%.
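
For the CPU side, a minimal sketch of what such a CpuInfo sampler could log, assuming psutil is installed (the metric naming is illustrative):

    import psutil

    # per-core utilization in percent, averaged over a 1-second window
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    # one core pinned near 100% while the rest idle hints at single-threaded pre-processing
    print({"cpu:{} util(%)".format(i): p for i, p in enumerate(per_core)})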

If you're running batchwise pre-processing of data, this would probably be a necessary feature... but for that one would first have to think about a DataPreprocessorEngine ;-) E.g. in the transfer-learning-conv-ai you mentioned on Slack, the pre-processing is brutally single-cored & I rewrote it with multiprocessing/threading; I used the hardware logging there. But to be honest, the OS CPU tracker would have been enough.

Nevertheless, the *Info classes run their thread independently of the rest of the Engine, so you could start one even when only pre-processing (without a "live" pipe); you would just need to close the loop between the GpuInfo timestamps & the engine iterations.

Conclusion: maybe nice to have, but no real necessity.

(Note: in the framework, it defaults to the n_samples_ref of the first engine started in the state and is initialized by changing a single state object, e.g. state.default_config.logging_all_hardware_utilization = True; x_axis_ref can also be user-defined.)


vfdev-5 commented Mar 2, 2020

Generally the CPU becomes a bottleneck e.g. for data pre-processing or data handling. If you're working in a pipeline with "live" pre-processing, this may become the relevant bottleneck, or data may be shoveled around in an (unwanted/unrecognized) inefficient way... The combination of GPU & CPU info can be helpful to understand why the GPU is not used to capacity while the CPU is running single-cored at 101%.

Yes, correct. I/O can also be a bottleneck, such that GPU and CPU activity are both ~0.

Nevertheless, the *Info classes run their thread independently of the rest of the Engine, so you could start one even when only pre-processing (without a "live" pipe); you would just need to close the loop between the GpuInfo timestamps & the engine iterations.

I see your point. Let me take a look at your code and I'll comment...


@DrStoop DrStoop left a comment


Sorry, the first bugs already came to mind... pytest wasn't running, for other reasons.

    tb_logger.attach(trainer,
                     log_handler=OutputHandler(tag="training", metric_names='all'),
                     event_name=Events.ITERATION_COMPLETED)
class GpuPynvmlLogger(Thread):

Already discovered the first bugs... the pytest wasn't running for other reasons:
GpuPynvmlLogger -> GpuInfo

    def attach(self, engine, name="gpu", event_name=Events.ITERATION_COMPLETED):
        engine.add_event_handler(event_name, self.completed, name)
    def __init__(self, logger_directory, logger_name='GPULogger', log_interval_seconds=1, unit='GB'):
        super(GpuPynvmlLogger, self).__init__(name=logger_name, daemon=True)

super(GpuPynvmlLogger, self) -> super()

        # Close tensorboard logger
        self._tb_logger.close()
        # Join thread
        self.join()

Forgot the attach method, e.g.:

    def attach(self, engine, name="gpu", event_name_started=Events.STARTED, event_name_completed=Events.COMPLETED):
        engine.add_event_handler(event_name_started, self.start, name)
        engine.add_event_handler(event_name_completed, self.close, name)
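
Usage could then look roughly like this (a sketch only; the constructor arguments are taken from the __init__ shown above, and the final API may differ):

    gpu_info = GpuInfo(logger_directory="logs/hardware", log_interval_seconds=1)
    # starts the sampler thread at Events.STARTED and closes it at Events.COMPLETED
    gpu_info.attach(trainer)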


@DrStoop DrStoop left a comment


Same issue as with gpu_info.py...

        # Close tensorboard logger
        self._tb_logger.close()
        # Join thread
        self.join()

Forgot the attach method, e.g.:

    def attach(self, engine, name="gpu", event_name_started=Events.STARTED, event_name_completed=Events.COMPLETED):
        engine.add_event_handler(event_name_started, self.start, name)
        engine.add_event_handler(event_name_completed, self.close, name)
