-
Notifications
You must be signed in to change notification settings - Fork 4
SPC Bin Counters and mmap #35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job! Overall the patch looks good. There are some mishaps that I marked, and few pending questions but we're getting to the end of it. I do however, deplore a certain lack of documentation, and it will not be easy for follow up developers to continue the work here without that.
ompi/runtime/ompi_spc.c
Outdated
ompi_spc_value_t *bins; | ||
|
||
/* Denoted unlikely because counters will often be turned off. */ | ||
if( OPAL_UNLIKELY(IS_SPC_BIT_SET(ompi_spc_attached_event, event_id)) ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get this check outside the function and into the upper-level macro ? If we can, we can save the function call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've added the check into the macros.
opal/mca/btl/portals4/configure.m4
Outdated
@@ -11,7 +11,7 @@ | |||
# Copyright (c) 2004-2005 The Regents of the University of California. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should not be here ? Did you missed a merge or maybe you rebase according to OMPI master and the ICLDisco master is slightly behind ?
orte/mca/state/state.h
Outdated
@@ -3,6 +3,9 @@ | |||
* Copyright (c) 2011-2015 Los Alamos National Security, LLC. All rights |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no changes here.
orte/mca/state/orted/state_orted.c
Outdated
for(i = 0; i < node->num_procs; i++) { | ||
if(NULL != (pptr = (orte_proc_t*)opal_pointer_array_get_item(node->procs, i))) { | ||
if(fd[i] == -1) { | ||
rc = sprintf(sm_file, "%s" OPAL_PATH_SEP "spc_data.%s.%d.%d", shm_dir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am no sure I understand what this striking similar code is doing in the 2 cases. More precisely, why are you not doing a copy of the file for the first call?
ompi/runtime/ompi_spc.c
Outdated
} | ||
|
||
int rank = ompi_comm_rank(ompi_spc_comm), rc; | ||
char filename[64]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
64 might quickly become too short.
ompi/runtime/ompi_spc.c
Outdated
char filename[64]; | ||
|
||
if(ompi_mpi_spc_xml_string == NULL) { | ||
rc = sprintf(filename, "%s" OPAL_PATH_SEP "spc_data.%s.%d.%d.xml", shm_dir, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you set the length for the filename string please use snprintf
instead of sprintf
.
ompi/runtime/ompi_spc.c
Outdated
index = pvar->pvar_index - mpi_t_offset; | ||
if(index < 0) { | ||
char *shm_dir; | ||
if(0 == access("/dev/shm", W_OK)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use opal_shmem_mmap_backing_file_base_dir instead of the hardcoded /dev/shm ?
df83410
to
4b8f3e1
Compare
2bc1d39
to
a287bea
Compare
for(int i = 1; i < frag->num_segments; i++) { | ||
total_data += frag->segments[i].seg_len; | ||
} | ||
SPC_RECORD(OMPI_SPC_UNEXPECTED_QUEUE_DATA, -total_data); | ||
#endif | ||
SPC_RECORD(OMPI_SPC_UNEXPECTED_IN_QUEUE, -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alignment ?
orte/mca/state/orted/state_orted.c
Outdated
@@ -2,6 +2,9 @@ | |||
* Copyright (c) 2011-2017 Los Alamos National Security, LLC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this need to be moved somewhere else as there is no ORTE directory anymore.
bin counters and the mmap interface. Bin counters are designed to allow for more context in the counters so that one counter can have several sub-counters attached to it as bins with breakpoints to denote when each bin gets incremented. These are used extensively for collective algorithm counters which count the number of times each collective algorithm is called and under which circumstances small/large communicator/ message. The mmap interface creates a shared file using mmap so tools can attach to this and read SPC values directly using an XML file to point out where the data for each counter resides. This update also includes some bug fixes and quality of life improvements to the code. Signed-off-by: David Eberius <[email protected]>
are handled, adds new counters for internal queues such as counters for data usage and timers for out-of-sequence messages, and adds example code and documentation for SPCs. The watermark counter update reduces function call overhead by better merging the updates to the sentinel value and the watermark value. There are now three SPC example codes, one for accessing SPCs through MPI_T, one for accessing SPCs through the mmap interface, and one for parsing SPC snapshot datafiles and producing a heatmap plot of the counter value differences over time. This also adds documentation for SPCs in the form of a markdown file in the same directory as the SPC driver code. Signed-off-by: David Eberius <[email protected]>
favor of prrte. This requires an update to the SPC driver code and documentation and necessitates moving the SPC snapshot feature to another code region which will be in a separate pull request. Signed-off-by: David Eberius <[email protected]>
This is an update to the SPC code to add two major features: bin counters and the mmap interface. Bin counters are designed to allow for more context in the counters so that one counter can have several sub-counters attached to it as bins with breakpoints to denote when each bin gets incremented. These are used extensively for collective algorithm counters which count the number of times each collective algorithm is called and under which circumstances small/large communicator/message. The mmap interface creates a shared file using mmap so tools can attach to this and read SPC values directly using an XML file to point out where the data for each counter resides. This update also includes some bug fixes and quality of life improvements to the code.