Skip to content

Bug/220 spice makes bsk sims not thread safe #994

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

Natsoulas
Copy link
Contributor

Description

The changes implement thread safety for the SPICE interface to address GitHub issue #220, which reported deadlocks and data corruption when running parallel Monte Carlo simulations.

This solution uses a mutex and reference counting approach to protect SPICE kernel loading and unloading operations (inspired by Juan's suggestion comment on the issue):

  1. Added a static mutex kernel_manipulation_mutex in the SpiceInterface class
  2. Added a static reference counter kernel_reference_counter to track kernel usage
  3. Modified loadSpiceKernel() to only load kernels once and use reference counting
  4. Modified unloadSpiceKernel() to only unload kernels when no longer needed
  5. Protected clearKeeper() with the mutex
  6. Updated the destructor to properly unload kernels

This ensures that when multiple simulations run in parallel, they won't encounter race conditions or deadlocks when accessing SPICE kernels.

Verification

The changes were verified with a dedicated thread safety unit test:

  • Added test_spiceThreadSafety.py unit test in the SPICE interface's _UnitTest directory
  • The test creates multiple SpiceInterface instances in parallel and forces them to simultaneously load/unload kernels
  • It verifies no deadlocks occur and no kernel files are corrupted
  • No existing functionality was broken - all other SPICE-related tests continue to pass

The test specifically reproduces the conditions described in issue #220.

Documentation

Added code comments explaining the thread safety mechanism in:

  • spiceInterface.h for the static mutex and reference counter
  • spiceInterface.cpp for the mutex implementation and kernel reference counting logic

No existing documentation was invalidated.

Future work

  1. Monitor for the release of NAIF's thread-safe SPICE implementation, which may provide a better solution in the future (Juan's advice from a comment in the issue 220 page).

@Natsoulas Natsoulas requested a review from schaubh May 7, 2025 05:36
@Natsoulas Natsoulas self-assigned this May 7, 2025
@Natsoulas Natsoulas requested a review from a team as a code owner May 7, 2025 05:36
@Natsoulas Natsoulas added bug Something isn't working enhancement New feature or request labels May 7, 2025
@Natsoulas Natsoulas linked an issue May 7, 2025 that may be closed by this pull request
@Natsoulas Natsoulas moved this to 🏗 In progress in Basilisk May 7, 2025
@Natsoulas Natsoulas force-pushed the bug/220-spice-makes-bsk-sims-not-thread-safe branch from 4a1af4d to aeed048 Compare May 7, 2025 06:10
@Natsoulas Natsoulas force-pushed the bug/220-spice-makes-bsk-sims-not-thread-safe branch 2 times, most recently from def2f18 to 8f6bf30 Compare May 9, 2025 03:35
@Natsoulas Natsoulas force-pushed the bug/220-spice-makes-bsk-sims-not-thread-safe branch from 8f6bf30 to a9ef12b Compare May 10, 2025 18:18
@Natsoulas Natsoulas moved this from 🏗 In progress to 👀 In review in Basilisk May 10, 2025
Copy link
Contributor

@schaubh schaubh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice work! All tests passed on my system. Just some minor comments to address

import sys
import time
import hashlib
import random
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

random is not used, please delete this import

import multiprocessing as mp
import pytest
import inspect
import numpy as np
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numpy is not used, please delete this import

import inspect
import numpy as np
import traceback
import signal
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

signal is imported, if needed, later on. We can delete this import.

# Import Basilisk modules
from Basilisk import __path__
from Basilisk.simulation import spiceInterface
from Basilisk.utilities import unitTestSupport
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are not using unitTestSupport, we can delete this import

===========================================

This script tests the thread safety of the SPICE interface, specifically addressing
GitHub issue #220 where parallel simulations using SPICE were causing deadlocks and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make this part of the documentation a hyperlink to issue 220?

@schaubh schaubh requested a review from juan-g-bonilla May 13, 2025 17:12
@schaubh
Copy link
Contributor

schaubh commented May 13, 2025

Nice work, looks like we are there. @juan-g-bonilla , as you commented on the original issue with the suggestion to use a mutex, would you have time to take a look at this code?

Copy link
Contributor

@juan-g-bonilla juan-g-bonilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely done @Natsoulas ! Many of my comments are actually encouraging you to fix existing issues or even the original fix I proposed so we can get this class to its highest quality possible.

Thanks for the tag @schaubh .

Comment on lines +94 to +95
static std::mutex kernel_manipulation_mutex; //!< -- Mutex to protect SPICE kernel loading/unloading operations
static std::map<std::string, int> kernel_reference_counter; //!< -- Reference counter for SPICE kernels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably make these variables camelCase. (sorry I originally suggested snake_case!).

Comment on lines +94 to +95
static std::mutex kernel_manipulation_mutex; //!< -- Mutex to protect SPICE kernel loading/unloading operations
static std::map<std::string, int> kernel_reference_counter; //!< -- Reference counter for SPICE kernels
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::unordered_map is probably better here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, variables should be lower camel case.

@@ -40,7 +41,7 @@ class SpiceInterface: public SysModel {
public:
SpiceInterface();
~SpiceInterface();

void UpdateState(uint64_t CurrentSimNanos);
int loadSpiceKernel(char *kernelName, const char *dataPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should change these methods into:

int loadSpiceKernel(std::string_view kernelName, std::string_view dataPath);
int unloadSpiceKernel(std::string_view kernelName, std::string_view dataPath);

or at least std::string. There's no reason why we should be passing char*.

This will also save you a lot of headache on the implementation (see later comment).

Comment on lines +85 to +100
// Create vector instead of variable-length array for Windows compatibility
std::vector<char> kernelNameVec(this->charBufferSize);
char* kernelName = kernelNameVec.data();

// Unload the SPICE kernels in reverse order of loading
strcpy(kernelName, "de430.bsp");
unloadSpiceKernel(kernelName, this->SPICEDataPath.c_str());

strcpy(kernelName, "de-403-masses.tpc");
unloadSpiceKernel(kernelName, this->SPICEDataPath.c_str());

strcpy(kernelName, "pck00010.tpc");
unloadSpiceKernel(kernelName, this->SPICEDataPath.c_str());

strcpy(kernelName, "naif0012.tls");
unloadSpiceKernel(kernelName, this->SPICEDataPath.c_str());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change the signature of unloadSpiceKernel (see previous comment), you'll be able to call:

unloadSpiceKernel("de430.bsp", this->SPICEDataPath);

Comment on lines +128 to +147
std::vector<char> kernelNameVec(this->charBufferSize);
char* kernelName = kernelNameVec.data();

// Load the required SPICE kernels - they will only be loaded once per kernel
// across all threads due to our reference counting mechanism
strcpy(kernelName, "naif0012.tls");
if(loadSpiceKernel(kernelName, this->SPICEDataPath.c_str())) {
bskLogger.bskLog(BSK_ERROR, "Unable to load %s", kernelName);
}
if(loadSpiceKernel((char *)"pck00010.tpc", this->SPICEDataPath.c_str())) {
bskLogger.bskLog(BSK_ERROR, "Unable to load %s", "pck00010.tpc");

strcpy(kernelName, "pck00010.tpc");
if(loadSpiceKernel(kernelName, this->SPICEDataPath.c_str())) {
bskLogger.bskLog(BSK_ERROR, "Unable to load %s", kernelName);
}
if(loadSpiceKernel((char *)"de-403-masses.tpc", this->SPICEDataPath.c_str())) {
bskLogger.bskLog(BSK_ERROR, "Unable to load %s", "de-403-masses.tpc");

strcpy(kernelName, "de-403-masses.tpc");
if(loadSpiceKernel(kernelName, this->SPICEDataPath.c_str())) {
bskLogger.bskLog(BSK_ERROR, "Unable to load %s", kernelName);
}
if(loadSpiceKernel((char *)"de430.bsp", this->SPICEDataPath.c_str())) {
bskLogger.bskLog(BSK_ERROR, "Unable to load %s", "de430.tpc");

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might want to refactor this into:

auto kernelNames = {
    "naif0012.tls", "pck00010.tpc", "de-403-masses.tpc", "de430.bsp"
};
for (auto kernelName : kernelNames)
{
    if(loadSpiceKernel(kernelName, this->SPICEDataPath)) {
        bskLogger.bskLog(BSK_ERROR, "Unable to load %s", kernelName);
    }
}

actually a similar refactor is also possible above for unloadSpiceKernels. You might even make the kernelNames a class-wide static variable so we don't need to update both loading and unloading methods to keep the kernels in sync.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, this is a cleaner solution.

Comment on lines +476 to +486
// Create vectors instead of raw pointers for Windows compatibility
std::vector<char> fileNameVec(this->charBufferSize);
char* fileName = fileNameVec.data();

std::vector<char> nameVec(this->charBufferSize);
char* name = nameVec.data();

//! - The required calls come from the SPICE documentation.
//! - The most critical call is furnsh_c
strcpy(name, "REPORT");
erract_c("SET", this->charBufferSize, name);
// Create the full filepath
strcpy(fileName, dataPath);
strcat(fileName, kernelName);
furnsh_c(fileName);

//! - Check to see if we had trouble loading a kernel and alert user if so
strcpy(name, "DEFAULT");
erract_c("SET", this->charBufferSize, name);
delete[] fileName;
delete[] name;
if(failed_c()) {
return 1;
std::string filepath(fileName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you refactor the inputs into std::string_view, then:

std::string filepath = std::string{dataPath} + std::string{kernelName);

Comment on lines +495 to +508
if (kernel_reference_counter.at(filepath) <= 0) {
// The required calls come from the SPICE documentation.
// The most critical call is furnsh_c
strcpy(name, "REPORT");
erract_c("SET", this->charBufferSize, name);
furnsh_c(fileName);

// Check to see if we had trouble loading a kernel
strcpy(name, "DEFAULT");
erract_c("SET", this->charBufferSize, name);

if(failed_c()) {
return 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job fetching the error from SPICE.

Comment on lines +528 to +537
std::vector<char> fileNameVec(this->charBufferSize);
char* fileName = fileNameVec.data();

//! - The required calls come from the SPICE documentation.
//! - The most critical call is furnsh_c
strcpy(name, "REPORT");
erract_c("SET", this->charBufferSize, name);
std::vector<char> nameVec(this->charBufferSize);
char* name = nameVec.data();

// Create the full filepath
strcpy(fileName, dataPath);
strcat(fileName, kernelName);
unload_c(fileName);
delete[] fileName;
delete[] name;
if(failed_c()) {
return 1;
std::string filepath(fileName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as above.

Comment on lines +579 to +580
std::vector<char> spiceOutputBufferVec(allowedOutputLength);
char* spiceOutputBuffer = spiceOutputBufferVec.data();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you're using std::vector to handle the memory of the underlying char*. This might just be personal preference, but I think:

std::unique_ptr<char[]> spiceOutputBuffer(new char[allowedOutputLength]);

is a bit more expressive and safe. Just a thought.

Maybe actually we should make allowedOutputLength a constexpr size_t so that we can simply do:

char spiceOutputBuffer[allowedOutputLength];

and thus we get rid of nasty dynamically allocated memory.

@@ -0,0 +1,339 @@
#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very comprehensive test, nicely done.

Just for own curiosity, have you run this test without your fix? We would want it to fail on that case, so that we're sure the test is actually correctly stressing the system. I remember this error is infamously tricky to reproduce, so we want to make sure this test is actually testing as we need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

Spice makes BSK simulations not thread safe
3 participants