Introduce a Single Threaded Stack Trace Sampler #2341

tduncan · 2025-06-16T19:34:43Z

This PR introduces a StackTraceSampler that uses a single thread to sample all threads associated with trace ids that have been selected for snapshotting. It replaces the existing ScheduledExecutorStackTraceSampler implementation, which used one sampling thread per trace id.
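For readers unfamiliar with the approach, here is a minimal sketch of the single-thread idea, not the code in this PR. The class name `SingleThreadSamplerSketch` and its methods are hypothetical; the only real API used is `ThreadMXBean.getThreadInfo(long[], int)`, which captures the stacks of several threads in one call.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one daemon thread samples every registered thread on each tick.
class SingleThreadSamplerSketch {
  private final ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
  // trace id -> id of the thread being sampled for that trace
  private final Map<String, Long> threadsByTraceId = new ConcurrentHashMap<>();
  private final long periodMillis;
  private volatile boolean running = true;

  SingleThreadSamplerSketch(long periodMillis) {
    this.periodMillis = periodMillis;
  }

  void start(String traceId, Thread thread) {
    threadsByTraceId.put(traceId, thread.getId());
  }

  void stop(String traceId) {
    threadsByTraceId.remove(traceId);
  }

  // A single call captures stack traces for all registered threads.
  ThreadInfo[] sampleOnce() {
    long[] ids = threadsByTraceId.values().stream().mapToLong(Long::longValue).toArray();
    if (ids.length == 0) {
      return new ThreadInfo[0]; // nothing is marked for sampling
    }
    return threadMxBean.getThreadInfo(ids, Integer.MAX_VALUE);
  }

  // Starts the one sampling thread shared by all traces.
  Thread launch() {
    Thread sampler = new Thread(() -> {
      while (running) {
        sampleOnce(); // a real sampler would stage the results here
        try {
          Thread.sleep(periodMillis);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          return;
        }
      }
    }, "sampler-sketch");
    sampler.setDaemon(true);
    sampler.start();
    return sampler;
  }

  void shutdown() {
    running = false;
  }
}
```

The number of sampling threads stays at one regardless of how many traces are selected, which is the contrast with a one-thread-per-trace-id design.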

laurit · 2025-06-26T08:34:00Z

.../src/main/java/com/splunk/opentelemetry/profiler/snapshot/DaemonThreadStackTraceSampler.java

+            } else if (command.action == Action.START) {
+              startSampling(command, traceThreads);
+            } else if (command.action == Action.STOP) {
+              stopSampling(command, traceThreads);
+            }


Scheduling the start and end events to a different thread means that you will not be able to accurately capture the start and end time of the trace. If you are unlucky, the stack trace taken for start/end could be captured when the thread is already servicing a different request. Would that be an issue?

This is how I understand the hypothetical:

1. Trace 1 starts
2. Thread 1 is requested to be sampled for Trace 1 (SnapshotProfilingSpanProcessor)
3. Thread 1 is associated with Trace 1:Span 1 (ActiveSpanTracker)
4. Trace 1 ends
5. Thread 1 is unassociated with Trace 1:Span 1 (ActiveSpanTracker)
6. Thread 1 is requested to no longer be sampled (SnapshotProfilingSpanProcessor)
7. Thread 1 is added for sampling (PeriodicStackTraceSampler)
8. Thread 1 is removed from sampling (PeriodicStackTraceSampler)

In the above a sample would be taken at both steps 7 and 8, at which point Thread 1 is doing whatever else (maybe idle, maybe processing a new request). Once the sample is taken a StackTrace instance is created using Trace 1's trace id and either an invalid span id or a span id from the next trace.

This is a problem. I'll figure out how to model it in a test.

@laurit, I think I've sorted out this scenario with the tests ensureStartAndStopSamplesAreAssociatedWithCorrectTraceAndSpanId and doNotStageStackTraceWhenThreadNoLongerAssociatedWithSameTraceId in PeriodicStackTraceSamplerTest.
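As an illustration of the guard those tests describe, here is a minimal sketch and not the PR's implementation. `TraceAssociationGuardSketch`, `activeTraceByThread`, and `traceIdForSample` are hypothetical names; the map stands in for whatever tracks the active span per thread.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: only stage a sample if the thread is still working on the sampled trace.
class TraceAssociationGuardSketch {
  // thread id -> trace id currently active on that thread (stand-in for an active span tracker)
  static final Map<Long, String> activeTraceByThread = new ConcurrentHashMap<>();

  /** Returns the trace id to attach to a sample, or empty if the thread has moved on. */
  static Optional<String> traceIdForSample(long threadId, String sampledTraceId) {
    String current = activeTraceByThread.get(threadId);
    return sampledTraceId.equals(current) ? Optional.of(sampledTraceId) : Optional.empty();
  }
}
```

With a check like this, a late sample for a thread that is idle or already serving another trace is dropped instead of being staged under the old trace id.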

laurit · 2025-07-07T12:20:18Z

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java

+        String traceId,
+        String spanId,
+        long currentTimestamp) {
+      Duration samplingPeriod = Duration.ofNanos(currentTimestamp - context.timestamp);


Since the context timestamp is updated both from the periodic thread and from the remove method you could probably end up with a negative period here. Imo you really shouldn't be using this for the source.event.period.

What would you suggest as an alternative? Ultimately the source.event.period is used as the stack frame execution time. The start and stop samples are very likely to be less than the periodic sampling period which keeps the total reported execution time relatively close to the reported span execution time.

I've switched the underlying field to an AtomicLong and added a check to prevent timestamps from the past from overwriting newer ones. If I understand the interaction correctly that should work. If the two sampling threads are so close that they are racing to set the field then the difference is probably negligible.
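A minimal sketch of that kind of guard, assuming a hypothetical `MonotonicTimestampSketch` holder rather than the actual SamplingContext field. It keeps the stored timestamp from moving backwards, but it does not by itself stop a thread from computing a period against a timestamp that another thread has already advanced.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a timestamp holder that never moves backwards.
class MonotonicTimestampSketch {
  private final AtomicLong lastSampleNanos;

  MonotonicTimestampSketch(long initialNanos) {
    this.lastSampleNanos = new AtomicLong(initialNanos);
  }

  /** Stores the candidate only if it is newer; returns the value held afterwards. */
  long advanceTo(long candidateNanos) {
    return lastSampleNanos.updateAndGet(previous -> Math.max(previous, candidateNanos));
  }

  long get() {
    return lastSampleNanos.get();
  }
}
```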

> What would you suggest as an alternative? Ultimately the source.event.period is used as the stack frame execution time.

In https://github.com/signalfx/gdi-specification/blob/main/specification/semantic_conventions.md#logrecord-message-text-data-format-specific-attributes the description for source.event.period says:

> MUST contain the sampling period in milliseconds if this LogRecord represents a periodic event

Here we have a combination of periodic and non-periodic activity, which imo does not align with the existing description. Both source.event.period and source.event.time have millisecond precision; is this enough for your use cases? Perhaps you should consider adding a new attribute, or you could consider altering the description of the existing attribute.
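To make the precision question concrete, here is a small sketch using only `java.time.Duration`. The helper `toPeriodMillis` is hypothetical and is not how the PR populates the attribute; it only shows what whole-millisecond reporting does to a short start or stop period.

```java
import java.time.Duration;

// Hypothetical sketch: reporting a measured period in whole milliseconds.
class PeriodPrecisionSketch {
  // Duration.toMillis() truncates, so any sub-millisecond remainder is lost.
  static long toPeriodMillis(Duration period) {
    return period.toMillis();
  }
}
```

A start or stop sample taken 400 microseconds after the previous sample would be reported with a period of 0 ms, while a 20.4 ms periodic sample would be reported as 20 ms.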

> I've switched the underlying field to an AtomicLong and added a check to prevent timestamps from the past from overwriting newer ones.

Let's assume that the context timestamp is set to T0. When stop and the periodic timer run at the same time, they'll capture the time as T1 and T2 respectively, so that T1 < T2. Now let's assume that the periodic timer completes first for some reason and updates the context timestamp to T2. Now the sampling period for stop is T1 - T2, which is a negative value.
I believe you could resolve this with some locking.
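The interleaving can be replayed deterministically on a single thread. This is a sketch of the arithmetic only, with a hypothetical `NegativePeriodReplay` class and made-up timestamps; it mirrors the period computation `Duration.ofNanos(currentTimestamp - context.timestamp)` from the diff.

```java
import java.time.Duration;

// Hypothetical sketch: replays "periodic completes first" with fixed timestamps.
class NegativePeriodReplay {
  /** Returns {periodic period, stop period} for captured times t1 (stop) and t2 (periodic). */
  static Duration[] replay(long t0, long t1, long t2) {
    long contextTimestamp = t0;

    // Both threads have already captured their timestamps: stop at t1, periodic at t2.
    // The periodic thread finishes first and advances the shared timestamp.
    Duration periodicPeriod = Duration.ofNanos(t2 - contextTimestamp);
    contextTimestamp = Math.max(contextTimestamp, t2);

    // The stop thread now computes its period against the already advanced timestamp.
    Duration stopPeriod = Duration.ofNanos(t1 - contextTimestamp);
    return new Duration[] {periodicPeriod, stopPeriod};
  }
}
```

With T0 = 0, T1 = 100 and T2 = 150, the periodic period is 150 ns and the stop period is -50 ns, even though the stored timestamp never moved backwards.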

I've been able to reproduce this inconsistently with the existing test, and indeed the sampling time is negative. What's interesting is that I've only been able to reproduce it with the test when using a lock. Without locks (using the same code as in this diff) the test doesn't fail.

I have been able to force the scenario with breakpoints and manually advancing the two competing threads so it is possible.

I'm curious how you think a lock may be used here. I've experimented with a couple of solutions. I'd like to avoid locking the sampling mechanism; it doesn't seem ideal to hold up the periodic sampling thread because a new thread is being added but maybe that's the only way?

> I'm curious how you think a lock may be used here.

If you lock around taking the end sample and taking the periodic sample, they can't run concurrently and see inconsistent state. For locking you could use a ReentrantReadWriteLock where the end sample uses the shared lock (read lock) and the periodic sample uses the exclusive lock (write lock). Or you could synchronize on the sampling context, or add a lock to it, if you wish to make it more fine-grained. My understanding is that you need to ensure that the periodic sample is finalized before the end sample is added, or that the periodic sample is not added at all.
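A minimal sketch of the per-context variant, assuming a hypothetical `LockedSamplingContextSketch` class with an injectable clock; it is not the SamplingContext from this PR. The key point is that the timestamp is read inside the lock, so the read, the period computation, and the update cannot interleave with another sampler.

```java
import java.time.Duration;
import java.util.function.LongSupplier;

// Hypothetical sketch: synchronizing on the sampling context itself.
class LockedSamplingContextSketch {
  private final LongSupplier clock; // e.g. System::nanoTime
  private long lastSampleNanos;     // guarded by "this"

  LockedSamplingContextSketch(LongSupplier clock) {
    this.clock = clock;
    this.lastSampleNanos = clock.getAsLong();
  }

  /** Reads the clock, computes the period, and updates the timestamp as one atomic step. */
  synchronized Duration sample() {
    long now = clock.getAsLong();
    Duration period = Duration.ofNanos(now - lastSampleNanos);
    lastSampleNanos = now;
    return period;
  }
}
```

Because the lock belongs to one context, the periodic thread is only held up for the thread whose end sample is in progress rather than for the whole sampling pass.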

If you wish to avoid locking, I guess you could keep state (e.g. sample time) in SamplingContext. Before you take the sample you read the state, and afterwards you use CAS to update it. If the CAS succeeds, all is fine. If it fails while you are taking a periodic sample, you can assume that the end sample was already added and discard the periodic sample. If the CAS fails for the end sample, you can retry taking the end sample (or recompute the sample period), or perhaps discard the end sample when the periodic sample is later than it.
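The lock-free variant could look roughly like this. It is a sketch under assumptions: `CasSamplingContextSketch`, `readState`, and `tryPublish` are hypothetical names, and the state is reduced to a single sample timestamp.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: read the state, take the sample, then publish with compare-and-set.
class CasSamplingContextSketch {
  private final AtomicLong lastSampleNanos;

  CasSamplingContextSketch(long initialNanos) {
    this.lastSampleNanos = new AtomicLong(initialNanos);
  }

  /** Step 1: read the previous sample time before taking the sample. */
  long readState() {
    return lastSampleNanos.get();
  }

  /**
   * Step 2: publish the sample. Returns the period if no other thread published in between,
   * or empty if the CAS failed and the caller must discard or retry.
   */
  Optional<Duration> tryPublish(long expectedPrevious, long sampleNanos) {
    if (lastSampleNanos.compareAndSet(expectedPrevious, sampleNanos)) {
      return Optional.of(Duration.ofNanos(sampleNanos - expectedPrevious));
    }
    return Optional.empty();
  }
}
```

On an empty result the periodic thread would simply drop its sample, while the thread taking the end sample would call `readState()` again and retry with a fresh timestamp.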

I'm not convinced we need both samples, and I don't think there is any preference as to which one to prioritize. If the two threads are racing to the point where one of the computed sampling periods is negative it MAY be OK to drop that sample. Both samples should be using the same previous sample timestamp so the "newer" sample would fully encompass the time range of the "older" sample.
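Dropping such samples can be sketched as a filter applied before staging. `NegativePeriodFilterSketch` and `stageable` are hypothetical names, and the samples are reduced to their periods for brevity.

```java
import java.time.Duration;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch: keep only samples whose computed period is not negative.
class NegativePeriodFilterSketch {
  static List<Duration> stageable(List<Duration> samplePeriods) {
    return samplePeriods.stream()
        .filter(period -> !period.isNegative())
        .collect(Collectors.toList());
  }
}
```

If both samples were measured from the same previous timestamp T0, the later one covers the interval from T0 to T2, which contains the earlier interval from T0 to T1, so no execution time goes unreported when the negative one is dropped.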

tduncan added 11 commits June 16, 2025 08:42

Add alternate StackTraceSampler that uses a single thread for taking samples for all threads associated with traces that have been selected for snapshotting.

62d6b8a

Use DaemonThreadStackTraceSampler instead of ScheduledExecutorStackTraceSampler.

7c68131

Remove ScheduledExecutorStackTraceSampler.

9f87434

Minor refactoring.

0a5a345

Extend StagingArea interface to include method for staging Collection of StackTrace instances.

cea10b7

Stage entire collection of stack traces rather than one at a time in DaemonThreadStackTraceSampler.

d8eae04

Remove the now unused StagingArea method to stage a single StackTrace.

2c18749

Add warning log message if stack trace is unable to be queued for staging in PeriodicallyExportingStagingArea.

ccbb11c

Remove shutdown flag.

ede1335

Apply spotless code formatting.

43b701a

Inline the method that accepted a single SamplingContext in DaemonThreadStackTraceSampler.

867269d

tduncan requested review from a team as code owners June 16, 2025 19:34

Merge branch 'main' into single-threaded-stack-trace-sampler

5533f3f

laurit reviewed Jun 26, 2025

tduncan added 4 commits June 27, 2025 09:12

Rename DaemonThreadStackTraceSampler to PeriodicStackTraceSampler.

1d8a953

Convert field to local variable.

c280a65

Avoid attempting to take stack trace samples if zero threads are marked for sampling.

1f27d7c

Merge 'main' into 'single-threaded-stack-trace-sampler'.

97415a9

tduncan requested a review from a team as a code owner July 1, 2025 17:40

tduncan added 10 commits July 1, 2025 15:01

Take start and stop samples on span thread rather than the periodic sampling thread.

ff0fa48

Reorder methods.

8df50f0

Prevent stack traces from being staged for threads that are no longer being profiled.

3fccf39

Merge branch 'main' into single-threaded-stack-trace-sampler

e0dfa96

Update thread name.

de7aacf

Remove debug System.out statements.

986aabd

Move staging single stack trace and failure logging into StagingArea/PeriodicallyExportingStagingArea.

c537595

Remove unused import.

856db3b

Rename class.

129c719

Add comment explaining the purpose of ThreadInfoCollector.

5e81b22

Rollback changes to tests to use Collection version of StagingArea.stage.

9418f12

laurit reviewed Jul 7, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java

laurit reviewed Jul 7, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

laurit reviewed Jul 7, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

laurit reviewed Jul 7, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

laurit reviewed Jul 7, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

tduncan added 10 commits July 7, 2025 09:44

Add explanatory comment.

c4e23eb

Mark timestamp field as volatile.

7622ee0

Use a CountDownLatch for signaling shutdown of the periodic stack trace sampling thread.

5a7e074

Add comment explaining the active span may have changed.

84c9e78

Merge branch 'main' into single-threaded-stack-trace-sampler

dd25b0e

Inline method.

337c325

Rename method parameter to sampleTimestamp.

d3b1395

Rename fields and method parameters.

8109131

Prevent SamplingContext sample times from being updated to a 'past' time value.

6d9248a

Take final stack trace sample after removing trace id from Map to avoid a sampling race condition between the periodic sampling thread and the removal thread.

86b35f4

laurit reviewed Jul 8, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

tduncan added 3 commits July 8, 2025 09:51

Combine comment lines.

2393f7d

Lock individual SamplingContext instances and prevent stack traces with negative sampling periods from being staged.

81e8104

Apply spotless code formatting.

e21c981

laurit reviewed Jul 10, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

laurit reviewed Jul 10, 2025

...iler/src/main/java/com/splunk/opentelemetry/profiler/snapshot/PeriodicStackTraceSampler.java Outdated

laurit approved these changes Jul 10, 2025

tduncan added 2 commits July 10, 2025 08:48

Make sampleTime not volatile.

4ee6d70

Fix typos in comment.

9772a79

laurit merged commit 184f837 into signalfx:main Jul 10, 2025
28 checks passed

github-actions bot locked and limited conversation to collaborators Jul 10, 2025

tduncan deleted the single-threaded-stack-trace-sampler branch July 10, 2025 17:24
