mcp: add retry and replay to the Streamable HTTP implementation #86
Conversation
Adds exponential backoff and jitter to the client-side POST. Implements replay using Last-Event-ID for resumability. Since these concepts are intertwined, I have added them in a single CL. For #10
Force-pushed from e28d524 to 807b9c0

"fmt"
"io"
"math/rand"
Use math/rand/v2
(did an LLM help write this? We need to teach them about math/rand/v2!)
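For reference, here is roughly what that line would look like on rand/v2 (a sketch only, not code from this PR; the helper name is made up). rand/v2 seeds itself, so the transport would no longer need to carry a rand.Source:

```go
import (
	"math/rand/v2"
	"time"
)

// jitteredBackoff returns the base delay plus jitter of up to half of it.
// rand.N is generic over integer types, so it accepts time.Duration directly;
// it panics if its argument is <= 0, so this assumes backoff is at least 2ns.
func jitteredBackoff(backoff time.Duration) time.Duration {
	return backoff + rand.N(backoff/2)
}
```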
@@ -602,6 +615,13 @@ func NewStreamableClientTransport(url string, opts *StreamableClientTransportOpt
t := &StreamableClientTransport{url: url}
if opts != nil {
	t.opts = *opts
} else {
No need for this: opts are already the zero value.
s.mu.Lock()
defer s.mu.Unlock()
if s.err != nil {
	return nil, s.err // Return explicit error if connection closed due to error
This looks like a logically distinct change. Given the size and complexity of this PR, it would be good to separate out those changes that are unrelated to replay support into one or more separate CLs, with tests.
if currentSessionID == "" && gotSessionID != "" {
	s.sessionID.Store(gotSessionID)
}
// Undefined behavior when currentSessionID != gotSessionID
What is the rationale for undefined behavior here? Why isn't this an error?
	// Continue
}

gotSessionID, sendErr := s.postMessage(ctx, currentSessionID, msgToSend)
I think there's a fundamental problem here (fixable, but fundamental): the spec says we can resume an SSE stream by issuing a subsequent GET request, but does not say that we should retry POST requests.
Imagine that the server-side is a stateful server (rare but possible): we don't want to perform duplicate actions. Instead, we should only resume hanging requests once we've received the response header as well as a nonempty event ID. Then we can issue a GET with that last-event-id to resume the streaming of responses.
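Roughly, the resume path would look something like this (a sketch only, not code from this PR; resumeSSE and the way the transport tracks sessionID and lastEventID are made up for illustration). The point is that we reconnect with a GET carrying Last-Event-ID rather than re-sending the POST:

```go
import (
	"context"
	"fmt"
	"net/http"
)

// resumeSSE reopens the server's event stream, asking it to replay events
// after lastEventID. The caller reads the replayed events from resp.Body.
func resumeSSE(ctx context.Context, client *http.Client, url, sessionID, lastEventID string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "text/event-stream")
	if sessionID != "" {
		req.Header.Set("Mcp-Session-Id", sessionID)
	}
	req.Header.Set("Last-Event-ID", lastEventID)

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	if resp.StatusCode != http.StatusOK {
		resp.Body.Close()
		return nil, fmt.Errorf("resume GET failed: %s", resp.Status)
	}
	return resp, nil
}
```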
case <-s.done:
	return // Connection is closed
case msg := <-s.pendingMessages:
	// Use a new context for each send attempt to allow individual retries to be cancelled
I don't understand this sentence. If the overall connection is cancelled, don't we want all the retries to be cancelled too? Why retain the ability to cancel each one separately?
// Apply exponential backoff with jitter
backoffDuration := s.initialBackoff * time.Duration(1<<uint(i))
jitter := time.Duration(s.randSource.Int63n(int64(backoffDuration / 2))) // Jitter up to half of backoff
I've also seen jitter implemented as picking the delay randomly between zero and backoffDuration. I wonder what the tradeoffs are of each way. It probably doesn't matter, but the other one is slightly easier to implement.
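For comparison, the two variants side by side (a sketch only, using rand/v2; helper names are made up, and both assume the delay is positive):

```go
import (
	"math/rand/v2"
	"time"
)

// Current approach: base delay plus up to half of it on top, i.e. [d, 1.5d).
// Assumes d >= 2ns, since rand.N panics on a non-positive argument.
func jitterAdditive(d time.Duration) time.Duration {
	return d + rand.N(d/2)
}

// "Full jitter": pick the whole delay uniformly in [0, d); one call, no addition.
func jitterFull(d time.Duration) time.Duration {
	return rand.N(d)
}
```

Full jitter spreads retries over a wider window, which can help de-correlate clients that all failed at the same moment; the additive form guarantees a minimum wait of d.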
case <-s.done:
	return // Connection is closed.
default:
	// Continue
omit comment
if sessionID == "" {
	// Session ID not yet established (first POST hasn't completed).
	// Wait and retry.
	time.Sleep(100 * time.Millisecond) // Avoid busy-waiting
I think you could avoid this if you kept sessionID in a chan string with capacity 1. This code would receive from the channel, blocking until it was ready. Other code that didn't want to block could do a select. Everyone who reads it would immediately put it back in the channel.
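A rough sketch of that pattern (type and method names are made up; assumes the ID is set exactly once, after the first successful POST):

```go
import "context"

// sessionIDHolder keeps the session ID in a channel of capacity 1 so that
// readers can block until it is established, and then put it back for others.
type sessionIDHolder struct {
	ch chan string
}

func newSessionIDHolder() *sessionIDHolder {
	return &sessionIDHolder{ch: make(chan string, 1)}
}

// set stores the session ID; must be called at most once.
func (h *sessionIDHolder) set(id string) { h.ch <- id }

// wait blocks until the session ID is available or ctx is cancelled.
func (h *sessionIDHolder) wait(ctx context.Context) (string, error) {
	select {
	case id := <-h.ch:
		h.ch <- id // put it back for the next reader
		return id, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// peek returns the session ID without blocking, or "" if not yet known
// (it may transiently return "" while another reader holds the value).
func (h *sessionIDHolder) peek() string {
	select {
	case id := <-h.ch:
		h.ch <- id
		return id
	default:
		return ""
	}
}
```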
Not sure if it's worth it.
case <-time.After(delay):
	retries++
	backoffDuration *= 2 // Exponential increase
	if backoffDuration > 30*time.Second { // Cap backoff duration
Shouldn't you cap it on the other side too?
case <-time.After(delay):
	retries++
	backoffDuration *= 2 // Exponential increase
	if backoffDuration > 30*time.Second { // Cap backoff duration
Why are these backoff delay computations so different between the client and server sides? Can they be pulled out into something more general?
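For example, something like this could be shared by both sides (a sketch only; the constants and jitter choice here are illustrative, not taken from this PR):

```go
import (
	"math/rand/v2"
	"time"
)

const (
	initialBackoff = 500 * time.Millisecond
	maxBackoff     = 30 * time.Second
)

// backoffDelay returns the delay before retry number attempt (0-based):
// exponential growth from initialBackoff, capped at maxBackoff, plus jitter
// of up to half the base delay.
func backoffDelay(attempt int) time.Duration {
	d := initialBackoff
	for i := 0; i < attempt && d < maxBackoff; i++ {
		d *= 2
	}
	if d > maxBackoff {
		d = maxBackoff
	}
	return d + rand.N(d/2)
}
```

Both loops could then select on time.After(backoffDelay(retries)) and drop their hand-rolled doubling and capping.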
	// Message successfully sent to incoming channel
case <-s.done:
	// Connection closed while trying to send incoming message
	return io.EOF
Should this be io.ErrUnexpectedEOF?
var httpErr *httpStatusError
if errors.As(err, &httpErr) {
	switch httpErr.StatusCode {
	case http.StatusRequestTimeout, // 408
OOC, where did this list come from?
}()

// Wait for all messages to be received, or timeout
allMessages := []string{}
var allMessages []string
My approach was wrong, oops. As Rob pointed out, we shouldn't retry POST. I will try again and send a separate PR.