[BUG] McpStreamableServerSession does not close server-side socket when client disconnects, causing CLOSE-WAIT leak and thread pool exhaustion #1021

@lxq19991111

Description

When using the MCP Java SDK's Streamable HTTP server transport (via spring-ai-starter-mcp-server-webmvc), the server-side socket is not properly closed after the client disconnects (sends TCP FIN). This causes connections to remain in CLOSE-WAIT state indefinitely, each holding a Tomcat worker thread. Under moderate load, the entire Tomcat thread pool is exhausted within seconds, making the server completely unresponsive to any new requests including health checks.

Environment

  • Spring AI: 1.1.4
  • MCP Java SDK: (bundled with Spring AI 1.1.4)
  • Java: JDK 25
  • Server: Tomcat 10.1.34 (WAR deployment via Spring Boot 3.4.1)
  • Transport: Streamable HTTP (spring.ai.mcp.server.protocol=STREAMABLE)
  • OS: Linux (Kubernetes pod, 2 CPU / 4GB RAM)

Configuration

spring:
  ai:
    mcp:
      server:
        type: SYNC
        protocol: STREAMABLE
        streamable-http:
          mcp-endpoint: /mcp
          keep-alive-interval: 0s

Steps to Reproduce

  1. Deploy an MCP Server with Streamable HTTP transport (WebMVC, SYNC mode)
  2. Have an external MCP client send requests to POST /mcp (initialize + tools/call)
  3. Client receives the tool response and closes the TCP connection (sends FIN)
  4. Repeat with multiple clients (or a single client with retry logic)
  5. Observe server-side socket states with ss -tnp | grep 8080
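The client side of these steps can be approximated with a raw socket. The sketch below is a hypothetical reproducer, not the client used in this report: the class name `McpDisconnectRepro`, the `clientInfo` values and the `2025-03-26` protocol version are assumptions, and it only sends the `initialize` call (the `tools/call` step is omitted). Host, port and path are whatever the server under test uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical reproducer: one POST to the MCP endpoint, then a client-side close. */
final class McpDisconnectRepro {

    /** Builds a raw HTTP/1.1 request carrying a JSON-RPC initialize call. */
    static String buildInitializeRequest(String host, int port, String path) {
        // Assumed payload; adjust protocolVersion and clientInfo to match your client.
        String body = "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"initialize\","
                + "\"params\":{\"protocolVersion\":\"2025-03-26\",\"capabilities\":{},"
                + "\"clientInfo\":{\"name\":\"close-wait-repro\",\"version\":\"0.0.1\"}}}";
        int length = body.getBytes(StandardCharsets.UTF_8).length;
        return "POST " + path + " HTTP/1.1\r\n"
                + "Host: " + host + ":" + port + "\r\n"
                + "Content-Type: application/json\r\n"
                + "Accept: application/json, text/event-stream\r\n"
                + "Content-Length: " + length + "\r\n"
                + "\r\n"
                + body;
    }

    /** Sends the request, reads the first chunk of the response, then hangs up. */
    static int sendAndClose(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            socket.setSoTimeout(5_000); // do not wait forever for a response
            OutputStream out = socket.getOutputStream();
            out.write(buildInitializeRequest(host, port, path).getBytes(StandardCharsets.UTF_8));
            out.flush();
            InputStream in = socket.getInputStream();
            return in.read(new byte[4096]); // number of bytes read before closing
        } // try-with-resources closes the socket from the client side here
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: McpDisconnectRepro <host> <port> [attempts]");
            return;
        }
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        int attempts = args.length > 2 ? Integer.parseInt(args[2]) : 200;
        for (int i = 0; i < attempts; i++) {
            int read = sendAndClose(host, port, "/mcp");
            System.out.println("attempt " + i + ": read " + read + " bytes, connection closed");
        }
    }
}
```

Run it against the server (for example `java McpDisconnectRepro.java localhost 8080 200`) and watch the socket states with the `ss` command from step 5.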

Observed Behavior

After the client closes the connection:

  • Server-side socket enters CLOSE-WAIT and is never closed
  • The Tomcat worker thread handling that request is never released back to the pool
  • Under load from a single upstream LB doing health-check retries, all 150 Tomcat threads are exhausted within ~30 seconds
  • New connections (including K8s readiness probes) queue in the TCP backlog and time out

$ ss -tlnp | grep 8080
LISTEN 151    150    *:8080    *:*

$ ss -tnp | grep 8080 | head -5
CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:47140
CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:42756
CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:47446
CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:50138
CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:43160

$ ss -tnp | grep 8080 | wc -l
150

$ curl --max-time 5 http://localhost:8080/health
curl: (28) Failed to connect to localhost port 8080: Connection timed out

All 150 connections are from the same upstream IP (load balancer), all in CLOSE-WAIT.
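For monitoring the leak over time, the `ss -tn` output can be tallied by state. This is a minimal sketch under the assumption that the state is the first whitespace-separated column, as in the output above; the class name `SocketStateTally` is hypothetical and the `ESTAB` line in the sample is invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts TCP connection states from captured `ss -tn` output lines. */
final class SocketStateTally {

    /** Returns state -> count, using the first column of each non-header line. */
    static Map<String, Integer> tally(List<String> ssLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : ssLines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("State")) {
                continue; // skip blank lines and the header row
            }
            String state = trimmed.split("\\s+")[0];
            counts.merge(state, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:47140",
                "CLOSE-WAIT 115  0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:42756",
                // Invented line, only to show a second state in the tally.
                "ESTAB      0    0  [::ffff:10.125.87.86]:8080  [::ffff:10.125.87.4]:50000");
        System.out.println(tally(sample)); // prints {CLOSE-WAIT=2, ESTAB=1}
    }
}
```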

Expected Behavior

When the client closes the TCP connection (sends FIN), the server should:

  1. Detect the peer shutdown (e.g., via an IOException on write, or SocketChannel.read(ByteBuffer) returning -1)
  2. Close the SSE stream / Reactor Sink associated with that session
  3. Remove the session from the internal session map
  4. Close the server-side socket
  5. Release the Tomcat thread back to the pool
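At the TCP level, steps 1 and 4 look roughly like the loopback sketch below. It only illustrates the socket semantics: inside Tomcat the application does not own the socket, so the real fix has to go through the Servlet and transport layers. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Loopback illustration of detecting a peer shutdown and closing our side. */
final class PeerShutdownDemo {

    /** Closes the channel and returns true if the peer has already shut down. */
    static boolean closeIfPeerGone(SocketChannel channel) throws IOException {
        // Blocking read: -1 means end-of-stream, i.e. the peer sent FIN.
        // Note: on a live connection this would consume one byte of client data.
        int n = channel.read(ByteBuffer.allocate(1));
        if (n == -1) {
            channel.close(); // closing our side lets the socket leave CLOSE-WAIT
            return true;
        }
        return false;
    }

    /** Client connects and closes; the server side detects it and closes too. */
    static boolean demo() {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            SocketChannel client = SocketChannel.open(server.getLocalAddress());
            try (SocketChannel accepted = server.accept()) {
                client.close(); // client hangs up first
                boolean detected = closeIfPeerGone(accepted);
                return detected && !accepted.isOpen();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("peer shutdown detected and socket closed: " + demo());
    }
}
```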

Root Cause Analysis

The MCP Streamable HTTP transport opens an SSE stream for each session. When the client disconnects:

  • The server-side Sinks.Many has no subscribers, but the stream is never terminated
  • The Servlet async context is never completed
  • The socket remains open on the server side (only the client has sent FIN)
  • Tomcat's NIO connector holds the thread waiting for the async context to complete

Impact

  • Severity: Critical — renders the server completely unresponsive
  • Makes rolling deployments impossible in production (new pods get flooded by retrying clients immediately after startup)
  • K8s readiness probes fail → pod marked unhealthy → never enters service
  • No automatic recovery — requires pod restart AND stopping upstream traffic simultaneously

Workaround

Set Tomcat's connection and keep-alive timeouts so that idle connections are force-closed, and raise the connection and thread limits:

server:
  tomcat:
    connection-timeout: 30000
    keep-alive-timeout: 30000
    max-connections: 200
    threads:
      max: 200

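This allows Tomcat to reclaim CLOSE-WAIT connections after 30 seconds, but it is not a proper fix; it only limits the damage window. Note that the server.tomcat.* properties only take effect with Spring Boot's embedded Tomcat. For a WAR deployed to a standalone Tomcat, the equivalent settings (connectionTimeout, keepAliveTimeout, maxConnections, maxThreads) are configured on the Connector in server.xml.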

Suggested Fix

The Streamable HTTP transport provider should register a listener for client disconnect events. In the WebMVC integration:

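import jakarta.servlet.AsyncEvent;
import jakarta.servlet.AsyncListener;

// When setting up the async response for SSE.
// cleanupSession(...) and sessionId are placeholders for the transport's own cleanup hook.
asyncContext.addListener(new AsyncListener() {
    @Override
    public void onComplete(AsyncEvent event) {
        cleanupSession(sessionId);
    }
    @Override
    public void onTimeout(AsyncEvent event) {
        cleanupSession(sessionId);
    }
    @Override
    public void onError(AsyncEvent event) {
        // I/O errors during async processing are reported here.
        cleanupSession(sessionId);
    }
    @Override
    public void onStartAsync(AsyncEvent event) {
        // Nothing to clean up when a new async cycle starts.
    }
});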

Or detect write failures when attempting to send data to the client and trigger session cleanup.
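The write-failure variant could look roughly like the following. This is a minimal sketch in plain Java, not the SDK's actual session handling: `WriteFailureCleanup`, its session map and the `cleanup` method are hypothetical stand-ins, and a plain `OutputStream` stands in for the SSE response stream.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical session registry that drops a session when a write to it fails. */
final class WriteFailureCleanup {

    // Session id -> response stream (stand-in for the real SSE stream).
    private final Map<String, OutputStream> sessions = new ConcurrentHashMap<>();

    void register(String sessionId, OutputStream out) {
        sessions.put(sessionId, out);
    }

    boolean isActive(String sessionId) {
        return sessions.containsKey(sessionId);
    }

    /** Sends one SSE event; returns false and cleans up if the client is gone. */
    boolean send(String sessionId, String data) {
        OutputStream out = sessions.get(sessionId);
        if (out == null) {
            return false; // unknown or already cleaned-up session
        }
        try {
            out.write(("data: " + data + "\n\n").getBytes(StandardCharsets.UTF_8));
            out.flush();
            return true;
        } catch (IOException e) {
            cleanup(sessionId); // a failed write is treated as a disconnect
            return false;
        }
    }

    /** Idempotent, so it is safe to call from several callbacks. */
    void cleanup(String sessionId) {
        OutputStream out = sessions.remove(sessionId);
        if (out != null) {
            try {
                out.close();
            } catch (IOException ignored) {
                // The stream is already broken; nothing more to do.
            }
        }
    }
}
```

An idle stream produces no writes, so this path only helps if periodic keep-alive messages or other traffic are being sent on the stream. Making `cleanup` idempotent lets the same method be shared with the AsyncListener callbacks above.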
