Bug Description
After a ClaudeSDKClient session ends (via client.disconnect() with timeout + force-kill fallback), the Python process continues to burn ~24% CPU indefinitely, even when completely idle with no active sessions.
The root cause is a leaked CLOSE_WAIT TCP socket to the Anthropic API that remains registered in the kqueue event loop. Since CLOSE_WAIT sockets are permanently "readable" (EOF pending), kqueue returns them as ready on every poll cycle, causing the asyncio event loop to busy-spin.
How This Differs from #378
Issue #378 describes close() hanging during the call due to _deliver_cancellation spinning. Our issue is about what happens after — even when disconnect completes or times out and the subprocess is force-killed:
- A TCP socket to the Anthropic API remains in CLOSE_WAIT state
- The socket FD stays registered in kqueue
- The asyncio event loop spins polling this permanently-readable FD
- CPU stays at ~24% with the process doing absolutely nothing
Reproduction
We run a long-lived FastAPI daemon that uses ClaudeSDKClient for periodic tasks. Between tasks, the daemon should be near 0% CPU.
    # Simplified pattern
    import asyncio
    import os
    import signal

    from claude_agent_sdk import ClaudeAgentOptions, ClaudeSDKClient

    client = ClaudeSDKClient(ClaudeAgentOptions(max_turns=5))
    # ... use client ...

    # Disconnect with timeout (workaround from #378)
    try:
        await asyncio.wait_for(client.disconnect(), timeout=5.0)
    except asyncio.TimeoutError:
        # Force-kill the subprocess (subprocess_pid is the PID of the CLI child process)
        os.kill(subprocess_pid, signal.SIGKILL)

After this sequence, lsof shows the leaked socket:
Python 71226 user 13u IPv6 ... TCP [local]:59274->[2600:9000:2134:...]:https (CLOSE_WAIT)
And sample confirms kqueue spin:
789/889 samples in:
select_kqueue_control_impl → kevent (should be blocking, but returns immediately)
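The lsof check above can be automated so a daemon can log leaked sockets on its own. The following is a minimal sketch, not part of the SDK: the helper names `parse_close_wait_fds` and `close_wait_fds` are hypothetical, and it assumes `lsof` is on the PATH and prints its usual column layout (COMMAND, PID, USER, FD, ...).

```python
import os
import subprocess
from typing import Optional


def parse_close_wait_fds(lsof_output: str) -> list[int]:
    """Extract FD numbers from lsof lines that end in (CLOSE_WAIT)."""
    fds = []
    for line in lsof_output.splitlines():
        fields = line.split()
        if len(fields) < 5 or not line.rstrip().endswith("(CLOSE_WAIT)"):
            continue
        # The 4th column is the FD plus a mode suffix, e.g. "13u".
        digits = "".join(ch for ch in fields[3] if ch.isdigit())
        if digits:
            fds.append(int(digits))
    return fds


def close_wait_fds(pid: Optional[int] = None) -> list[int]:
    """Run lsof for one process and return its CLOSE_WAIT TCP socket FDs."""
    pid = os.getpid() if pid is None else pid
    result = subprocess.run(
        ["lsof", "-a", "-p", str(pid), "-iTCP", "-sTCP:CLOSE_WAIT", "-n", "-P"],
        capture_output=True,
        text=True,
        check=False,  # lsof exits non-zero when nothing matches
    )
    return parse_close_wait_fds(result.stdout)
```

Calling `close_wait_fds()` after each disconnect and logging a non-empty result would show whether the leak happens on every session or only on the force-kill path.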
Evidence
- Process state: Daemon fully idle — no active sessions, no scheduled tasks running
- CPU: 23.7% sustained, for over 1 hour
- lsof output: CLOSE_WAIT TCP socket (FD 13) to Anthropic API endpoint, never closed
- sample output: 88.7% of samples in the kevent call, but the CPU is not idle; kqueue returns immediately because of the permanently-readable CLOSE_WAIT socket
- No orphaned pipes: Subprocess pipes were properly closed (we implemented a workaround for that). The socket is from the SDK's internal HTTP transport, not the subprocess stdio.
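The CPU figure above can also be checked from inside the process, which is handy as a regression test. This is a rough sketch under the assumption that the daemon runs a single asyncio loop; the function name `idle_cpu_fraction` is hypothetical. It compares CPU time consumed by the process against wall-clock time across a sleep, so a healthy idle loop should report a value near zero and a spinning loop a much larger one.

```python
import asyncio
import time


async def idle_cpu_fraction(window: float = 1.0) -> float:
    """Return process CPU time divided by wall time over an idle window."""
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    # While this coroutine sleeps, an idle loop should block in the selector.
    await asyncio.sleep(window)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.monotonic() - wall_start
    return cpu_used / wall_used
```

Note that `time.process_time()` counts CPU across all threads of the process, so background threads doing real work would inflate the reading.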
Root Cause Analysis
The SDK (or its HTTP transport layer) opens HTTPS connections to the Anthropic API. When the remote server closes the connection (sends FIN):
- The local TCP stack ACKs the FIN → socket enters CLOSE_WAIT
- The SDK never calls close() on the socket
- The socket's FD remains registered in kqueue (via asyncio's event loop)
- kqueue reports it as readable every poll cycle (EOF is pending)
- asyncio event loop never blocks → CPU spin
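The readiness behaviour described in these steps can be reproduced without the SDK. The sketch below is an illustration only, using a loopback TCP connection and the standard `selectors` module (which uses kqueue on macOS and epoll on Linux); the function name `count_ready_polls` is made up for this example. The "remote" side closes first, leaving the client socket in CLOSE_WAIT, and the selector then reports it as readable on every poll until it is closed.

```python
import selectors
import socket


def count_ready_polls(polls: int = 5) -> int:
    """Count how many polls report a CLOSE_WAIT socket as readable."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    peer, _ = server.accept()
    peer.close()  # peer sends FIN; client is now in CLOSE_WAIT

    sel = selectors.DefaultSelector()
    sel.register(client, selectors.EVENT_READ)
    ready = 0
    try:
        for _ in range(polls):
            # Nothing ever reads or closes the socket, so it stays readable.
            if sel.select(timeout=1.0):
                ready += 1
    finally:
        sel.unregister(client)
        sel.close()
        client.close()
        server.close()
    return ready
```

Each `select()` call returns immediately instead of blocking for the timeout, which is the same pattern the sample output shows for kevent.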
The asyncio.wait_for() workaround from #378 doesn't help here because:
- The socket leak is independent of the task group cancellation issue
- Even after force-killing the subprocess and closing its pipes, the HTTP socket persists
- As noted in the comments on #378, anyio cancellation doesn't propagate cleanly through asyncio
Suggested Fix
The SDK's transport layer should:
- Track all opened sockets (HTTP connections to the API, not just subprocess pipes)
- Close them in transport.close(): ensure close() on the TCP socket is called
- Deregister from the event loop: remove the FD from kqueue/epoll before closing
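The three steps above could look roughly like the following. This is a sketch under stated assumptions, not the SDK's actual transport code: the `TrackedSockets` class is hypothetical, and it assumes the sockets were registered with the loop directly via `add_reader`/`add_writer` on a selector-based event loop.

```python
import asyncio
import socket


class TrackedSockets:
    """Hypothetical registry of sockets that must be closed on shutdown."""

    def __init__(self) -> None:
        self._socks: set[socket.socket] = set()

    def track(self, sock: socket.socket) -> socket.socket:
        self._socks.add(sock)
        return sock

    def close_all(self, loop: asyncio.AbstractEventLoop) -> int:
        """Deregister each open socket from the loop, then close it."""
        closed = 0
        for sock in list(self._socks):
            fd = sock.fileno()
            if fd != -1:  # -1 means the socket is already closed
                # Remove the FD from the selector first so it is never
                # polled after close.
                loop.remove_reader(fd)
                loop.remove_writer(fd)
                sock.close()
                closed += 1
        self._socks.clear()
        return closed
```

If a socket is owned by an asyncio transport rather than registered directly, closing or aborting that transport is the supported route, since the loop refuses `remove_reader` on FDs that a transport is using.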
Alternatively, a defensive cleanup in Query.close():
    async def close(self) -> None:
        self._closed = True
        if self._tg:
            self._tg.cancel_scope.cancel()
            with suppress(anyio.get_cancelled_exc_class()):
                try:
                    with anyio.fail_after(5.0):
                        await self._tg.__aexit__(None, None, None)
                except TimeoutError:
                    pass
        await self.transport.close()
        # Defensive: close any remaining sockets to prevent kqueue spin
        self._close_leaked_fds()

Environment
- claude-agent-sdk: 0.1.45
- Python: 3.13.5
- Platform: macOS 15.6.1 (Darwin 24.6.0), ARM64 (Apple Silicon)
- Event loop: asyncio with kqueue selector
- Use case: Long-running FastAPI daemon with periodic ClaudeSDKClient sessions
Related
- #378: Query.close() can hang indefinitely causing 100% CPU usage due to missing timeout on task group cleanup (same family of bugs, different manifestation)