storkit/.living_spec/stories/18_streaming_responses.md
Dave bb700ce870 feat: Backend cancellation support for interrupting model responses
Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! for racing requests vs cancellation
- When cancelled, HTTP request to Ollama is dropped and returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready
2025-12-27 15:36:58 +00:00


Story: Token-by-Token Streaming Responses

User Story

As a User, I want to see the model's response appear token-by-token as it is generated, so that I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.

Acceptance Criteria

  • Model responses should appear token-by-token in real-time as Ollama generates them
  • The streaming should feel smooth and responsive (like ChatGPT's typing effect)
  • Tool calls should still work correctly with streaming enabled
  • The user should see partial responses immediately, not wait for full completion
  • Streaming should work for both text responses and responses that include tool calls
  • Error handling should gracefully handle streaming interruptions
  • The UI should auto-scroll to follow new tokens as they appear

Out of Scope

  • Configurable streaming speed/throttling
  • Showing thinking/reasoning process separately (that could be a future enhancement)
  • Streaming for tool outputs (tool outputs can remain non-streaming)

Implementation Notes

Backend (Rust)

  • Change stream: false to stream: true in the Ollama request
  • Parse streaming JSON response from Ollama (newline-delimited JSON)
  • Emit chat:token events for each token received
  • Handle both streaming text and tool call responses
  • Use reqwest with its stream feature enabled, so the response body can be read incrementally via bytes_stream()
  • Consider using futures::StreamExt for async stream processing

Frontend (TypeScript)

  • Listen for chat:token events
  • Append tokens to the current assistant message in real-time
  • Update the UI state without full re-renders (performance)
  • Maintain smooth auto-scroll as tokens arrive
  • Handle the transition from streaming text to tool calls
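
The "append tokens to the current assistant message" step can be sketched as a small pure function. This is a minimal illustration, not the project's actual state code: the Message shape, the appendToken name, and the chatLog variable are all assumptions made for the example. How chat:token events reach the frontend is left out, since the story does not specify it.

```typescript
// Hypothetical message shape; the real frontend type may differ.
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Returns a new list with `token` appended to the trailing assistant message,
// or starts a new assistant message when the last entry is not from the assistant.
// Only the last entry is copied, so earlier messages keep their identity and
// a UI framework can skip re-rendering them.
function appendToken(messages: readonly Message[], token: string): Message[] {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "assistant") {
    return [...messages.slice(0, -1), { ...last, content: last.content + token }];
  }
  return [...messages, { role: "assistant", content: token }];
}

// Usage: pass the payload of each chat:token event through the function.
let chatLog: Message[] = [{ role: "user", content: "Hi" }];
for (const token of ["Hello", " world", "!"]) {
  chatLog = appendToken(chatLog, token);
}
```

After the loop, chatLog holds the original user message followed by a single assistant message whose content is "Hello world!".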

Ollama Streaming Format

Ollama returns newline-delimited JSON when streaming:

{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
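
One practical consequence of this format is that a network read may end in the middle of a line, so the parser has to hold back the unfinished tail until the next read arrives. A minimal sketch of that buffering follows, written in TypeScript for brevity; the same logic would apply on the Rust backend. The ChatChunk and NdjsonBuffer names are hypothetical, and the chunk type covers only the fields shown in the sample above.

```typescript
// One streamed line, limited to the fields shown in the sample above.
interface ChatChunk {
  message: { role: string; content: string };
  done: boolean;
}

// Accumulates raw text and returns one parsed object per complete line.
class NdjsonBuffer {
  private buffer = "";

  push(text: string): ChatChunk[] {
    this.buffer += text;
    const lines = this.buffer.split("\n");
    // The last element is "" if the text ended on a newline, otherwise a partial line.
    this.buffer = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as ChatChunk);
  }
}

// Usage: the second object is split across two reads.
const ndjson = new NdjsonBuffer();
const firstRead = ndjson.push(
  '{"message":{"role":"assistant","content":"Hello"},"done":false}\n{"message":{"role":"assi',
);
const secondRead = ndjson.push('stant","content":" world"},"done":false}\n');
const streamedText = [...firstRead, ...secondRead]
  .map((chunk) => chunk.message.content)
  .join("");
```

Each read yields exactly one complete object here, and streamedText ends up as "Hello world". A malformed line makes JSON.parse throw, so a real implementation would need to decide whether to skip the line or abort the stream.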

Challenges

  • Parsing streaming JSON (each line is a separate JSON object)
  • Maintaining state between streaming chunks
  • Handling tool calls that interrupt streaming text
  • Performance with high token throughput
  • Error recovery if stream is interrupted
  • Related: Functional Spec: UI/UX (specifically mentions streaming as deferred)
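
Several of these challenges (state between chunks, tool calls arriving mid-stream, detecting an interrupted stream) can be handled by folding each chunk into one running state value. The sketch below is an assumption-heavy illustration: the tool_calls field follows what Ollama's chat API is understood to emit, but its exact shape should be checked against the Ollama documentation, and StreamState, applyChunk, wasInterrupted, and the read_file tool name are all hypothetical.

```typescript
// Assumed chunk shape; `tool_calls` is optional and should be verified against Ollama's docs.
interface ToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}
interface StreamChunk {
  message: { role: string; content: string; tool_calls?: ToolCall[] };
  done: boolean;
}
interface StreamState {
  content: string;
  toolCalls: ToolCall[];
  done: boolean;
}

const initialState: StreamState = { content: "", toolCalls: [], done: false };

// Folds one chunk into the running state without mutating the previous state.
function applyChunk(state: StreamState, chunk: StreamChunk): StreamState {
  return {
    content: state.content + chunk.message.content,
    toolCalls: [...state.toolCalls, ...(chunk.message.tool_calls ?? [])],
    done: state.done || chunk.done,
  };
}

// A stream that closed before any done:true chunk arrived was interrupted.
function wasInterrupted(state: StreamState): boolean {
  return !state.done;
}

// Usage: text followed by a tool call, with no done:true chunk yet.
const chunks: StreamChunk[] = [
  { message: { role: "assistant", content: "Checking" }, done: false },
  {
    message: {
      role: "assistant",
      content: "",
      // "read_file" is a made-up tool name for illustration.
      tool_calls: [{ function: { name: "read_file", arguments: { path: "a.txt" } } }],
    },
    done: false,
  },
];
const partialState = chunks.reduce(applyChunk, initialState);
const finishedState = applyChunk(partialState, {
  message: { role: "assistant", content: "" },
  done: true,
});
```

If the connection drops after the second chunk, wasInterrupted(partialState) is true and the UI can keep the partial text while reporting the error. Once the final chunk is applied, wasInterrupted(finishedState) is false.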

Dependencies

  • Story 13 (interruption) should work with streaming
  • May need tokio-stream or similar for stream utilities

Testing Considerations

  • Test with long responses to verify smooth streaming
  • Test with responses that include tool calls
  • Test interruption during streaming
  • Test error cases (network issues, Ollama crashes)
  • Test performance with different token rates