storkit/.living_spec/stories/archive/18_streaming_responses.md
Dave 64d1b788be Story 18: Token-by-token streaming responses
- Backend: Added OllamaProvider::chat_stream() with newline-delimited JSON parsing
- Backend: Emit chat:token events for each token received from Ollama
- Backend: Added futures dependency and stream feature for reqwest
- Frontend: Added streamingContent state and chat:token event listener
- Frontend: Real-time token display with auto-scroll
- Frontend: Markdown and syntax highlighting support for streaming content
- Fixed all TypeScript errors (tsc --noEmit)
- Fixed all Biome warnings and errors
- Fixed all Clippy warnings
- Added comprehensive code quality documentation
- Added tsc --noEmit to verification checklist

Tested and verified:
- Tokens stream in real-time
- Auto-scroll works during streaming
- Tool calls interrupt streaming correctly
- Multi-turn conversations work
- Smooth performance with no lag
2025-12-27 16:50:18 +00:00


Story 18: Token-by-Token Streaming Responses

User Story

As a user, I want to see the AI's response appear token-by-token in real-time (like ChatGPT), so that I get immediate feedback and know the system is working, rather than waiting for the entire response to appear at once.

Acceptance Criteria

  • Tokens appear in the chat interface as Ollama generates them, not all at once
  • The streaming experience is smooth with no visible lag or stuttering
  • Auto-scroll keeps the latest token visible as content streams in
  • When streaming completes, the message is properly added to the message history
  • Tool calls work correctly: if Ollama decides to call a tool mid-stream, streaming stops gracefully and tool execution begins
  • The Stop button (Story 13) works during streaming to cancel mid-response
  • If streaming is interrupted (network error, cancellation), partial content is preserved and an appropriate error state is shown
  • Multi-turn conversations continue to work: streaming doesn't break the message history or context

Out of Scope

  • Streaming for tool outputs (tools execute and return results as before, non-streaming)
  • Throttling or rate-limiting token display (we stream all tokens as fast as Ollama sends them)
  • Custom streaming animations or effects beyond simple text append
  • Streaming from other LLM providers (Claude, GPT, etc.): this story focuses on Ollama only

Technical Notes

  • Backend must enable stream: true in Ollama API requests
  • Ollama returns newline-delimited JSON, one object per token
  • Backend emits chat:token events (one per token) to frontend
  • Frontend appends tokens to a streaming buffer and renders in real-time
  • When streaming completes (done: true), backend emits chat:update with full message
  • Tool calls are detected when Ollama sends tool_calls in the response, which triggers tool execution flow
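The parsing logic described above can be sketched as a small incremental NDJSON parser. The field names (message.content, done, message.tool_calls) follow the Ollama chat API as described in the notes; the parser class itself is illustrative (the real backend is Rust, per the commit log), and network chunks are assumed to arrive split at arbitrary byte boundaries, so an incomplete trailing line is carried over to the next chunk.

```typescript
// Illustrative incremental parser for Ollama's newline-delimited JSON
// stream. One JSON object per line; a chunk may end mid-line.
type StreamEvent =
  | { kind: "token"; text: string }
  | { kind: "tool_calls"; calls: unknown[] }
  | { kind: "done" };

class NdjsonParser {
  private partial = ""; // incomplete line carried across chunks

  // Feed one network chunk; returns the events completed by it.
  feed(chunk: string): StreamEvent[] {
    this.partial += chunk;
    const lines = this.partial.split("\n");
    this.partial = lines.pop() ?? ""; // last piece may be incomplete
    const events: StreamEvent[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      const obj = JSON.parse(line);
      if (obj.message?.tool_calls) {
        // A tool call interrupts streaming and triggers tool execution.
        events.push({ kind: "tool_calls", calls: obj.message.tool_calls });
      } else if (obj.message?.content) {
        events.push({ kind: "token", text: obj.message.content });
      }
      if (obj.done === true) events.push({ kind: "done" });
    }
    return events;
  }
}
```

Carrying the unfinished tail in partial is what makes the parser robust to chunks that split a JSON object across network reads, which is the normal case when reading a streamed HTTP body.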