Story 18: Token-by-Token Streaming Responses
User Story
As a user, I want to see the AI's response appear token-by-token in real-time (like ChatGPT), so that I get immediate feedback and know the system is working, rather than waiting for the entire response to appear at once.
Acceptance Criteria
- Tokens appear in the chat interface as Ollama generates them, not all at once
- The streaming experience is smooth with no visible lag or stuttering
- Auto-scroll keeps the latest token visible as content streams in
- When streaming completes, the message is properly added to the message history
- Tool calls work correctly: if Ollama decides to call a tool mid-stream, streaming stops gracefully and tool execution begins
- The Stop button (Story 13) works during streaming to cancel mid-response
- If streaming is interrupted (network error, cancellation), partial content is preserved and an appropriate error state is shown
- Multi-turn conversations continue to work: streaming doesn't break the message history or context
Out of Scope
- Streaming for tool outputs (tools execute and return results as before, non-streaming)
- Throttling or rate-limiting token display (we stream all tokens as fast as Ollama sends them)
- Custom streaming animations or effects beyond simple text append
- Streaming from other LLM providers (Claude, GPT, etc.) - this story focuses on Ollama only
Technical Notes
- Backend must enable `stream: true` in Ollama API requests
- Ollama returns newline-delimited JSON, one object per token
- Backend emits `chat:token` events (one per token) to the frontend
- Frontend appends tokens to a streaming buffer and renders in real time
- When streaming completes (`done: true`), backend emits `chat:update` with the full message
- Tool calls are detected when Ollama sends `tool_calls` in the response, which triggers the tool execution flow

The sketches below outline one possible implementation of these notes.
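First, a minimal backend sketch, assuming a Node 18+ runtime with built-in `fetch`. The `sendToRenderer` helper and the `llama3.1` model name are hypothetical stand-ins for whatever backend-to-frontend channel (IPC, WebSocket) and model the app actually uses. It posts to Ollama's `/api/chat` endpoint with `stream: true`, parses the newline-delimited JSON, emits one `chat:token` event per chunk, and handles `done: true` and `tool_calls` as the notes describe.

```ts
// stream-chat.ts - a sketch, not the app's actual implementation.
// sendToRenderer is a hypothetical stand-in for the real backend-to-frontend channel.

interface OllamaChunk {
  message?: { role: string; content: string; tool_calls?: unknown[] };
  done: boolean;
}

declare function sendToRenderer(channel: string, payload: unknown): void;

export async function streamChat(
  messages: { role: string; content: string }[],
  signal?: AbortSignal,
): Promise<void> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model: "llama3.1", messages, stream: true }),
    signal,
  });
  if (!res.ok || !res.body) throw new Error(`Ollama error: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffered = ""; // holds a partial line between network reads
  let fullText = ""; // accumulates the complete assistant message

  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });

    // Ollama sends newline-delimited JSON: one complete object per line.
    const lines = buffered.split("\n");
    buffered = lines.pop() ?? ""; // keep any trailing partial line

    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk: OllamaChunk = JSON.parse(line);

      // Tool call mid-stream: stop streaming gracefully, hand off to tool execution.
      if (chunk.message?.tool_calls?.length) {
        sendToRenderer("chat:tool", chunk.message.tool_calls);
        return;
      }

      if (chunk.message?.content) {
        fullText += chunk.message.content;
        sendToRenderer("chat:token", chunk.message.content); // one event per token
      }

      // Final chunk: publish the assembled message so it joins the history.
      if (chunk.done) {
        sendToRenderer("chat:update", { role: "assistant", content: fullText });
      }
    }
  }
}
```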
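On the renderer side, a sketch of the streaming buffer and auto-scroll behavior. It assumes a plain DOM container (`#messages`) and a hypothetical `onRendererEvent` subscription helper mirroring the backend channel above; the real app's event API may differ.

```ts
// renderer.ts - sketch of token appending and auto-scroll.
// onRendererEvent is a hypothetical subscription helper; substitute the
// app's real IPC/WebSocket listener.

declare function onRendererEvent(
  channel: string,
  handler: (payload: unknown) => void,
): void;

const container = document.getElementById("messages")!;
let streamingEl: HTMLElement | null = null;
let streamingBuffer = "";

onRendererEvent("chat:token", (token) => {
  if (!streamingEl) {
    // First token of a new response: create the in-progress message bubble.
    streamingEl = document.createElement("div");
    streamingEl.className = "message assistant streaming";
    container.appendChild(streamingEl);
  }
  streamingBuffer += token as string;
  streamingEl.textContent = streamingBuffer; // simple text append, no effects

  // Auto-scroll keeps the latest token visible as content streams in.
  container.scrollTop = container.scrollHeight;
});

onRendererEvent("chat:update", (message) => {
  // Streaming finished: finalize the in-progress bubble and reset the buffer,
  // then add `message` to the message history store.
  streamingEl?.classList.remove("streaming");
  streamingEl = null;
  streamingBuffer = "";
});
```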
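Finally, for the Stop button and interruption criteria, one way to wire cancellation is an `AbortController` around the fetch in the first sketch: the tokens already emitted remain in the frontend buffer after the abort, so partial content is preserved by default and only the error state needs surfacing. The names here (`startStreaming`, `stopStreaming`, `chat:error`) are illustrative assumptions, not the app's actual API.

```ts
// cancel.ts - sketch of Stop-button cancellation preserving partial content.
// Event names and helpers are illustrative assumptions.

declare function sendToRenderer(channel: string, payload: unknown): void;
declare function streamChat(
  messages: { role: string; content: string }[],
  signal?: AbortSignal,
): Promise<void>;

let controller: AbortController | null = null;

export async function startStreaming(
  messages: { role: string; content: string }[],
) {
  controller = new AbortController();
  try {
    await streamChat(messages, controller.signal);
  } catch (err) {
    // AbortError means the user pressed Stop; anything else is a
    // network/parse failure. Tokens already streamed stay in the frontend
    // buffer either way, so only the appropriate state needs to be shown.
    const reason = (err as Error).name === "AbortError" ? "cancelled" : "error";
    sendToRenderer("chat:error", { reason });
  } finally {
    controller = null;
  }
}

export function stopStreaming() {
  controller?.abort(); // Stop button (Story 13) cancels mid-response
}
```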