# Story: Token-by-Token Streaming Responses

## User Story

**As a** User
**I want** to see the model's response appear token-by-token as it generates
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.

## Acceptance Criteria

* [ ] Model responses should appear token-by-token in real-time as Ollama generates them
* [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
* [ ] Tool calls should still work correctly with streaming enabled
* [ ] The user should see partial responses immediately, not wait for full completion
* [ ] Streaming should work for both text responses and responses that include tool calls
* [ ] Error handling should gracefully handle streaming interruptions
* [ ] The UI should auto-scroll to follow new tokens as they appear

## Out of Scope

* Configurable streaming speed/throttling
* Showing thinking/reasoning process separately (that could be a future enhancement)
* Streaming for tool outputs (tool outputs can remain non-streaming)

## Implementation Notes

### Backend (Rust)

* Change `stream: false` to `stream: true` in Ollama request
* Parse streaming JSON response from Ollama (newline-delimited JSON)
* Emit `chat:token` events for each token received
* Handle both streaming text and tool call responses
* Use `reqwest` with streaming body support
* Consider using `futures::StreamExt` for async stream processing

### Frontend (TypeScript)

* Listen for `chat:token` events
* Append tokens to the current assistant message in real-time
* Update the UI state without full re-renders (performance)
* Maintain smooth auto-scroll as tokens arrive
* Handle the transition from streaming text to tool calls

### Ollama Streaming Format

Ollama returns newline-delimited JSON when streaming:

```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```

### Challenges

* Parsing streaming JSON (each line is a separate JSON object)
* Maintaining state between streaming chunks
* Handling tool calls that interrupt streaming text
* Performance with high token throughput
* Error recovery if stream is interrupted

## Related Functional Specs

* Functional Spec: UI/UX (specifically mentions streaming as deferred)

## Dependencies

* Story 13 (interruption) should work with streaming
* May need `tokio-stream` or similar for stream utilities

## Testing Considerations

* Test with long responses to verify smooth streaming
* Test with responses that include tool calls
* Test interruption during streaming
* Test error cases (network issues, Ollama crashes)
* Test performance with different token rates
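
### Sketch: Line Buffering for Newline-Delimited JSON

The parsing challenge noted above (each line is a separate JSON object, and state must be kept between streaming chunks) can be exercised without a running Ollama instance. The following is a minimal sketch, written in TypeScript for illustration only; this story places the actual parsing in the Rust backend, where the same buffering logic would apply. The names `OllamaChunk`, `NdjsonBuffer`, and `collectTokens` are hypothetical, and the chunk shape models only the fields shown in the streaming format example. The input is assumed to be already-decoded text.

```typescript
// Shape of one streamed chunk, modeled only on the fields shown in the
// "Ollama Streaming Format" example (assumption: other fields are ignored).
interface OllamaChunk {
  message?: { role: string; content: string };
  done: boolean;
}

// Hypothetical helper: buffers raw text and returns one parsed chunk
// per complete line, keeping any trailing partial line for the next call.
class NdjsonBuffer {
  private pending = "";

  // Feed a raw piece of the response body; returns every chunk it completes.
  push(text: string): OllamaChunk[] {
    this.pending += text;
    const lines = this.pending.split("\n");
    // The last element is "" (text ended on a newline) or a partial line.
    this.pending = lines.pop() ?? "";
    const chunks: OllamaChunk[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed.length === 0) continue; // skip blank lines
      chunks.push(JSON.parse(trimmed) as OllamaChunk);
    }
    return chunks;
  }

  // Call once the stream ends to parse a final line with no trailing newline.
  flush(): OllamaChunk[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length === 0 ? [] : [JSON.parse(rest) as OllamaChunk];
  }
}

// Hypothetical helper: concatenates token text from a sequence of raw pieces,
// which may split a JSON object at any position.
function collectTokens(pieces: string[]): { text: string; done: boolean } {
  const buffer = new NdjsonBuffer();
  let text = "";
  let done = false;
  const handle = (chunk: OllamaChunk): void => {
    text += chunk.message?.content ?? "";
    if (chunk.done) done = true;
  };
  for (const piece of pieces) buffer.push(piece).forEach(handle);
  buffer.flush().forEach(handle);
  return { text, done };
}
```

A test in this style can feed pieces that break in the middle of a JSON object to confirm that no token is lost or duplicated. Malformed lines would make `JSON.parse` throw, so a real implementation would likely need to catch that and report a stream error rather than abort silently.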