storkit/.living_spec/stories/archive/18_streaming_responses.md
Dave 64d1b788be Story 18: Token-by-token streaming responses
- Backend: Added OllamaProvider::chat_stream() with newline-delimited JSON parsing
- Backend: Emit chat:token events for each token received from Ollama
- Backend: Added futures dependency and stream feature for reqwest
- Frontend: Added streamingContent state and chat:token event listener
- Frontend: Real-time token display with auto-scroll
- Frontend: Markdown and syntax highlighting support for streaming content
- Fixed all TypeScript errors (tsc --noEmit)
- Fixed all Biome warnings and errors
- Fixed all Clippy warnings
- Added comprehensive code quality documentation
- Added tsc --noEmit to verification checklist

Tested and verified:
- Tokens stream in real-time
- Auto-scroll works during streaming
- Tool calls interrupt streaming correctly
- Multi-turn conversations work
- Smooth performance with no lag
2025-12-27 16:50:18 +00:00


Story 18: Token-by-Token Streaming Responses

User Story

As a user, I want to see the AI's response appear token-by-token in real-time (like ChatGPT), so that I get immediate feedback and know the system is working, rather than waiting for the entire response to appear at once.

Acceptance Criteria

  • Tokens appear in the chat interface as Ollama generates them, not all at once
  • The streaming experience is smooth with no visible lag or stuttering
  • Auto-scroll keeps the latest token visible as content streams in
  • When streaming completes, the message is properly added to the message history
  • Tool calls work correctly: if Ollama decides to call a tool mid-stream, streaming stops gracefully and tool execution begins
  • The Stop button (Story 13) works during streaming to cancel mid-response
  • If streaming is interrupted (network error, cancellation), partial content is preserved and an appropriate error state is shown
  • Multi-turn conversations continue to work: streaming doesn't break the message history or context

Out of Scope

  • Streaming for tool outputs (tools execute and return results as before, non-streaming)
  • Throttling or rate-limiting token display (we stream all tokens as fast as Ollama sends them)
  • Custom streaming animations or effects beyond simple text append
  • Streaming from other LLM providers (Claude, GPT, etc.): this story focuses on Ollama only

Technical Notes

  • Backend must enable stream: true in Ollama API requests
  • Ollama returns newline-delimited JSON, one object per token
  • Backend emits chat:token events (one per token) to frontend
  • Frontend appends tokens to a streaming buffer and renders in real-time
  • When streaming completes (done: true), backend emits chat:update with full message
  • Tool calls are detected when Ollama sends tool_calls in the response, which triggers tool execution flow
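The parsing logic described above can be sketched as a small incremental NDJSON parser. The field names (message.content, done, message.tool_calls) follow the Ollama chat API as described in the notes; the parser class itself is illustrative (the real backend is Rust, per the commit log), and network chunks are assumed to arrive split at arbitrary byte boundaries, so an incomplete trailing line is carried over to the next chunk.

```typescript
// Illustrative incremental parser for Ollama's newline-delimited JSON
// stream. One JSON object per line; a chunk may end mid-line.
type StreamEvent =
  | { kind: "token"; text: string }
  | { kind: "tool_calls"; calls: unknown[] }
  | { kind: "done" };

class NdjsonParser {
  private partial = ""; // incomplete line carried across chunks

  // Feed one network chunk; returns the events completed by it.
  feed(chunk: string): StreamEvent[] {
    this.partial += chunk;
    const lines = this.partial.split("\n");
    this.partial = lines.pop() ?? ""; // last piece may be incomplete
    const events: StreamEvent[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      const obj = JSON.parse(line);
      if (obj.message?.tool_calls) {
        // A tool call interrupts streaming and triggers tool execution.
        events.push({ kind: "tool_calls", calls: obj.message.tool_calls });
      } else if (obj.message?.content) {
        events.push({ kind: "token", text: obj.message.content });
      }
      if (obj.done === true) events.push({ kind: "done" });
    }
    return events;
  }
}
```

Carrying the unfinished tail in partial is what makes the parser robust to chunks that split a JSON object across network reads, which is the normal case when reading a streamed HTTP body.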