28 lines
1.9 KiB
Markdown
28 lines
1.9 KiB
Markdown
|
|
# Story 18: Token-by-Token Streaming Responses
|
||
|
|
|
||
|
|
## User Story
|
||
|
|
As a user, I want to see the AI's response appear token-by-token in real-time (like ChatGPT), so that I get immediate feedback and know the system is working, rather than waiting for the entire response to appear at once.
|
||
|
|
|
||
|
|
## Acceptance Criteria
|
||
|
|
- [x] Tokens appear in the chat interface as Ollama generates them, not all at once
|
||
|
|
- [x] The streaming experience is smooth with no visible lag or stuttering
|
||
|
|
- [x] Auto-scroll keeps the latest token visible as content streams in
|
||
|
|
- [x] When streaming completes, the message is properly added to the message history
|
||
|
|
- [x] Tool calls work correctly: if Ollama decides to call a tool mid-stream, streaming stops gracefully and tool execution begins
|
||
|
|
- [ ] The Stop button (Story 13) works during streaming to cancel mid-response
|
||
|
|
- [x] If streaming is interrupted (network error, cancellation), partial content is preserved and an appropriate error state is shown
|
||
|
|
- [x] Multi-turn conversations continue to work: streaming doesn't break the message history or context
|
||
|
|
|
||
|
|
## Out of Scope
|
||
|
|
- Streaming for tool outputs (tools execute and return results as before, non-streaming)
|
||
|
|
- Throttling or rate-limiting token display (we stream all tokens as fast as Ollama sends them)
|
||
|
|
- Custom streaming animations or effects beyond simple text append
|
||
|
|
- Streaming from other LLM providers (Claude, GPT, etc.) - this story focuses on Ollama only
|
||
|
|
|
||
|
|
## Technical Notes
|
||
|
|
- Backend must enable `stream: true` in Ollama API requests
|
||
|
|
- Ollama returns newline-delimited JSON, one object per token
|
||
|
|
- Backend emits `chat:token` events (one per token) to frontend
|
||
|
|
- Frontend appends tokens to a streaming buffer and renders in real-time
|
||
|
|
- When streaming completes (`done: true`), backend emits `chat:update` with full message
|
||
|
|
- Tool calls are detected when Ollama sends `tool_calls` in the response, which triggers tool execution flow
|