storkit/.living_spec/stories/18_streaming_responses.md

# Story: Token-by-Token Streaming Responses
## User Story
**As a** User
**I want** to see the model's response appear token-by-token as it generates
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.
## Acceptance Criteria
* [ ] Model responses should appear token-by-token in real time as Ollama generates them
* [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
* [ ] Tool calls should still work correctly with streaming enabled
* [ ] The user should see partial responses immediately, not wait for full completion
* [ ] Streaming should work for both text responses and responses that include tool calls
* [ ] Error handling should gracefully handle streaming interruptions
* [ ] The UI should auto-scroll to follow new tokens as they appear
## Out of Scope
* Configurable streaming speed/throttling
* Showing thinking/reasoning process separately (that could be a future enhancement)
* Streaming for tool outputs (tool outputs can remain non-streaming)
## Implementation Notes
### Backend (Rust)
* Change `stream: false` to `stream: true` in the Ollama request
* Parse the streaming response from Ollama (newline-delimited JSON, one object per line)
* Emit `chat:token` events for each token received
* Handle both streaming text and tool call responses
* Use `reqwest` with streaming body support
* Consider using `futures::StreamExt` for async stream processing
### Frontend (TypeScript)
* Listen for `chat:token` events
* Append tokens to the current assistant message in real time
* Update the UI state without full re-renders (performance)
* Maintain smooth auto-scroll as tokens arrive
* Handle the transition from streaming text to tool calls
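
The frontend notes above can be sketched as two pure functions, kept independent of any UI framework or event API. This is a minimal sketch, not the project's implementation: `ChatMessage`, `appendToken`, and `isNearBottom` are hypothetical names, and following the stream only while the user is already near the bottom (with a 40 px threshold) is an assumed design choice rather than something the story specifies. Wiring `appendToken` to the `chat:token` listener is left out because the story does not name the event API.

```typescript
// Hypothetical message shape; the real app's type may differ.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Returns a new array in which only the trailing assistant message is replaced,
// so earlier message objects keep their identity and need not re-render.
function appendToken(messages: readonly ChatMessage[], token: string): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "assistant") {
    const updated: ChatMessage = { role: "assistant", content: last.content + token };
    return messages.slice(0, -1).concat([updated]);
  }
  // No assistant message is in progress yet, so start one with this token.
  const fresh: ChatMessage = { role: "assistant", content: token };
  return messages.concat([fresh]);
}

// True when the viewport is within `thresholdPx` of the bottom of the scroll area.
// Assumed policy: auto-scroll only in that case, so a user who scrolled up is not yanked back.
function isNearBottom(
  scrollTop: number,
  clientHeight: number,
  scrollHeight: number,
  thresholdPx: number = 40
): boolean {
  return scrollHeight - (scrollTop + clientHeight) <= thresholdPx;
}

// Usage: simulate three `chat:token` payloads arriving after a user message.
let messages: ChatMessage[] = [{ role: "user", content: "Hi" }];
for (const token of ["Hello", " world", "!"]) {
  messages = appendToken(messages, token);
}
const rendered = messages.map((m) => m.role + ": " + m.content).join("\n");
console.log(rendered);
```

Because both functions are pure, they can be unit-tested without a DOM; the caller would read `scrollTop`, `clientHeight`, and `scrollHeight` from the scroll container before appending and scroll afterwards only if `isNearBottom` returned true.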
### Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming, one object per line, with `"done": true` on the final object:
```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```
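
Since a network read can end in the middle of a line, a parser for this format has to hold back the unfinished tail until the rest arrives. The sketch below illustrates that buffering in TypeScript; the same logic applies to the Rust backend when it reads the response body in chunks. `NdjsonBuffer` and `OllamaChunk` are hypothetical names, and the interface covers only the fields shown in the example above.

```typescript
// Shape of one streamed line, limited to the fields in the example above.
interface OllamaChunk {
  message: { role: string; content: string };
  done: boolean;
}

// Hypothetical helper: accumulates raw text and returns one parsed object
// per complete line, keeping any unfinished trailing line for the next call.
class NdjsonBuffer {
  private pending = "";

  push(text: string): OllamaChunk[] {
    this.pending += text;
    const lines = this.pending.split("\n");
    // The last element is "" (input ended on a newline) or a partial line.
    this.pending = lines.pop() ?? "";
    const out: OllamaChunk[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed.length === 0) continue; // skip blank lines
      out.push(JSON.parse(trimmed) as OllamaChunk);
    }
    return out;
  }
}

// Usage: the second JSON object is split across two network reads.
const buf = new NdjsonBuffer();
const first = buf.push(
  '{"message":{"role":"assistant","content":"Hello"},"done":false}\n{"message":{"role":"assi'
);
const second = buf.push('stant","content":" world"},"done":true}\n');
const text = first.concat(second).map((c) => c.message.content).join("");
console.log(text);
```

A malformed line makes `JSON.parse` throw here; whether to abort the stream or skip the line is a policy decision this sketch leaves to the caller.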
### Challenges
* Parsing streaming JSON (each line is a separate JSON object, and a network chunk may end mid-line, so partial lines must be buffered)
* Maintaining state between streaming chunks
* Handling tool calls that interrupt streaming text
* Performance with high token throughput
* Error recovery if the stream is interrupted
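
One way to approach the state and interruption challenges above is to fold every parsed chunk into a single accumulator and to treat a stream that closes without a `done: true` chunk as interrupted. The following is a sketch under stated assumptions: the optional `tool_calls` array on `message` is assumed from Ollama's chat API and should be checked against the version in use, and `StreamState`, `applyChunk`, `finish`, and the `read_file` tool are hypothetical names used only for illustration.

```typescript
// Assumed tool-call shape (not shown in this story's example payload).
interface ToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}

interface StreamChunk {
  message: { role: string; content: string; tool_calls?: ToolCall[] };
  done: boolean;
}

// Everything that must survive between chunks.
interface StreamState {
  content: string;
  toolCalls: ToolCall[];
  done: boolean;
}

// Folds one chunk into the state without mutating the previous state.
function applyChunk(state: StreamState, chunk: StreamChunk): StreamState {
  return {
    content: state.content + chunk.message.content,
    toolCalls: state.toolCalls.concat(chunk.message.tool_calls ?? []),
    done: state.done || chunk.done,
  };
}

// Called once the transport closes; a missing `done: true` means the stream was cut off.
function finish(state: StreamState): {
  status: "complete" | "interrupted";
  content: string;
  toolCalls: ToolCall[];
} {
  return {
    status: state.done ? "complete" : "interrupted",
    content: state.content,
    toolCalls: state.toolCalls,
  };
}

// Usage: text followed by a tool call, with the connection dropping before `done`.
const received: StreamChunk[] = [
  { message: { role: "assistant", content: "Checking" }, done: false },
  {
    message: {
      role: "assistant",
      content: "",
      tool_calls: [{ function: { name: "read_file", arguments: { path: "README.md" } } }],
    },
    done: false,
  },
];

let state: StreamState = { content: "", toolCalls: [], done: false };
for (const chunk of received) {
  state = applyChunk(state, chunk);
}
const interrupted = finish(state);

// The same stream with a terminating chunk counts as complete.
const completed = finish(
  applyChunk(state, { message: { role: "assistant", content: "" }, done: true })
);
console.log(interrupted.status, completed.status);
```

Keeping the partial `content` in the interrupted result lets the UI leave the already-rendered text in place and mark the message as incomplete rather than discarding it.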
## Related Functional Specs
* Functional Spec: UI/UX (which explicitly lists streaming as deferred)
## Dependencies
* Story 13 (interruption) should work with streaming
* May need `tokio-stream` or similar for stream utilities
## Testing Considerations
* Test with long responses to verify smooth streaming
* Test with responses that include tool calls
* Test interruption during streaming
* Test error cases (network issues, Ollama crashes)
* Test performance with different token rates