# Story: Token-by-Token Streaming Responses
## User Story
**As a** User
**I want** to see the model's response appear token-by-token as it is generated
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.

## Acceptance Criteria
* [ ] Model responses should appear token-by-token in real time as Ollama generates them
* [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
* [ ] Tool calls should still work correctly with streaming enabled
* [ ] The user should see partial responses immediately, not wait for full completion
* [ ] Streaming should work for both text responses and responses that include tool calls
* [ ] Streaming interruptions should be handled gracefully
* [ ] The UI should auto-scroll to follow new tokens as they appear

## Out of Scope
* Configurable streaming speed/throttling
* Showing thinking/reasoning process separately (that could be a future enhancement)
* Streaming for tool outputs (tool outputs can remain non-streaming)

## Implementation Notes
### Backend (Rust)
* Change `stream: false` to `stream: true` in the Ollama request
* Parse the streaming response from Ollama (newline-delimited JSON, one object per line)
* Emit a `chat:token` event for each token received
* Handle both streaming text and tool call responses
* Use `reqwest` with streaming body support
* Consider using `futures::StreamExt` for async stream processing

### Frontend (TypeScript)
* Listen for `chat:token` events
* Append tokens to the current assistant message in real-time
* Update the UI state without full re-renders (performance)
* Maintain smooth auto-scroll as tokens arrive
* Handle the transition from streaming text to tool calls

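
The frontend notes above could be approached with two small pure functions. This is a minimal sketch, not the app's actual code: `ChatMessage`, `appendToken`, and `isNearBottom` are hypothetical names, and the 32-pixel threshold is an arbitrary assumption. The first function appends a token by replacing only the last message object; the second decides whether auto-scroll should follow, assuming it is desirable to stop following once the user has scrolled up.

```typescript
// Hypothetical message shape; the app's real state type may differ.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Returns a new array in which only the last assistant message is replaced.
// Earlier message objects keep their identity, so memoized views can skip them.
function appendToken(messages: readonly ChatMessage[], token: string): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last === undefined || last.role !== "assistant") {
    // First token of a reply: start a new assistant message.
    return [...messages, { role: "assistant", content: token }];
  }
  return [...messages.slice(0, -1), { ...last, content: last.content + token }];
}

// True when the viewport is within `thresholdPx` of the bottom, meaning the
// user has not scrolled up to read earlier output. The default is an assumption.
function isNearBottom(
  scrollTop: number,
  clientHeight: number,
  scrollHeight: number,
  thresholdPx = 32,
): boolean {
  return scrollHeight - (scrollTop + clientHeight) <= thresholdPx;
}

const before: ChatMessage[] = [{ role: "user", content: "Hi" }];
let state = appendToken(before, "Hello");
state = appendToken(state, " world");
const reply = state.map((m) => m.content).join(" | ");
console.log(reply); // Hi | Hello world
console.log(isNearBottom(950, 400, 1360), isNearBottom(200, 400, 1360)); // true false
```

A `chat:token` listener could call `appendToken` for each event and scroll to the bottom only when `isNearBottom` was true before the update. Batching several tokens per animation frame is another option if token throughput is high.
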
### Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming:
```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```

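
Network chunks do not necessarily end on line boundaries, so a parser for this format has to buffer partial lines. The sketch below illustrates that buffering in TypeScript (the backend would do the equivalent in Rust). `ChatChunk` and `NdjsonBuffer` are hypothetical names, and the chunk type covers only the fields shown in the example above.

```typescript
// Hypothetical chunk shape, limited to the fields shown in the example above.
interface ChatChunk {
  message?: { role: string; content: string };
  done: boolean;
}

// Buffers raw text and returns one parsed object per complete line.
class NdjsonBuffer {
  private pending = "";

  // Feed one network chunk; returns every object completed by it.
  push(text: string): ChatChunk[] {
    this.pending += text;
    const lines = this.pending.split("\n");
    // The last element is either "" or an incomplete line; keep it for later.
    this.pending = lines.pop() ?? "";
    const out: ChatChunk[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed.length === 0) continue; // skip blank lines
      out.push(JSON.parse(trimmed) as ChatChunk);
    }
    return out;
  }

  // Text left over after the stream ends; non-empty means a truncated line.
  remainder(): string {
    return this.pending;
  }
}

const buf = new NdjsonBuffer();
// The first chunk stops in the middle of a JSON object.
const first = buf.push('{"message":{"role":"assistant","content":"Hel');
const second = buf.push(
  'lo"},"done":false}\n{"message":{"role":"assistant","content":" world"},"done":true}\n',
);
const text = second.map((c) => c.message?.content ?? "").join("");
console.log(first.length, second.length, text); // 0 2 Hello world
```

This sketch works on already-decoded strings. When reading raw bytes, a multi-byte UTF-8 character can also be split across chunks, so the bytes should be decoded with a streaming-aware decoder before they reach the buffer.
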
### Challenges
* Parsing streaming JSON (each line is a separate JSON object)
* Maintaining state between streaming chunks
* Handling tool calls that interrupt streaming text
* Performance with high token throughput
* Error recovery if stream is interrupted

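
One possible way to address the state, tool-call, and interruption challenges together is a small accumulator that folds chunks into a single state value. This is a sketch under assumptions: the `tool_calls` field shape should be verified against the Ollama version in use, `read_file` is a made-up tool name, and `StreamState`, `applyChunk`, and `finish` are hypothetical names. A stream that ends without a `done: true` chunk is treated as interrupted.

```typescript
// Assumed chunk shape; verify `tool_calls` against the Ollama API in use.
interface ToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}
interface StreamChunk {
  message?: { role: string; content?: string; tool_calls?: ToolCall[] };
  done: boolean;
}

// Everything that must survive between chunks.
interface StreamState {
  text: string;
  toolCalls: ToolCall[];
  finished: boolean;
}

const initialState: StreamState = { text: "", toolCalls: [], finished: false };

// Folds one chunk into the state without mutating the previous state.
function applyChunk(state: StreamState, chunk: StreamChunk): StreamState {
  return {
    text: state.text + (chunk.message?.content ?? ""),
    toolCalls: [...state.toolCalls, ...(chunk.message?.tool_calls ?? [])],
    finished: state.finished || chunk.done,
  };
}

type Outcome =
  | { kind: "text"; text: string }
  | { kind: "tool_calls"; text: string; toolCalls: ToolCall[] }
  | { kind: "interrupted"; partialText: string };

// Called when the stream closes, whether normally or not.
function finish(state: StreamState): Outcome {
  if (!state.finished) return { kind: "interrupted", partialText: state.text };
  if (state.toolCalls.length > 0) {
    return { kind: "tool_calls", text: state.text, toolCalls: state.toolCalls };
  }
  return { kind: "text", text: state.text };
}

const chunks: StreamChunk[] = [
  { message: { role: "assistant", content: "Let me check." }, done: false },
  {
    message: {
      role: "assistant",
      content: "",
      // `read_file` is a made-up tool name for illustration.
      tool_calls: [{ function: { name: "read_file", arguments: { path: "README.md" } } }],
    },
    done: false,
  },
  { message: { role: "assistant", content: "" }, done: true },
];

const complete = finish(chunks.reduce(applyChunk, initialState));
const cutOff = finish(chunks.slice(0, 1).reduce(applyChunk, initialState));
console.log(complete.kind, cutOff.kind); // tool_calls interrupted
```

Keeping the partial text in the `interrupted` outcome lets the UI show what arrived before the failure instead of discarding it.
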
## Related Functional Specs
* Functional Spec: UI/UX (specifically mentions streaming as deferred)

## Dependencies
* Story 13 (interruption) should work with streaming
* May need `tokio-stream` or similar for stream utilities

## Testing Considerations
* Test with long responses to verify smooth streaming
* Test with responses that include tool calls
* Test interruption during streaming
* Test error cases (network issues, Ollama crashes)
* Test performance with different token rates