feat: Backend cancellation support for interrupting model responses
Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! to race the request against cancellation
- When cancelled, the HTTP request to Ollama is dropped and the command returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented; the backend is ready
.living_spec/stories/18_streaming_responses.md
# Story: Token-by-Token Streaming Responses

## User Story
**As a** User
**I want** to see the model's response appear token-by-token as it generates
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.

## Acceptance Criteria
* [ ] Model responses should appear token-by-token in real time as Ollama generates them
* [ ] Streaming should feel smooth and responsive (like ChatGPT's typing effect)
* [ ] Tool calls should still work correctly with streaming enabled
* [ ] The user should see partial responses immediately instead of waiting for full completion
* [ ] Streaming should work for both plain text responses and responses that include tool calls
* [ ] Streaming interruptions (e.g. a dropped connection) should be handled gracefully
* [ ] The UI should auto-scroll to follow new tokens as they appear

## Out of Scope
* Configurable streaming speed/throttling
* Showing thinking/reasoning process separately (that could be a future enhancement)
* Streaming for tool outputs (tool outputs can remain non-streaming)

## Implementation Notes

### Backend (Rust)
* Change `stream: false` to `stream: true` in the Ollama request
* Parse the streaming response from Ollama (newline-delimited JSON)
* Emit a `chat:token` event for each token received
* Handle both streaming text and tool call responses
* Use `reqwest` with streaming body support (e.g. `Response::bytes_stream()`, which requires reqwest's `stream` feature)
* Consider using `futures::StreamExt` for async stream processing
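This is a minimal sketch of the per-token forwarding step. It assumes each NDJSON line has already been parsed into a hypothetical `StreamChunk`, probably with serde_json in practice. A plain iterator stands in for the async `reqwest` byte stream, and an `emit` closure stands in for emitting the `chat:token` event. The function forwards each non-empty delta, stops at the `done` chunk, and returns the full text so it can be stored in the conversation history.

```rust
/// One already-parsed line of Ollama's streaming chat output.
/// Hypothetical type: deserialization from JSON is omitted here.
#[derive(Debug, Clone)]
pub struct StreamChunk {
    pub content: String,
    pub done: bool,
}

/// Forwards each non-empty content delta to `emit` (a stand-in for emitting
/// a `chat:token` event) and returns the accumulated assistant message.
/// Anything after the chunk marked `done` is ignored.
pub fn forward_tokens<I, F>(chunks: I, mut emit: F) -> String
where
    I: IntoIterator<Item = StreamChunk>,
    F: FnMut(&str),
{
    let mut full = String::new();
    for chunk in chunks {
        if !chunk.content.is_empty() {
            emit(&chunk.content);
            full.push_str(&chunk.content);
        }
        if chunk.done {
            break;
        }
    }
    full
}
```

In the real async command, the `for` loop would likely become `while let Some(item) = stream.next().await`, with `futures::StreamExt` in scope.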

### Frontend (TypeScript)
* Listen for `chat:token` events
* Append tokens to the current assistant message in real time
* Update the UI state without full re-renders (performance)
* Maintain smooth auto-scroll as tokens arrive
* Handle the transition from streaming text to tool calls

### Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming (abbreviated below; real chunks also carry fields such as `model` and `created_at`):
```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```
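Each line is a complete JSON object, but network chunks need not align with line boundaries, so the backend has to buffer partial lines. This is a minimal sketch. It assumes the raw body arrives as byte slices, for example from `bytes_stream()`. `LineBuffer` is a hypothetical helper, and each line it returns would then go to a JSON parser.

Buffering bytes rather than decoded strings has a second benefit. `\n` never appears inside a multi-byte UTF-8 sequence, so a character split across two chunks is reassembled before it is decoded.

```rust
/// Accumulates raw body bytes and yields complete NDJSON lines.
/// A trailing partial line is kept until the rest of it arrives.
#[derive(Default)]
pub struct LineBuffer {
    pending: Vec<u8>,
}

impl LineBuffer {
    /// Appends `chunk` and returns every complete line (newline stripped).
    /// Blank lines are skipped; bytes after the last newline stay buffered.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);
        let mut lines = Vec::new();
        while let Some(pos) = self.pending.iter().position(|&b| b == b'\n') {
            // Remove the line including its '\n', then drop the '\n'.
            let mut line: Vec<u8> = self.pending.drain(..=pos).collect();
            line.pop();
            // Tolerate CRLF line endings as well.
            if line.last() == Some(&b'\r') {
                line.pop();
            }
            if !line.is_empty() {
                lines.push(String::from_utf8_lossy(&line).into_owned());
            }
        }
        lines
    }

    /// Returns leftover bytes once the body has ended (a final line with
    /// no trailing newline), or `None` if nothing is buffered.
    pub fn finish(&mut self) -> Option<String> {
        if self.pending.is_empty() {
            None
        } else {
            let rest = std::mem::take(&mut self.pending);
            Some(String::from_utf8_lossy(&rest).into_owned())
        }
    }
}
```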

### Challenges
* Parsing streaming JSON (each line is a separate JSON object, and a network chunk may end mid-line)
* Maintaining state between streaming chunks
* Handling tool calls that interrupt streaming text
* Performance with high token throughput
* Error recovery if the stream is interrupted
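One way to handle the state and error-recovery challenges is a small per-reply state object. It remembers the accumulated text and whether the `done` chunk was seen. `StreamState` and `StreamInterrupted` below are hypothetical names in an illustrative sketch. If the body ends without `"done": true`, `finish` returns an error that still carries the partial text. The UI can then keep what was already shown and mark the reply as incomplete.

```rust
use std::fmt;

/// Returned when the body ends before the chunk marked `done` arrived.
#[derive(Debug, PartialEq)]
pub struct StreamInterrupted {
    /// Text received before the interruption.
    pub partial: String,
}

impl fmt::Display for StreamInterrupted {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "stream ended after {} bytes without a done chunk", self.partial.len())
    }
}

impl std::error::Error for StreamInterrupted {}

/// State carried across the chunks of one streamed reply.
#[derive(Debug, Default)]
pub struct StreamState {
    content: String,
    done: bool,
}

impl StreamState {
    /// Records one parsed chunk; returns the delta to forward, if any.
    /// Chunks arriving after `done` are ignored.
    pub fn apply(&mut self, delta: &str, done: bool) -> Option<String> {
        if self.done {
            return None;
        }
        self.done = done;
        if delta.is_empty() {
            return None;
        }
        self.content.push_str(delta);
        Some(delta.to_string())
    }

    /// Call when the body ends or a read error occurs.
    pub fn finish(self) -> Result<String, StreamInterrupted> {
        if self.done {
            Ok(self.content)
        } else {
            Err(StreamInterrupted { partial: self.content })
        }
    }
}
```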

## Related Functional Specs
* Functional Spec: UI/UX (specifically mentions streaming as deferred)

## Dependencies
* Story 13 (interruption via the Stop button) must keep working once streaming is enabled
* May need `tokio-stream` or similar for stream utilities
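The commit implements cancellation by racing the whole request against a tokio watch channel with `tokio::select!`. With streaming, the same check would need to run between chunks. One option is to select between the next stream item and the watch receiver's `changed()`.

This std-only sketch shows the idea, with an `AtomicBool` standing in for the watch channel. `consume_until_cancelled` and `StreamOutcome` are hypothetical names. Returning early drops the chunk iterator. In the real code that would correspond to the commit's approach of dropping the HTTP request to Ollama.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Result of consuming a streamed reply.
#[derive(Debug, PartialEq)]
pub enum StreamOutcome {
    Completed(String),
    /// The user pressed Stop; holds the text accepted so far.
    Cancelled(String),
}

/// Appends content deltas until the stream ends or `cancel` is set.
/// The flag is checked as each chunk arrives, so cancellation takes
/// effect at the next chunk boundary.
pub fn consume_until_cancelled<I>(chunks: I, cancel: &AtomicBool) -> StreamOutcome
where
    I: IntoIterator<Item = String>,
{
    let mut text = String::new();
    for chunk in chunks {
        if cancel.load(Ordering::Acquire) {
            // Returning drops the iterator (stand-in for the HTTP response).
            return StreamOutcome::Cancelled(text);
        }
        text.push_str(&chunk);
    }
    StreamOutcome::Completed(text)
}
```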

## Testing Considerations
* Test with long responses to verify smooth streaming
* Test with responses that include tool calls
* Test interruption during streaming
* Test error cases (network issues, Ollama crashes)
* Test performance with different token rates