feat: Backend cancellation support for interrupting model responses

Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! to race the request against cancellation
- When cancelled, the HTTP request to Ollama is dropped and the command returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed the auto-interrupt-on-typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready
Dave
2025-12-27 15:36:58 +00:00
parent 909e8f1a2a
commit bb700ce870
12 changed files with 261 additions and 7 deletions


@@ -0,0 +1,66 @@
# Story: Token-by-Token Streaming Responses
## User Story
**As a** User
**I want** to see the model's response appear token-by-token as it generates
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.
## Acceptance Criteria
* [ ] Model responses should appear token-by-token in real-time as Ollama generates them
* [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
* [ ] Tool calls should still work correctly with streaming enabled
* [ ] The user should see partial responses immediately, not wait for full completion
* [ ] Streaming should work for both text responses and responses that include tool calls
* [ ] Error handling should gracefully handle streaming interruptions
* [ ] The UI should auto-scroll to follow new tokens as they appear
## Out of Scope
* Configurable streaming speed/throttling
* Showing thinking/reasoning process separately (that could be a future enhancement)
* Streaming for tool outputs (tool outputs can remain non-streaming)
## Implementation Notes
### Backend (Rust)
* Change `stream: false` to `stream: true` in the Ollama request
* Parse streaming JSON response from Ollama (newline-delimited JSON)
* Emit `chat:token` events for each token received
* Handle both streaming text and tool call responses
* Use `reqwest` with streaming body support
* Consider using `futures::StreamExt` for async stream processing
### Frontend (TypeScript)
* Listen for `chat:token` events
* Append tokens to the current assistant message in real-time
* Update the UI state without full re-renders (performance)
* Maintain smooth auto-scroll as tokens arrive
* Handle the transition from streaming text to tool calls
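
The token-append step above can be sketched as a pure function, so the update logic stays independent of whichever event API delivers `chat:token`. This is only an illustration: `ChatMessage`, `appendToken`, and `conversation` are hypothetical names, and wiring the function to the real event listener and UI state is left out.

```typescript
// Hypothetical message shape; the real frontend type may differ.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Returns a new list with `token` appended to the trailing assistant message,
// or starts a new assistant message if none is in progress.
// The input array is never mutated, so state libraries can detect the change.
function appendToken(
  messages: readonly ChatMessage[],
  token: string,
): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "assistant") {
    return [...messages.slice(0, -1), { ...last, content: last.content + token }];
  }
  return [...messages, { role: "assistant", content: token }];
}

// Usage: simulate three `chat:token` payloads arriving after a user message.
let conversation: ChatMessage[] = [{ role: "user", content: "Hi" }];
for (const token of ["Hello", " world", "!"]) {
  conversation = appendToken(conversation, token);
}
console.log(conversation[conversation.length - 1]?.content); // prints "Hello world!"
```

Only the last message object is replaced on each token, which keeps earlier messages referentially stable and helps avoid full re-renders.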
### Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming, one object per line (additional fields such as `model` and `created_at` are omitted here):
```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```
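
A network chunk is not guaranteed to end on a line boundary, so a parser has to hold back any unfinished line until the rest arrives. The following is a minimal sketch of that framing logic, written in TypeScript for illustration only; the backend would do the equivalent in Rust. `ChatChunk` and `NdjsonBuffer` are hypothetical names, and the chunk shape is taken from the sample above.

```typescript
// Shape of one streamed line, based on the sample above.
interface ChatChunk {
  message: { role: string; content: string };
  done: boolean;
}

// Hypothetical helper: accumulates raw text and yields complete JSON lines.
class NdjsonBuffer {
  private pending = "";

  // Feed one network chunk; returns every object whose line is now complete.
  push(chunk: string): ChatChunk[] {
    this.pending += chunk;
    const lines = this.pending.split("\n");
    // The last element is either "" or an unfinished line; keep it for later.
    this.pending = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as ChatChunk);
  }

  // Text still waiting for its terminating newline.
  get remainder(): string {
    return this.pending;
  }
}

// Usage: the first JSON object is split across two network chunks.
const ndjson = new NdjsonBuffer();
const firstBatch = ndjson.push('{"message":{"role":"assistant","content":"Hel');
const secondBatch = ndjson.push(
  'lo"},"done":false}\n{"message":{"role":"assistant","content":" world"},"done":false}\n',
);
console.log(firstBatch.length); // prints 0
console.log(secondBatch.map((c) => c.message.content)); // prints [ 'Hello', ' world' ]
```

`JSON.parse` throws on a malformed line; a real implementation would need to decide whether to skip such a line or abort the stream.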
### Challenges
* Parsing streaming JSON (each line is a separate JSON object)
* Maintaining state between streaming chunks
* Handling tool calls that interrupt streaming text
* Performance with high token throughput
* Error recovery if stream is interrupted
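
One way to approach the interrupted-stream case is to treat a stream as complete only if a chunk with `done: true` was seen, and to keep whatever partial text arrived otherwise. The sketch below assumes chunks have already been parsed into objects; `ChatChunk`, `StreamResult`, and `collectStream` are hypothetical names, not part of the existing code.

```typescript
// Shape of one parsed streaming chunk, based on the Ollama sample format.
interface ChatChunk {
  message: { role: string; content: string };
  done: boolean;
}

interface StreamResult {
  content: string;
  // False when the stream ended without a `done: true` chunk.
  complete: boolean;
}

// Concatenates chunk contents and reports whether the stream finished cleanly.
function collectStream(chunks: readonly ChatChunk[]): StreamResult {
  let content = "";
  for (const chunk of chunks) {
    content += chunk.message.content;
    if (chunk.done) {
      return { content, complete: true };
    }
  }
  return { content, complete: false };
}

// Usage: the connection dropped after two chunks, before `done: true`.
const interrupted = collectStream([
  { message: { role: "assistant", content: "Hello" }, done: false },
  { message: { role: "assistant", content: " world" }, done: false },
]);
console.log(interrupted); // prints { content: 'Hello world', complete: false }
```

With a flag like `complete`, the UI could keep the partial text visible and mark the message as interrupted instead of discarding it.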
## Related Functional Specs
* Functional Spec: UI/UX (specifically mentions streaming as deferred)
## Dependencies
* Story 13 (interruption) should work with streaming
* May need `tokio-stream` or similar for stream utilities
## Testing Considerations
* Test with long responses to verify smooth streaming
* Test with responses that include tool calls
* Test interruption during streaming
* Test error cases (network issues, Ollama crashes)
* Test performance with different token rates