feat: Backend cancellation support for interrupting model responses

Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! for racing requests vs cancellation
- When cancelled, HTTP request to Ollama is dropped and returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready
2025-12-27 15:36:58 +00:00
parent 909e8f1a2a
commit bb700ce870
12 changed files with 261 additions and 7 deletions
--- a/.living_spec/stories/18_streaming_responses.md
+++ b/.living_spec/stories/18_streaming_responses.md
@@ -0,0 +1,66 @@
+# Story: Token-by-Token Streaming Responses
+
+## User Story
+**As a** User
+**I want** to see the model's response appear token-by-token as it generates
+**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.
+
+## Acceptance Criteria
+*   [ ] Model responses should appear token-by-token in real-time as Ollama generates them
+*   [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
+*   [ ] Tool calls should still work correctly with streaming enabled
+*   [ ] The user should see partial responses immediately, not wait for full completion
+*   [ ] Streaming should work for both text responses and responses that include tool calls
+*   [ ] Error handling should gracefully handle streaming interruptions
+*   [ ] The UI should auto-scroll to follow new tokens as they appear
+
+## Out of Scope
+*   Configurable streaming speed/throttling
+*   Showing thinking/reasoning process separately (that could be a future enhancement)
+*   Streaming for tool outputs (tool outputs can remain non-streaming)
+
+## Implementation Notes
+
+### Backend (Rust)
+*   Change `stream: false` to `stream: true` in Ollama request
+*   Parse streaming JSON response from Ollama (newline-delimited JSON)
+*   Emit `chat:token` events for each token received
+*   Handle both streaming text and tool call responses
+*   Use `reqwest` with streaming body support
+*   Consider using `futures::StreamExt` for async stream processing
+
+### Frontend (TypeScript)
+*   Listen for `chat:token` events
+*   Append tokens to the current assistant message in real-time
+*   Update the UI state without full re-renders (performance)
+*   Maintain smooth auto-scroll as tokens arrive
+*   Handle the transition from streaming text to tool calls
+
+### Ollama Streaming Format
+Ollama returns newline-delimited JSON when streaming:
+```json
+{"message":{"role":"assistant","content":"Hello"},"done":false}
+{"message":{"role":"assistant","content":" world"},"done":false}
+{"message":{"role":"assistant","content":"!"},"done":true}
+```
+
+### Challenges
+*   Parsing streaming JSON (each line is a separate JSON object)
+*   Maintaining state between streaming chunks
+*   Handling tool calls that interrupt streaming text
+*   Performance with high token throughput
+*   Error recovery if stream is interrupted
+
+## Related Functional Specs
+*   Functional Spec: UI/UX (specifically mentions streaming as deferred)
+
+## Dependencies
+*   Story 13 (interruption) should work with streaming
+*   May need `tokio-stream` or similar for stream utilities
+
+## Testing Considerations
+*   Test with long responses to verify smooth streaming
+*   Test with responses that include tool calls
+*   Test interruption during streaming
+*   Test error cases (network issues, Ollama crashes)
+*   Test performance with different token rates
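
Note on the story's "Ollama Streaming Format" and "Maintaining state between streaming chunks" items: HTTP body chunks are not guaranteed to end on line boundaries, so one JSON object may arrive split across several chunks. Below is a minimal std-only sketch of the framing step, not code from this commit. `NdjsonBuffer`, `push`, and `finish` are hypothetical names. The sketch assumes each returned line would then go to a JSON parser (for example `serde_json::from_str`) before a `chat:token` event is emitted.

```rust
/// Hypothetical helper (not part of this commit): accumulates raw bytes from
/// a streaming response body and yields complete newline-delimited records.
#[derive(Default)]
pub struct NdjsonBuffer {
    /// Bytes received so far that do not yet end in a newline.
    pending: Vec<u8>,
}

impl NdjsonBuffer {
    pub fn new() -> Self {
        Self::default()
    }

    /// Feed one network chunk and return every line it completed.
    /// Splitting on the byte b'\n' is safe for UTF-8 input because that byte
    /// never occurs inside a multi-byte sequence.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.pending.extend_from_slice(chunk);
        let mut lines = Vec::new();
        while let Some(pos) = self.pending.iter().position(|&b| b == b'\n') {
            // Remove the line, including its trailing newline, from the buffer.
            let raw: Vec<u8> = self.pending.drain(..=pos).collect();
            let text = String::from_utf8_lossy(&raw[..raw.len() - 1]);
            let trimmed = text.trim();
            // Skip blank lines.
            if !trimmed.is_empty() {
                lines.push(trimmed.to_string());
            }
        }
        lines
    }

    /// Call when the stream ends: returns leftover text that never received
    /// a terminating newline, if there is any.
    pub fn finish(self) -> Option<String> {
        let text = String::from_utf8_lossy(&self.pending);
        let trimmed = text.trim();
        if trimmed.is_empty() {
            None
        } else {
            Some(trimmed.to_string())
        }
    }
}
```

A caller would create one buffer per request, call `push` for every chunk read from the response body, and handle each returned line as one JSON object. Keeping the buffer separate from the HTTP client lets the split-chunk case be unit tested without a running Ollama instance.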