feat: Backend cancellation support for interrupting model responses

Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! for racing requests vs cancellation
- When cancelled, the in-flight HTTP request to Ollama is dropped and the command returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready
Dave
2025-12-27 15:36:58 +00:00
parent 909e8f1a2a
commit bb700ce870
12 changed files with 261 additions and 7 deletions


@@ -0,0 +1,94 @@
# Story: Stop Button to Cancel Model Response
## User Story
**As a** User
**I want** a Stop button to appear while the model is generating a response
**So that** I can explicitly cancel long-running or unwanted responses without waiting for completion.
## Acceptance Criteria
* [ ] A "Stop" button should appear in place of the Send button while the model is generating
* [ ] Clicking the Stop button should immediately cancel the ongoing generation
* [ ] The backend request to Ollama should be cancelled (not just ignored)
* [ ] Any partial response generated before stopping should remain visible in the chat
* [ ] The UI should return to normal state (Send button visible, input enabled) after stopping
* [ ] The input field should remain enabled during generation (user can type while waiting)
* [ ] Optional: Escape key should also trigger stop (keyboard shortcut)
* [ ] The stopped message should remain in history (not be removed)
## Out of Scope
* Automatic interruption by typing (too aggressive)
* Confirmation dialog before stopping (immediate action is preferred)
* Undo/redo functionality after stopping
* Streaming partial responses (that's Story 18)
## Implementation Notes
### Frontend (TypeScript)
* Replace Send button (↑) with Stop button (⬛ or "Stop") when `loading` is true
* On Stop click, call `invoke("cancel_chat")` and set `loading = false`
* Keep input field enabled during generation (no `disabled` attribute)
* Optional: Add Escape key handler to trigger stop when input is focused
* Visual design: Make Stop button clearly distinct from Send button
### Backend (Rust)
* ✅ Already implemented: `cancel_chat` command with tokio watch channel
* ✅ Already implemented: `tokio::select!` racing Ollama request vs cancellation
* When cancelled, backend returns early with "Chat cancelled by user" error
* Partial messages from completed tool calls remain in history
### UX Flow
1. User sends message → Send button changes to Stop button
2. Model starts generating → User sees "Thinking..." and Stop button
3. User clicks Stop → Backend cancels Ollama request
4. Partial response (if any) stays visible in chat
5. Stop button changes back to Send button
6. User can now send a new message
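The flow above can be sketched as a tiny state model. This is an illustrative sketch only: the state and event names (`ChatUiState`, `nextState`) are hypothetical, not the app's actual code.

```typescript
// Hypothetical sketch of the Send/Stop toggle as a two-state model.
type ChatUiState = "idle" | "generating";
type ChatUiEvent = "send" | "stop" | "done";

function nextState(state: ChatUiState, event: ChatUiEvent): ChatUiState {
  switch (event) {
    case "send":
      // Sending starts generation; ignore "send" while already generating
      return state === "idle" ? "generating" : state;
    case "stop":
    case "done":
      // Both an explicit Stop and normal completion return to idle
      return "idle";
  }
}
```

Note that "stop" and "done" converge on the same idle state, which is why the Stop button can simply swap back to Send regardless of how generation ended.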
### Standard Pattern (ChatGPT/Claude style)
* Stop button is the standard pattern used by ChatGPT, Claude, and other chat UIs
* No auto-interrupt on typing (too confusing - messages would disappear)
* Explicit user action required (button click or Escape key)
* Partial responses remain visible (not removed from history)
## Related Functional Specs
* Functional Spec: UI/UX
* Related to Story 18 (Streaming) - Stop button should work with streaming too
## Technical Details
### Backend Cancellation (Already Implemented)
```rust
use tokio::sync::watch;

// In SessionState: a watch channel pair used to signal cancellation
pub cancel_tx: watch::Sender<bool>,
pub cancel_rx: watch::Receiver<bool>,

// In the chat command: race the Ollama request against cancellation
let mut cancel_rx = state.cancel_rx.clone();
tokio::select! {
    result = chat_future => { /* normal completion */ }
    _ = cancel_rx.changed() => {
        // Dropping chat_future aborts the in-flight HTTP request
        return Err("Chat cancelled by user".to_string());
    }
}
```
### Frontend Integration
```tsx
// Cancel the in-flight request and return the UI to its idle state.
// Defined before the JSX so the onClick handler can reference it.
const cancelGeneration = () => {
  invoke("cancel_chat").catch(console.error);
  setLoading(false);
};

<button
  onClick={loading ? cancelGeneration : sendMessage}
  disabled={!input.trim() && !loading}
>
  {loading ? "⬛ Stop" : "↑"}
</button>
```
## Testing Considerations
* Test with long multi-turn generations (tool use)
* Test that partial responses remain visible
* Test that new messages can be sent after stopping
* Test Escape key shortcut (if implemented)
* Test that backend actually cancels (check Ollama logs/CPU)


@@ -0,0 +1,66 @@
# Story: Token-by-Token Streaming Responses
## User Story
**As a** User
**I want** to see the model's response appear token-by-token as it generates
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.
## Acceptance Criteria
* [ ] Model responses should appear token-by-token in real-time as Ollama generates them
* [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
* [ ] Tool calls should still work correctly with streaming enabled
* [ ] The user should see partial responses immediately, not wait for full completion
* [ ] Streaming should work for both text responses and responses that include tool calls
* [ ] Error handling should gracefully handle streaming interruptions
* [ ] The UI should auto-scroll to follow new tokens as they appear
## Out of Scope
* Configurable streaming speed/throttling
* Showing thinking/reasoning process separately (that could be a future enhancement)
* Streaming for tool outputs (tool outputs can remain non-streaming)
## Implementation Notes
### Backend (Rust)
* Change `stream: false` to `stream: true` in Ollama request
* Parse streaming JSON response from Ollama (newline-delimited JSON)
* Emit `chat:token` events for each token received
* Handle both streaming text and tool call responses
* Use `reqwest` with streaming body support
* Consider using `futures::StreamExt` for async stream processing
### Frontend (TypeScript)
* Listen for `chat:token` events
* Append tokens to the current assistant message in real-time
* Update the UI state without full re-renders (performance)
* Maintain smooth auto-scroll as tokens arrive
* Handle the transition from streaming text to tool calls
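One way to keep the token-append step cheap is a pure helper that replaces only the last message. A minimal sketch, assuming a `{ role, content }` message shape; the name `appendToken` is illustrative, not existing code.

```typescript
// Hypothetical sketch: append a streamed token to the last assistant message.
interface ChatMessage {
  role: string;
  content: string;
}

function appendToken(messages: ChatMessage[], token: string): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last && last.role === "assistant") {
    // Return a new array with only the last message replaced, so a React
    // setState call sees a fresh reference without rebuilding every message
    return [
      ...messages.slice(0, -1),
      { ...last, content: last.content + token },
    ];
  }
  // First token of a response: start a new assistant message
  return [...messages, { role: "assistant", content: token }];
}
```

Because earlier messages keep their identity, a memoized message list only re-renders the final entry as tokens arrive.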
### Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming:
```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```
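A minimal sketch of consuming this format: split on newlines, parse each line as its own JSON object, and accumulate `message.content` until `done` is true. The field names match the sample lines above; the function name `collectStream` is illustrative.

```typescript
// Hypothetical sketch: accumulate Ollama's newline-delimited JSON chunks.
interface StreamChunk {
  message: { role: string; content: string };
  done: boolean;
}

function collectStream(ndjson: string): string {
  let full = "";
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const chunk = JSON.parse(line) as StreamChunk;
    full += chunk.message.content;
    if (chunk.done) break; // final chunk: stream is complete
  }
  return full;
}
```

In the real backend the lines arrive incrementally over HTTP, so a buffer must hold any partial trailing line until its newline arrives before parsing.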
### Challenges
* Parsing streaming JSON (each line is a separate JSON object)
* Maintaining state between streaming chunks
* Handling tool calls that interrupt streaming text
* Performance with high token throughput
* Error recovery if stream is interrupted
## Related Functional Specs
* Functional Spec: UI/UX (specifically mentions streaming as deferred)
## Dependencies
* Story 13 (interruption) should work with streaming
* May need `tokio-stream` or similar for stream utilities
## Testing Considerations
* Test with long responses to verify smooth streaming
* Test with responses that include tool calls
* Test interruption during streaming
* Test error cases (network issues, Ollama crashes)
* Test performance with different token rates