Dave bb700ce870 feat: Backend cancellation support for interrupting model responses

Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! for racing requests vs cancellation
- When cancelled, HTTP request to Ollama is dropped and returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready

2025-12-27 15:36:58 +00:00

2.8 KiB

Story: Token-by-Token Streaming Responses

User Story

As a User, I want to see the model's response appear token-by-token as it is generated, so that I get immediate feedback and can see the model is working rather than waiting for the entire response to complete.

Acceptance Criteria

Model responses should appear token-by-token in real-time as Ollama generates them
The streaming should feel smooth and responsive (like ChatGPT's typing effect)
Tool calls should still work correctly with streaming enabled
The user should see partial responses immediately, not wait for full completion
Streaming should work for both text responses and responses that include tool calls
Error handling should gracefully handle streaming interruptions
The UI should auto-scroll to follow new tokens as they appear

Out of Scope

Configurable streaming speed/throttling
Showing thinking/reasoning process separately (that could be a future enhancement)
Streaming for tool outputs (tool outputs can remain non-streaming)

Implementation Notes

Backend (Rust)

Change stream: false to stream: true in Ollama request
Parse streaming JSON response from Ollama (newline-delimited JSON)
Emit chat:token events for each token received
Handle both streaming text and tool call responses
Use reqwest with streaming body support
Consider using futures::StreamExt for async stream processing

Frontend (TypeScript)

Listen for chat:token events
Append tokens to the current assistant message in real-time
Update the UI state without full re-renders (performance)
Maintain smooth auto-scroll as tokens arrive
Handle the transition from streaming text to tool calls
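
The "append tokens to the current assistant message" step can be sketched as a small pure function, independent of any UI framework. The `ChatMessage` shape and the `appendToken` name are assumptions made for illustration; they are not taken from the project. The handler that receives `chat:token` events would call it once per token.

```typescript
// Hypothetical message shape; the real application state may differ.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Returns a new array in which the token is appended to the trailing
// assistant message. If the last message is not from the assistant,
// a new assistant message is started instead. The input is not mutated.
function appendToken(
  messages: readonly ChatMessage[],
  token: string,
): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "assistant") {
    return [
      ...messages.slice(0, -1),
      { ...last, content: last.content + token },
    ];
  }
  return [...messages, { role: "assistant", content: token }];
}

// Simulate three chat:token events arriving after a user message.
const initialLog: ChatMessage[] = [{ role: "user", content: "Hi" }];
let chatLog: ChatMessage[] = initialLog;
for (const token of ["Hello", " world", "!"]) {
  chatLog = appendToken(chatLog, token);
}
console.log(chatLog.map((m) => m.content)); // prints [ 'Hi', 'Hello world!' ]
```

Copying the array on every token is the simplest approach. With high token throughput it may be worth batching several tokens into one state update, which this sketch does not attempt.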

Ollama Streaming Format

Ollama returns newline-delimited JSON when streaming:

{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
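
As a rough illustration of consuming this format, the sketch below parses each line as its own JSON object and concatenates the `content` fields until a chunk reports `done: true`. It is written in TypeScript for brevity even though the parsing is planned for the Rust backend. The `OllamaChunk` interface covers only the fields shown in the sample above, and the function names are hypothetical.

```typescript
// Only the fields visible in the sample above; real responses may carry more.
interface OllamaChunk {
  message: { role: string; content: string };
  done: boolean;
}

// Parse one newline-delimited JSON line. Blank lines yield null.
function parseChunkLine(line: string): OllamaChunk | null {
  const trimmed = line.trim();
  if (trimmed === "") return null;
  return JSON.parse(trimmed) as OllamaChunk;
}

// Concatenate message content from every chunk, stopping after the
// first chunk marked done.
function collectContent(ndjson: string): string {
  let text = "";
  for (const line of ndjson.split("\n")) {
    const chunk = parseChunkLine(line);
    if (chunk === null) continue;
    text += chunk.message.content;
    if (chunk.done) break;
  }
  return text;
}

const sampleStream = [
  '{"message":{"role":"assistant","content":"Hello"},"done":false}',
  '{"message":{"role":"assistant","content":" world"},"done":false}',
  '{"message":{"role":"assistant","content":"!"},"done":true}',
].join("\n");

console.log(collectContent(sampleStream)); // prints "Hello world!"
```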

Challenges

Parsing streaming JSON (each line is a separate JSON object)
Maintaining state between streaming chunks
Handling tool calls that interrupt streaming text
Performance with high token throughput
Error recovery if stream is interrupted
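
The first two challenges interact: a network chunk can end in the middle of a JSON line, so the parser has to carry the unfinished tail over to the next chunk. One possible way to do that is sketched below in TypeScript; the same idea applies to the Rust side. `NdjsonLineBuffer` is a hypothetical helper, not an existing library class.

```typescript
// Hypothetical helper: accumulates raw text chunks and returns only
// complete lines, keeping any unfinished tail for the next chunk.
class NdjsonLineBuffer {
  private pending = "";

  // Feed one raw chunk; returns every non-blank line it completed.
  push(chunk: string): string[] {
    this.pending += chunk;
    const parts = this.pending.split("\n");
    // The last element is an unfinished line, or "" if the chunk
    // ended exactly on a newline.
    this.pending = parts.pop() ?? "";
    return parts.filter((line) => line.trim() !== "");
  }

  // Call once at end of stream to recover a final line that had no
  // trailing newline.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest === "" ? [] : [rest];
  }
}

// Two JSON lines delivered in three chunks that split mid-object.
const demoBuffer = new NdjsonLineBuffer();
const demoLines: string[] = [
  ...demoBuffer.push('{"message":{"role":"assistant","con'),
  ...demoBuffer.push('tent":"Hello"},"done":false}\n{"message":'),
  ...demoBuffer.push('{"role":"assistant","content":"!"},"done":true}'),
  ...demoBuffer.flush(),
];
const demoText = demoLines
  .map((line) => JSON.parse(line).message.content as string)
  .join("");
console.log(demoText); // prints "Hello!"
```

If the transport delivers bytes rather than strings, decoding with `TextDecoder` and `{ stream: true }` before buffering keeps multi-byte characters intact across chunk boundaries.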

Related Functional Spec: UI/UX (explicitly lists streaming as deferred)

Dependencies

Story 13 (interruption) should work with streaming
May need tokio-stream or similar for stream utilities

Testing Considerations

Test with long responses to verify smooth streaming
Test with responses that include tool calls
Test interruption during streaming
Test error cases (network issues, Ollama crashes)
Test performance with different token rates
