Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! for racing requests vs cancellation
- When cancelled, HTTP request to Ollama is dropped and returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready
Story: Token-by-Token Streaming Responses
User Story
As a User
I want to see the model's response appear token-by-token as it is generated
So that I get immediate feedback and can see that the model is working, rather than waiting for the entire response to complete.
Acceptance Criteria
- Model responses should appear token-by-token in real-time as Ollama generates them
- The streaming should feel smooth and responsive (like ChatGPT's typing effect)
- Tool calls should still work correctly with streaming enabled
- The user should see partial responses immediately, not wait for full completion
- Streaming should work for both text responses and responses that include tool calls
- Error handling should gracefully handle streaming interruptions
- The UI should auto-scroll to follow new tokens as they appear
Out of Scope
- Configurable streaming speed/throttling
- Showing thinking/reasoning process separately (that could be a future enhancement)
- Streaming for tool outputs (tool outputs can remain non-streaming)
Implementation Notes
Backend (Rust)
- Change `stream: false` to `stream: true` in the Ollama request
- Parse streaming JSON response from Ollama (newline-delimited JSON)
- Emit `chat:token` events for each token received
- Handle both streaming text and tool call responses
- Use `reqwest` with streaming body support
- Consider using `futures::StreamExt` for async stream processing
Frontend (TypeScript)
- Listen for `chat:token` events
- Append tokens to the current assistant message in real-time
- Update the UI state without full re-renders (performance)
- Maintain smooth auto-scroll as tokens arrive
- Handle the transition from streaming text to tool calls
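The append step in the list above can be sketched as a pure function, independent of any UI framework. This is a minimal sketch, not the project's actual state model: the `ChatMessage` shape and the `appendToken` name are hypothetical, and it assumes each `chat:token` event carries a plain string. Wiring it to the real event listener is left out.

```typescript
// Hypothetical message shape; the real application state may differ.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Returns a new message list with `token` appended to the trailing assistant
// message. If the last message is not from the assistant, a new assistant
// message is started. Earlier messages keep their object identity, so a UI
// framework that compares by reference can skip re-rendering them.
function appendToken(messages: readonly ChatMessage[], token: string): ChatMessage[] {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "assistant") {
    return [...messages.slice(0, -1), { ...last, content: last.content + token }];
  }
  return [...messages, { role: "assistant", content: token }];
}

// Usage: feed tokens in arrival order.
let streamHistory: ChatMessage[] = [{ role: "user", content: "Hi" }];
for (const token of ["Hello", " world", "!"]) {
  streamHistory = appendToken(streamHistory, token);
}
console.log(streamHistory[1]?.content); // prints "Hello world!"
```

Only the last message object is replaced on each token, which is one way to approach the "no full re-renders" goal. Whether it is sufficient depends on how the message list is rendered.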
Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming:
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
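Because a network read can end in the middle of a line, a parser for this format has to hold back the trailing partial line until the rest arrives. The sketch below illustrates that buffering in TypeScript for brevity; the backend would do the equivalent in Rust. The `StreamChunk` and `NdjsonBuffer` names are hypothetical, and the interface covers only the fields shown in the example above.

```typescript
// One streamed line, limited to the fields shown in the example above.
interface StreamChunk {
  message?: { role: string; content: string };
  done: boolean;
}

// Buffers raw text and returns one parsed object per complete line.
class NdjsonBuffer {
  private pending = "";

  push(text: string): StreamChunk[] {
    this.pending += text;
    const lines = this.pending.split("\n");
    // The last element is "" if the text ended on a newline, or a partial
    // line otherwise. Either way, keep it for the next call.
    this.pending = lines.pop() ?? "";
    return lines
      .map((line) => line.trim())
      .filter((line) => line.length > 0)
      .map((line) => JSON.parse(line) as StreamChunk);
  }
}

// Usage: the first read ends mid-line, so nothing is returned until the
// second read completes it.
const demoBuffer = new NdjsonBuffer();
const demoFirst = demoBuffer.push('{"message":{"role":"assistant","content":"Hel');
const demoSecond = demoBuffer.push(
  'lo"},"done":false}\n{"message":{"role":"assistant","content":" world"},"done":false}\n',
);
console.log(demoFirst.length, demoSecond.length); // prints "0 2"
```

This sketch lets `JSON.parse` throw on a malformed line. A real implementation would need to decide whether to skip such a line or abort the stream.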
Challenges
- Parsing streaming JSON (each line is a separate JSON object)
- Maintaining state between streaming chunks
- Handling tool calls that interrupt streaming text
- Performance with high token throughput
- Error recovery if stream is interrupted
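Several of these challenges come down to folding each parsed chunk into a running state for the turn, then checking at stream close whether a `done: true` chunk was ever seen. The following is a minimal sketch under stated assumptions: the `tool_calls` field shape follows Ollama's chat API as I understand it and does not appear in this story, and the names `TurnState`, `applyChunk`, `finishTurn`, and the `read_file` tool are hypothetical.

```typescript
// Assumed tool call shape; verify against the Ollama version in use.
interface ToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}

interface AssistantChunk {
  message?: { role: string; content: string; tool_calls?: ToolCall[] };
  done: boolean;
}

// Everything accumulated so far for the current assistant turn.
interface TurnState {
  text: string;
  toolCalls: ToolCall[];
  done: boolean;
}

const emptyTurn: TurnState = { text: "", toolCalls: [], done: false };

// Folds one parsed chunk into the running state without mutating it.
function applyChunk(state: TurnState, chunk: AssistantChunk): TurnState {
  return {
    text: state.text + (chunk.message?.content ?? ""),
    toolCalls: [...state.toolCalls, ...(chunk.message?.tool_calls ?? [])],
    done: state.done || chunk.done,
  };
}

// Call when the stream closes. If no `done: true` chunk arrived, the stream
// was interrupted; the partial text is still returned so the UI can show it.
function finishTurn(state: TurnState): { ok: boolean; state: TurnState } {
  return { ok: state.done, state };
}

// Usage: text followed by a tool call in the final chunk.
const demoChunks: AssistantChunk[] = [
  { message: { role: "assistant", content: "Let me check." }, done: false },
  {
    message: {
      role: "assistant",
      content: "",
      tool_calls: [{ function: { name: "read_file", arguments: { path: "README.md" } } }],
    },
    done: true,
  },
];
let demoTurn = emptyTurn;
for (const chunk of demoChunks) {
  demoTurn = applyChunk(demoTurn, chunk);
}
console.log(finishTurn(demoTurn).ok, demoTurn.toolCalls.length); // prints "true 1"
```

Keeping text and tool calls in one state object means the transition from streamed text to a tool call needs no special case. The caller inspects `toolCalls` once the turn is finished.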
Related Functional Specs
- Functional Spec: UI/UX (specifically mentions streaming as deferred)
Dependencies
- Story 13 (interruption) should work with streaming
- May need `tokio-stream` or similar for stream utilities
Testing Considerations
- Test with long responses to verify smooth streaming
- Test with responses that include tool calls
- Test interruption during streaming
- Test error cases (network issues, Ollama crashes)
- Test performance with different token rates