.living_spec/stories/18_streaming_responses.md

# Story: Token-by-Token Streaming Responses

## User Story
**As a** User
**I want** to see the model's response appear token-by-token as it generates
**So that** I get immediate feedback and can see the model is working, rather than waiting for the entire response to complete.

## Acceptance Criteria
*   [ ] Model responses should appear token-by-token in real-time as Ollama generates them
*   [ ] The streaming should feel smooth and responsive (like ChatGPT's typing effect)
*   [ ] Tool calls should still work correctly with streaming enabled
*   [ ] The user should see partial responses immediately, not wait for full completion
*   [ ] Streaming should work for both text responses and responses that include tool calls
*   [ ] Error handling should gracefully handle streaming interruptions
*   [ ] The UI should auto-scroll to follow new tokens as they appear

## Out of Scope
*   Configurable streaming speed/throttling
*   Showing thinking/reasoning process separately (that could be a future enhancement)
*   Streaming for tool outputs (tool outputs can remain non-streaming)

## Implementation Notes

### Backend (Rust)
*   Change `stream: false` to `stream: true` in the Ollama request
*   Parse streaming JSON response from Ollama (newline-delimited JSON)
*   Emit `chat:token` events for each token received
*   Handle both streaming text and tool call responses
*   Use `reqwest` with streaming body support (the `stream` feature, which provides `Response::bytes_stream`)
*   Consider using `futures::StreamExt` for async stream processing

### Frontend (TypeScript)
*   Listen for `chat:token` events
*   Append tokens to the current assistant message in real-time
*   Update the UI state without full re-renders (performance)
*   Maintain smooth auto-scroll as tokens arrive
*   Handle the transition from streaming text to tool calls
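
As a rough illustration of the "append tokens to the current assistant message" step, the sketch below uses a pure update function. The `Message` shape and the `appendToken` name are hypothetical, not taken from the existing codebase, and the wiring to the `chat:token` listener is left out because it depends on the event API the app uses.

```typescript
// Hypothetical message shape; the real app's type may differ.
interface Message {
  role: "user" | "assistant";
  content: string;
}

// Returns a new list with `token` appended to the trailing assistant message.
// If the last message is not from the assistant, a new assistant message is started.
// Earlier message objects are reused as-is, so memoized views of them can skip re-rendering.
function appendToken(messages: readonly Message[], token: string): Message[] {
  const last = messages[messages.length - 1];
  if (last !== undefined && last.role === "assistant") {
    return [...messages.slice(0, -1), { ...last, content: last.content + token }];
  }
  return [...messages, { role: "assistant", content: token }];
}

// Usage: feed tokens in arrival order, as a `chat:token` handler would.
let chatLog: Message[] = [{ role: "user", content: "Hi" }];
for (const token of ["Hello", " world", "!"]) {
  chatLog = appendToken(chatLog, token);
}
console.log(chatLog[1].content); // "Hello world!"
```

Keeping the update pure makes it easy to unit test without a running backend, and only the last message object changes per token.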

### Ollama Streaming Format
Ollama returns newline-delimited JSON when streaming:
```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
```
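
Because a network read can end in the middle of a line, the parser has to buffer partial lines between reads. The sketch below shows that buffering in TypeScript for consistency with the frontend code; the Rust backend would apply the same idea to the `reqwest` byte stream. `StreamChunk` covers only the fields in the sample above, and `NdjsonBuffer` is a hypothetical name.

```typescript
// One streamed line, limited to the fields shown in the sample above.
interface StreamChunk {
  message: { role: string; content: string };
  done: boolean;
}

// Buffers raw text and returns one parsed object per complete line.
class NdjsonBuffer {
  private pending = "";

  // Accepts the next piece of the response body and returns every chunk whose
  // terminating newline has arrived. A trailing partial line stays buffered.
  push(text: string): StreamChunk[] {
    this.pending += text;
    const lines = this.pending.split("\n");
    this.pending = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamChunk);
  }
}

const ndjson = new NdjsonBuffer();
let replyText = "";
// The second JSON object is deliberately split across two reads.
const networkReads = [
  '{"message":{"role":"assistant","content":"Hello"},"done":false}\n{"message":{"role":"assis',
  'tant","content":" world"},"done":false}\n',
  '{"message":{"role":"assistant","content":"!"},"done":true}\n',
];
for (const read of networkReads) {
  for (const chunk of ndjson.push(read)) {
    replyText += chunk.message.content;
  }
}
console.log(replyText); // "Hello world!"
```

This sketch assumes every line ends with a newline; a real implementation would probably also flush whatever remains in the buffer when the stream closes, and decode bytes with a streaming-aware decoder so multi-byte characters split across reads are not corrupted.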

### Challenges
*   Parsing streaming JSON (each line is a separate JSON object, and a network read can end mid-line)
*   Maintaining state between streaming chunks
*   Handling tool calls that interrupt streaming text
*   Performance with high token throughput
*   Error recovery if the stream is interrupted
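
For the throughput concern, one option (not a requirement of this story) is to collect tokens and hand them to the UI once per frame instead of once per token. The sketch below is a minimal, timer-free version of that idea: `TokenBatcher` is a hypothetical name, and the caller is assumed to invoke `flush()` from whatever scheduler the UI uses, such as an animation-frame callback.

```typescript
// Collects tokens between UI updates so that many tokens cost one state update.
class TokenBatcher {
  private parts: string[] = [];
  private readonly onFlush: (text: string) => void;

  constructor(onFlush: (text: string) => void) {
    this.onFlush = onFlush;
  }

  // Stores a token without touching the UI.
  push(token: string): void {
    this.parts.push(token);
  }

  // Delivers everything collected since the last flush; does nothing when empty.
  flush(): void {
    if (this.parts.length === 0) return;
    const text = this.parts.join("");
    this.parts = [];
    this.onFlush(text);
  }
}

const uiUpdates: string[] = [];
const batcher = new TokenBatcher((text) => {
  uiUpdates.push(text);
});
for (const token of ["Hel", "lo", " wor"]) {
  batcher.push(token);
}
batcher.flush(); // first frame: one update for three tokens
batcher.push("ld!");
batcher.flush(); // second frame
batcher.flush(); // nothing pending, so no update is delivered
console.log(uiUpdates); // ["Hello wor", "ld!"]
```

Whether batching is needed at all depends on measured render cost; at typical local-model token rates, per-token updates may already be fast enough.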

## Related Functional Specs
*   Functional Spec: UI/UX (specifically mentions streaming as deferred)

## Dependencies
*   Story 13 (interruption) should work with streaming
*   May need `tokio-stream` or similar for stream utilities

## Testing Considerations
*   Test with long responses to verify smooth streaming
*   Test with responses that include tool calls
*   Test interruption during streaming
*   Test error cases (network issues, Ollama crashes)
*   Test performance with different token rates

feat: Backend cancellation support for interrupting model responses

Merged from feature/interrupt-on-type branch.

Backend cancellation infrastructure:
- Added tokio watch channel to SessionState for cancellation signaling
- Implemented cancel_chat command
- Modified chat command to use tokio::select! for racing requests vs cancellation
- When cancelled, HTTP request to Ollama is dropped and returns early
- Added tokio dependency with sync feature

Story updates:
- Story 13: Updated to use Stop button pattern (industry standard)
- Story 18: Created placeholder for streaming responses
- Stories 15-17: Placeholders for future features

Frontend changes:
- Removed auto-interrupt on typing behavior (too confusing)
- Backend infrastructure ready for Stop button implementation

Note: Story 13 UI (Stop button) not yet implemented - backend ready

2025-12-27 15:36:58 +00:00
