Story 18: Token-by-token streaming responses
- Backend: Added OllamaProvider::chat_stream() with newline-delimited JSON parsing
- Backend: Emit chat:token events for each token received from Ollama
- Backend: Added futures dependency and stream feature for reqwest
- Frontend: Added streamingContent state and chat:token event listener
- Frontend: Real-time token display with auto-scroll
- Frontend: Markdown and syntax highlighting support for streaming content
- Fixed all TypeScript errors (tsc --noEmit)
- Fixed all Biome warnings and errors
- Fixed all Clippy warnings
- Added comprehensive code quality documentation
- Added tsc --noEmit to verification checklist

Tested and verified:
- Tokens stream in real-time
- Auto-scroll works during streaming
- Tool calls interrupt streaming correctly
- Multi-turn conversations work
- Smooth performance with no lag
@@ -11,13 +11,28 @@ Instead of waiting for the final array of messages, the Backend should emit **Ev
* `chat:tool-start`: Emitted when a tool call begins (e.g., `{ tool: "git status" }`).
* `chat:tool-end`: Emitted when a tool call finishes (e.g., `{ output: "..." }`).

### 2. Implementation Strategy

* **Refactor `chat` command:**

  * Instead of returning `Vec<Message>` at the very end, it accepts an `AppHandle`.
  * Inside the loop, after every step (LLM response, Tool Execution), emit a `chat:update` event containing the *current partial history*.
  * The Frontend listens to `chat:update` and re-renders immediately.

#### Token-by-Token Streaming (Story 18)

The system now implements full token streaming for real-time response display:

* **Backend (Rust):**

  * Set `stream: true` in Ollama API requests
  * Parse newline-delimited JSON from Ollama's streaming response
  * Emit `chat:token` events for each token received
  * Use `reqwest`'s streaming body with async iteration
  * After streaming completes, emit `chat:update` with the full message

* **Frontend (TypeScript):**

  * Listen for `chat:token` events
  * Append tokens to the current assistant message in real-time
  * Maintain smooth auto-scroll as tokens arrive
  * After streaming completes, process `chat:update` for the final state

* **Event-Driven Updates:**

  * `chat:token`: Emitted for each token during streaming (payload: `{ content: string }`)
  * `chat:update`: Emitted after the LLM response completes or after Tool Execution (payload: `Message[]`)
  * The Frontend maintains streaming state separate from the message history
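For illustration, the two payloads and how token events fold into display text can be modeled in TypeScript. The `Message` shape below is an assumption for the sketch; the real type is defined in the app's code:

```typescript
// Hypothetical Message shape for this sketch; the app's actual type may differ.
interface Message {
  role: "user" | "assistant" | "tool";
  content: string;
}

// Payload of a single `chat:token` event.
interface TokenPayload {
  content: string;
}

// `chat:update` carries the full (possibly partial) history.
type UpdatePayload = Message[];

// Fold a sequence of token payloads into the text shown while streaming.
function accumulate(tokens: TokenPayload[]): string {
  return tokens.map((t) => t.content).join("");
}

console.log(accumulate([{ content: "Hel" }, { content: "lo" }])); // "Hello"
```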
### 3. Visuals

* **Loading State:** The "Send" button should show a spinner or "Stop" button.

@@ -158,6 +173,55 @@ Integrate syntax highlighting into markdown code blocks rendered by the assistan

* Ensure syntax highlighted code blocks are left-aligned
* Test with various code samples to ensure proper rendering

## Token Streaming
### Problem

Without streaming, users see no feedback during model generation. The response appears all at once after waiting, which feels unresponsive and provides no indication that the system is working.

### Solution: Token-by-Token Streaming

Stream tokens from Ollama in real-time and display them as they arrive, providing immediate feedback and a responsive chat experience similar to ChatGPT.

### Requirements

1. **Real-time Display:** Tokens appear immediately as Ollama generates them
2. **Smooth Performance:** No lag or stuttering during high token throughput
3. **Tool Compatibility:** Streaming works correctly with tool calls and multi-turn conversations
4. **Auto-scroll:** Chat view follows streaming content automatically
5. **Error Handling:** Gracefully handle stream interruptions or errors
6. **State Management:** Maintain clean separation between streaming state and final message history

### Implementation Notes

#### Backend (Rust)

* Enable streaming in Ollama requests: `stream: true`
* Parse newline-delimited JSON from the response body
* Each line is a separate JSON object: `{"message":{"content":"token"},"done":false}`
* Use `futures::StreamExt` or similar for async stream processing
* Emit a `chat:token` event for each token
* Emit `chat:update` when streaming completes
* Handle both streaming text and tool call interruptions

#### Frontend (TypeScript)

* Create streaming state separate from message history
* Listen for `chat:token` events and append to streaming buffer
* Render streaming content in real-time
* On `chat:update`, replace streaming content with final message
* Maintain scroll position during streaming

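A minimal sketch of that state separation in TypeScript (names such as `streamBuffer` and the string-based history are illustrative, not the app's actual identifiers):

```typescript
// Illustrative chat state: committed history kept apart from in-flight text.
type ChatState = {
  messages: string[];   // finalized turns (simplified to plain strings here)
  streamBuffer: string; // text of the assistant message currently streaming
};

// chat:token → append to the streaming buffer only; history is untouched.
function onToken(state: ChatState, token: string): ChatState {
  return { ...state, streamBuffer: state.streamBuffer + token };
}

// chat:update → adopt the finalized history and clear the buffer.
function onUpdate(state: ChatState, messages: string[]): ChatState {
  return { messages, streamBuffer: "" };
}

let state: ChatState = { messages: [], streamBuffer: "" };
state = onToken(state, "Hello");
state = onToken(state, " world");
console.log(state.streamBuffer); // "Hello world"
state = onUpdate(state, ["Hello world"]);
console.log(state.streamBuffer); // ""
```

Keeping the buffer out of the message array means the streaming render path never mutates history, so a `chat:update` can simply replace it wholesale.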
#### Ollama Streaming Format

```json
{"message":{"role":"assistant","content":"Hello"},"done":false}
{"message":{"role":"assistant","content":" world"},"done":false}
{"message":{"role":"assistant","content":"!"},"done":true}
{"message":{"role":"assistant","tool_calls":[...]},"done":true}
```

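As a sketch of the parsing step (written in TypeScript for brevity; the real parser lives in the Rust backend), each line is decoded independently and content fragments are concatenated until a line reports `done: true`:

```typescript
// Parse newline-delimited JSON and concatenate message content
// until a line reports done: true.
function collectStream(ndjson: string): { text: string; done: boolean } {
  let text = "";
  let done = false;
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const obj = JSON.parse(line);
    if (obj.message?.content) text += obj.message.content;
    if (obj.done) done = true;
  }
  return { text, done };
}

const sample =
  '{"message":{"role":"assistant","content":"Hello"},"done":false}\n' +
  '{"message":{"role":"assistant","content":" world"},"done":false}\n' +
  '{"message":{"role":"assistant","content":"!"},"done":true}';
const result = collectStream(sample);
// result.text is "Hello world!", result.done is true
```

Note that real network chunks can split a JSON line in half, so the actual parser must buffer an incomplete trailing line and decode it once the rest arrives.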
### Edge Cases

* Tool calls during streaming: Switch from text streaming to tool execution
* Cancellation during streaming: Clean up streaming state properly
* Network interruptions: Show error and preserve partial content
* Very fast streaming: Throttle UI updates if needed for performance

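One way to realize the throttling idea is to batch tokens and flush them to the UI in chunks. This sketch uses a counter-based flush so it stays self-contained; a real UI would more likely flush on a timer or `requestAnimationFrame`, and `TokenBatcher` is a hypothetical name, not an existing class:

```typescript
// Coalesce rapid tokens into fewer, larger UI updates.
class TokenBatcher {
  private pending = "";
  private count = 0;

  constructor(
    private flushEvery: number,             // flush after this many tokens
    private onFlush: (chunk: string) => void, // delivers a batch to the UI
  ) {}

  push(token: string): void {
    this.pending += token;
    this.count += 1;
    if (this.count >= this.flushEvery) this.flush();
  }

  // Call on done/cancel so trailing tokens are not lost.
  flush(): void {
    if (this.pending) this.onFlush(this.pending);
    this.pending = "";
    this.count = 0;
  }
}

const chunks: string[] = [];
const batcher = new TokenBatcher(2, (c) => chunks.push(c));
["a", "b", "c"].forEach((t) => batcher.push(t));
batcher.flush();
// chunks is ["ab", "c"]
```

The explicit `flush()` doubles as the cleanup hook for the cancellation and interruption cases above, preserving partial content.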
## Input Focus Management
### Problem