feat: Story 8 - Collapsible tool outputs + autonomous coding improvements

Implemented Story 8: Collapsible Tool Outputs
- Tool outputs now render in <details>/<summary> elements, collapsed by default
- Summary shows tool name with key argument (e.g., ▶ read_file(src/main.rs))
- Added arrow rotation animation and scrollable content (max 300px)
- Enhanced tool_calls display to show arguments inline
- Added CSS styling for dark theme consistency

Fixed: LLM autonomous coding behavior
- Strengthened system prompt with explicit examples and directives
- Implemented triple-reinforcement system (primary prompt + reminder + message prefixes)
- Improved tool descriptions to be more explicit and action-oriented
- Increased MAX_TURNS from 10 to 30 for complex agentic workflows
- Added debug logging for Ollama requests/responses
- Result: GPT-OSS (gpt-oss:20b) now successfully uses write_file autonomously

Documentation improvements
- Created MODEL_SELECTION.md guide with recommendations
- Updated PERSONA.md spec to emphasize autonomous agent behavior
- Updated UI_UX.md spec with collapsible tool output requirements
- Updated SDSW workflow: LLM archives stories and performs squash merge

Cleanup
- Removed unused ToolTester.tsx component
@@ -66,8 +66,11 @@ When the user asks for a feature, follow this 4-step loop strictly:

### Step 4: Verification (Close)

* **Action:** Write a test case that maps directly to the Acceptance Criteria in the Story.

**Action:** Run compilation and make sure it succeeds without errors. Fix warnings if possible. Run tests and make sure they all pass before proceeding. Ask questions here if needed.

* **Action:** Ask the user to accept the story. Move to `stories/archive/`. Tell the user to **Squash Merge** the feature branch (e.g. `git merge --squash feature/story-name`) and commit. This ensures the main history reflects one atomic commit per Story.

* **Action:** Run compilation and make sure it succeeds without errors. Fix warnings if possible. Run tests and make sure they all pass before proceeding. Ask questions here if needed.

* **Action:** Ask the user to accept the story.

* **Action:** When the user accepts, move the story file to `stories/archive/` (e.g., `mv stories/XX_story_name.md stories/archive/`).

* **Action:** Commit the archive move to the feature branch.

* **Action:** Tell the user to **Squash Merge** the feature branch (e.g., `git merge --squash feature/story-name`) and commit. This ensures the main history reflects one atomic commit per Story, including the archived story file.

---
@@ -2,23 +2,47 @@

## 1. Role Definition

The Agent acts as a **Senior Software Engineer** embedded within the user's local environment.

**Critical:** The Agent is NOT a chatbot that suggests code. It is an AUTONOMOUS AGENT that directly executes changes via tools.

## 2. Directives

The System Prompt must enforce the following behaviors:

1. **Tool First:** Do not guess code. Read files first.

2. **Conciseness:** Do not explain "I will now do X". Just do X (call the tool).

3. **Safety:** Never modify files outside the scope (though the backend enforces this, the LLM should know).

4. **Format:** When writing code, write the *whole* file if the tool requires it, or handle partials if we upgrade the tool (currently `write_file` overwrites).

1. **Action Over Suggestion:** When asked to write, create, or modify code, the Agent MUST use tools (`write_file`, `read_file`, etc.) to directly implement the changes. It must NEVER respond with code suggestions or instructions for the user to follow.

2. **Tool First:** Do not guess code. Read files first using `read_file`.

3. **Proactive Execution:** When the user requests a feature or change:

   * Read relevant files to understand context
   * Write the actual code using `write_file`
   * Verify the changes (e.g., run tests, check syntax)
   * Report completion, not suggestions

4. **Conciseness:** Do not explain "I will now do X". Just do X (call the tool).

5. **Safety:** Never modify files outside the scope (though the backend enforces this, the LLM should know).

6. **Format:** When writing code, write the *whole* file if the tool requires it, or handle partials if we upgrade the tool (currently `write_file` overwrites).

## 3. Implementation

* **Location:** `src-tauri/src/llm/prompts.rs`

* **Injection:** The system message is prepended to the `messages` vector in `chat::chat` before sending to the Provider.

* **Reinforcement System:** For stubborn models that ignore directives, we implement a triple-reinforcement approach:

  1. **Primary System Prompt** (index 0): Full instructions with examples
  2. **Aggressive Reminder** (index 1): A second system message with critical reminders about using tools
  3. **User Message Prefix**: Each user message is prefixed with `[AGENT DIRECTIVE: You must use write_file tool to implement changes. Never suggest code.]`
* **Deduplication:** Ensure we don't stack multiple system messages if the loop runs long (though currently we reconstruct history per turn).
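The layering above amounts to a pure message-assembly step. A minimal sketch follows (in TypeScript for illustration only — the actual implementation is Rust in `src-tauri/src/llm/prompts.rs`, and the `SYSTEM_PROMPT`/`REMINDER`/`PREFIX` constants are abbreviated placeholders, not the real prompt text):

```typescript
type Role = "system" | "user" | "assistant" | "tool";
interface Msg { role: Role; content: string }

// Placeholders for the real prompt text.
const SYSTEM_PROMPT = "You are a Senior Software Engineer Agent...";
const REMINDER = "CRITICAL: use tools directly; never suggest code.";
const PREFIX =
  "[AGENT DIRECTIVE: You must use write_file tool to implement changes. Never suggest code.] ";

// Rebuild the outgoing message list each turn so system messages never stack.
function buildMessages(history: Msg[]): Msg[] {
  // Deduplication: drop any system messages carried over from prior turns.
  const body = history.filter((m) => m.role !== "system");
  // Prefix every user message with the directive (idempotently).
  const prefixed = body.map((m) =>
    m.role === "user" && !m.content.startsWith("[AGENT DIRECTIVE")
      ? { ...m, content: PREFIX + m.content }
      : m
  );
  return [
    { role: "system", content: SYSTEM_PROMPT }, // index 0: primary prompt
    { role: "system", content: REMINDER },      // index 1: aggressive reminder
    ...prefixed,
  ];
}
```

Because the list is reconstructed from scratch every turn, running the loop long never accumulates duplicate system messages.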
## 4. The Prompt Text (Draft)

"You are a Senior Software Engineer Agent running in a local Tauri environment.
You have access to the user's filesystem via tools.

- ALWAYS read files before modifying them to understand context.
- When asked to create or edit, use 'write_file'.
- 'write_file' overwrites the ENTIRE content. Do not write partial diffs.
- Be concise. Use tools immediately."

## 4. The Prompt Text Requirements

The system prompt must emphasize:

* **Identity:** "You are an AI Agent with direct filesystem access"
* **Prohibition:** "DO NOT suggest code to the user. DO NOT output code blocks for the user to copy."
* **Mandate:** "When asked to implement something, USE the tools to directly write files."
* **Process:** "Read first, then write. Verify your work."
* **Tool Reminder:** List available tools explicitly and remind the Agent to use them.

## 5. Target Models

This prompt must work effectively with:

* **Local Models:** Qwen, DeepSeek Coder, CodeLlama, Mistral, Llama 3.x
* **Remote Models:** Claude, GPT-4, Gemini

Some local models require more explicit instructions about tool usage. The prompt should be unambiguous.

## 6. Handling Stubborn Models

Some models (particularly coding assistants trained to suggest rather than execute) may resist using `write_file` even with clear instructions. For these models:

* **Use the triple-reinforcement system** (primary prompt + reminder + message prefixes)
* **Consider alternative models** that are better trained for autonomous execution (e.g., DeepSeek-Coder-V2, Llama 3.1)
* **Known issues:** Qwen3-Coder models tend to suggest code rather than write it directly, despite tool-calling support
@@ -22,3 +22,34 @@ For this story, we won't fully implement token streaming (as `reqwest` blocking/

### 3. Visuals

* **Loading State:** The "Send" button should show a spinner or "Stop" button.
* **Auto-Scroll:** The chat view should stick to the bottom as new events arrive.

## Tool Output Display

### Problem

Tool outputs (like file contents, search results, or command output) can be very long, making the chat history difficult to read. Users need to see the Agent's reasoning and responses without being overwhelmed by verbose tool output.

### Solution: Collapsible Tool Outputs

Tool outputs should be rendered in a collapsible component that is **closed by default**.

### Requirements

1. **Default State:** Tool outputs are collapsed/closed when first rendered
2. **Summary Line:** Shows essential information without expanding:
   - Tool name (e.g., `read_file`, `exec_shell`)
   - Key arguments (e.g., file path, command name)
   - Format: "▶ tool_name(key_arg)"
   - Example: "▶ read_file(src/main.rs)"
   - Example: "▶ exec_shell(cargo check)"
3. **Expandable:** User can click the summary to toggle expansion
4. **Output Display:** When expanded, shows the complete tool output in a readable format:
   - Use `<pre>` or a monospace font for code/terminal output
   - Preserve whitespace and line breaks
   - Limit height with scrolling for very long outputs (e.g., max-height: 300px)
5. **Visual Indicator:** Clear arrow or icon showing collapsed/expanded state
6. **Styling:** Consistent with the dark theme, distinguishable from assistant messages

### Implementation Notes

* Use native `<details>` and `<summary>` HTML elements for accessibility
* Or implement a custom collapsible component with proper ARIA attributes
* Tool outputs should be visually distinct (border, background color, or badge)
* Multiple tool calls in sequence should each be independently collapsible
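As a sketch of how the summary line and markup could fit together (TypeScript; the helper names, CSS class, and inline style are illustrative rather than the app's actual ones, and `output` is assumed to be HTML-escaped before interpolation):

```typescript
// Build the collapsed summary line, e.g. "▶ read_file(src/main.rs)".
// Heuristic: use the first string-valued argument as the "key" argument.
function toolSummary(name: string, args: Record<string, unknown>): string {
  const key = Object.values(args).find((v) => typeof v === "string");
  return key !== undefined ? `▶ ${name}(${key})` : `▶ ${name}()`;
}

// Minimal <details>/<summary> markup, collapsed by default (no `open` attribute).
// NOTE: `output` must already be HTML-escaped by the caller.
function toolOutputHtml(
  name: string,
  args: Record<string, unknown>,
  output: string
): string {
  return [
    `<details class="tool-output">`,
    `<summary>${toolSummary(name, args)}</summary>`,
    `<pre style="max-height: 300px; overflow: auto;">${output}</pre>`,
    `</details>`,
  ].join("");
}
```

Each tool call gets its own `<details>` element, so consecutive calls are independently collapsible for free.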
.living_spec/specs/tech/MODEL_SELECTION.md — 139 lines, new file

@@ -0,0 +1,139 @@

# Model Selection Guide

## Overview

This application requires LLM models that support **tool calling** (function calling) and are capable of **autonomous execution** rather than just code suggestion. Not all models are suitable for agentic workflows.

## Recommended Models

### Primary Recommendation: GPT-OSS

**Model:** `gpt-oss:20b`

- **Size:** 13 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Excellent
- **Why:** OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses `write_file` to implement changes directly rather than suggesting code.

```bash
ollama pull gpt-oss:20b
```

### Alternative Options

#### Llama 3.1 (Best Balance)

**Model:** `llama3.1:8b`

- **Size:** 4.7 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Good
- **Why:** Industry standard for tool calling. Well-documented, reliable, and smaller than GPT-OSS.

```bash
ollama pull llama3.1:8b
```

#### Qwen 2.5 Coder (Coding Focused)

**Model:** `qwen2.5-coder:7b` or `qwen2.5-coder:14b`

- **Size:** 4.5 GB / 9 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Specifically trained for coding tasks. Note: Use Qwen **2.5**, NOT Qwen 3.

```bash
ollama pull qwen2.5-coder:7b
# or for more capability:
ollama pull qwen2.5-coder:14b
```

#### Mistral (General Purpose)

**Model:** `mistral:7b`

- **Size:** 4 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Fast, efficient, and good at following instructions.

```bash
ollama pull mistral:7b
```

## Models to Avoid

### ❌ Qwen3-Coder

**Problem:** Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using `write_file` to implement changes directly.

**Status:** Works for reading files and analysis, but not recommended for autonomous coding.

### ❌ DeepSeek-Coder-V2

**Problem:** Does not support tool calling at all.

**Error:** `"registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"`

### ❌ StarCoder / CodeLlama (older versions)

**Problem:** Most older coding models don't support tool calling, or do it poorly.

## How to Verify Tool Support

Check whether a model supports tools on the Ollama library page:

```
https://ollama.com/library/<model-name>
```

Look for the "Tools" tag in the model's capabilities.

You can also check locally:

```bash
ollama show <model-name>
```
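Programmatically, recent Ollama builds expose the same information through the `/api/show` endpoint, whose JSON response includes a `capabilities` array (the field name matches current Ollama releases, but verify against your installed version). A small TypeScript guard over that response:

```typescript
// Subset of the Ollama /api/show response relevant here; other fields omitted.
interface ShowResponse {
  capabilities?: string[];
}

// True when the model reports native tool-calling support.
function supportsTools(info: ShowResponse): boolean {
  return info.capabilities?.includes("tools") ?? false;
}
```

For example, a response carrying `capabilities: ["completion", "tools"]` passes, while a completion-only model (or an older Ollama that omits the field) fails.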

## Model Selection Criteria

When choosing a model for autonomous coding, prioritize:

1. **Tool Calling Support** - Must support function calling natively
2. **Autonomous Behavior** - Trained to execute rather than suggest
3. **Context Window** - Larger is better for complex projects (32K minimum, 128K ideal)
4. **Size vs Performance** - Balance between model size and your hardware
5. **Prompt Adherence** - Follows system instructions reliably

## Testing a New Model

To test whether a model works for autonomous coding:

1. Select it in the UI dropdown
2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
3. **Expected behavior:** Uses the `write_file` tool and creates the file
4. **Bad behavior:** Suggests code in markdown blocks or asks what you want to do
If it suggests code instead of writing it, the model is not suitable for this application.
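This pass/fail check can be automated against the chat response. A sketch (TypeScript; the `tool_calls` shape mirrors Ollama's `/api/chat` response message, trimmed to the fields used here — treat the exact structure as an assumption to verify):

```typescript
// Trimmed shape of a tool call in an Ollama /api/chat response message.
interface ToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}
interface ChatMessage {
  content: string;
  tool_calls?: ToolCall[];
}

// Pass: the model invoked write_file. Fail: no write_file call, e.g. it only
// suggested code in a markdown block or asked a clarifying question.
function passesAutonomyTest(msg: ChatMessage): boolean {
  return (msg.tool_calls ?? []).some((t) => t.function.name === "write_file");
}
```

Running this against the "Create a new file called test.txt" prompt turns the manual smoke test into a one-line assertion.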

## Context Window Management

Current context usage (approximate):

- System prompts: ~1,000 tokens
- Tool definitions: ~300 tokens
- Per-message overhead: ~50-100 tokens
- Average conversation: 2-5K tokens
Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
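Those figures give a rough back-of-envelope budget. A sketch (TypeScript; every constant is one of the approximations above, with 75 tokens taken as the midpoint of the 50-100 per-message overhead, and `avgContentTokens` a hypothetical average content size per message):

```typescript
const SYSTEM_TOKENS = 1000;   // system prompts (~1,000 tokens)
const TOOL_DEF_TOKENS = 300;  // tool definitions (~300 tokens)
const MSG_OVERHEAD = 75;      // midpoint of the 50-100 per-message overhead

// Total context consumed by a conversation of `messageCount` messages.
function estimateTokens(messageCount: number, avgContentTokens: number): number {
  return SYSTEM_TOKENS + TOOL_DEF_TOKENS + messageCount * (MSG_OVERHEAD + avgContentTokens);
}

// Exchanges (user message + reply) that fit in a given context window.
function maxExchanges(contextWindow: number, avgContentTokens: number): number {
  const perExchange = 2 * (MSG_OVERHEAD + avgContentTokens);
  return Math.floor((contextWindow - SYSTEM_TOKENS - TOOL_DEF_TOKENS) / perExchange);
}
```

With ~500 content tokens per message, a 32K window fits roughly 26 exchanges, in line with the 20-30 figure above; a 128K window comfortably exceeds the 30-turn agent-loop cap.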

## Performance Notes

**Speed:** Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.

**Hardware:**

- 8B models: ~8 GB RAM
- 20B models: ~16 GB RAM
- 70B models: ~48 GB RAM (quantized)

**Recommendation:** Start with `llama3.1:8b` for speed; upgrade to `gpt-oss:20b` for reliability.

## Summary

**For this application:**

1. **Best overall:** `gpt-oss:20b` (proven autonomous behavior)
2. **Best balance:** `llama3.1:8b` (fast, reliable, well-supported)
3. **For coding:** `qwen2.5-coder:7b` (specialized, but smaller context)

**Avoid:** Qwen3-Coder, DeepSeek-Coder-V2, and any model without tool support.
@@ -1,15 +0,0 @@

# Story: Collapsible Tool Outputs

## User Story

**As a** User
**I want** tool outputs (like long file contents or search results) to be collapsed by default
**So that** the chat history remains readable and I can focus on the Agent's reasoning.

## Acceptance Criteria

* [ ] Frontend: Render tool outputs inside a `<details>` / `<summary>` component (or custom equivalent).
* [ ] Frontend: Default state should be **Closed/Collapsed**.
* [ ] Frontend: The summary line should show the Tool Name + minimal args (e.g., "▶ read_file(src/main.rs)").
* [ ] Frontend: Clicking the arrow/summary expands to show the full output.

## Out of Scope

* Complex syntax highlighting for tool outputs (plain text/pre is fine).

.living_spec/stories/09_remove_scroll_bars.md — 3 lines, new file

@@ -0,0 +1,3 @@

There is a scroll bar on the right that looks gross; there is also a horizontal scroll bar that should be removed.

This story needs to be worked through.
@@ -1,15 +0,0 @@

# Story: Persist Model Selection

## User Story

**As a** User
**I want** the application to remember which LLM model I selected
**So that** I don't have to switch from "llama3" to "deepseek" every time I launch the app.

## Acceptance Criteria

* [ ] Backend/Frontend: Use `tauri-plugin-store` to save the `selected_model` string.
* [ ] Frontend: On mount (after fetching available models), check the store.
* [ ] Frontend: If the stored model exists in the available list, select it.
* [ ] Frontend: When the user changes the dropdown, update the store.

## Out of Scope

* Persisting per-project model settings (global setting is fine for now).
@@ -0,0 +1 @@

this story needs to be worked on

.living_spec/stories/11_make_text_not_centred.md — 1 line, new file

@@ -0,0 +1 @@

all text in the chat window is currently centred, which is weird, especially for code. Make it more readable.

.living_spec/stories/12_be_able_to_use_claude.md — 0 lines, new file

.living_spec/stories/archive/08_collapsible_tool_outputs.md — 25 lines, new file

@@ -0,0 +1,25 @@

# Story: Collapsible Tool Outputs

## User Story

**As a** User
**I want** tool outputs (like long file contents or search results) to be collapsed by default
**So that** the chat history remains readable and I can focus on the Agent's reasoning.

## Acceptance Criteria

* [x] Frontend: Render tool outputs inside a `<details>` / `<summary>` component (or custom equivalent).
* [x] Frontend: Default state should be **Closed/Collapsed**.
* [x] Frontend: The summary line should show the Tool Name + minimal args (e.g., "▶ read_file(src/main.rs)").
* [x] Frontend: Clicking the arrow/summary expands to show the full output.

## Out of Scope

* Complex syntax highlighting for tool outputs (plain text/pre is fine).

## Implementation Plan

1. Create a reusable component for displaying tool outputs with collapsible functionality
2. Update the chat message rendering logic to use this component for tool outputs
3. Ensure the summary line displays tool name and minimal arguments
4. Verify that the component maintains proper styling and readability
5. Test expand/collapse functionality across different tool output types

## Related Functional Specs

* Functional Spec: Tool Outputs