Renamed living spec to Story Kit

2026-02-16 15:44:20 +00:00
parent 0876c53e17
commit 3865883998
35 changed files with 3 additions and 3 deletions
--- a/.story_kit/specs/tech/MODEL_SELECTION.md
+++ b/.story_kit/specs/tech/MODEL_SELECTION.md
@@ -0,0 +1,139 @@
+# Model Selection Guide
+
+## Overview
+This application requires LLMs that support **tool calling** (function calling) and are capable of **autonomous execution** rather than just code suggestion. Not all models are suitable for agentic workflows.
+
+## Recommended Models
+
+### Primary Recommendation: GPT-OSS
+
+**Model:** `gpt-oss:20b`
+- **Size:** 13 GB
+- **Context:** 128K tokens
+- **Tool Support:** ✅ Excellent
+- **Autonomous Behavior:** ✅ Excellent
+- **Why:** OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses `write_file` to implement changes directly rather than suggesting code.
+
+```bash
+ollama pull gpt-oss:20b
+```
+
+### Alternative Options
+
+#### Llama 3.1 (Best Balance)
+**Model:** `llama3.1:8b`
+- **Size:** 4.7 GB
+- **Context:** 128K tokens
+- **Tool Support:** ✅ Excellent
+- **Autonomous Behavior:** ✅ Good
+- **Why:** Widely used and well-documented for tool calling. Reliable, and smaller than GPT-OSS.
+
+```bash
+ollama pull llama3.1:8b
+```
+
+#### Qwen 2.5 Coder (Coding Focused)
+**Model:** `qwen2.5-coder:7b` or `qwen2.5-coder:14b`
+- **Size:** 4.5 GB / 9 GB
+- **Context:** 32K tokens
+- **Tool Support:** ✅ Good
+- **Autonomous Behavior:** ✅ Good
+- **Why:** Specifically trained for coding tasks. Note: Use Qwen **2.5**, NOT Qwen 3.
+
+```bash
+ollama pull qwen2.5-coder:7b
+# or for more capability:
+ollama pull qwen2.5-coder:14b
+```
+
+#### Mistral (General Purpose)
+**Model:** `mistral:7b`
+- **Size:** 4 GB
+- **Context:** 32K tokens
+- **Tool Support:** ✅ Good
+- **Autonomous Behavior:** ✅ Good
+- **Why:** Fast, efficient, and good at following instructions.
+
+```bash
+ollama pull mistral:7b
+```
+
+## Models to Avoid
+
+### ❌ Qwen3-Coder
+**Problem:** Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using `write_file` to implement changes directly.
+
+**Status:** Works for reading files and analysis, but not recommended for autonomous coding.
+
+### ❌ DeepSeek-Coder-V2
+**Problem:** Does not support tool calling at all.
+
+**Error:** `"registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"`
+
+### ❌ StarCoder / CodeLlama (older versions)
+**Problem:** Most older coding models don't support tool calling or do it poorly.
+
+## How to Verify Tool Support
+
+Check if a model supports tools on the Ollama library page:
+```
+https://ollama.com/library/<model-name>
+```
+
+Look for the "Tools" tag in the model's capabilities.
+
+You can also check locally. Recent Ollama versions list `tools` under the model's capabilities in the output:
+```bash
+ollama show <model-name>
+```
+
+## Model Selection Criteria
+
+When choosing a model for autonomous coding, prioritize:
+
+1. **Tool Calling Support** - Must support function calling natively
+2. **Autonomous Behavior** - Trained to execute rather than suggest
+3. **Context Window** - Larger is better for complex projects (32K minimum, 128K ideal)
+4. **Size vs Performance** - Balance between model size and your hardware
+5. **Prompt Adherence** - Follows system instructions reliably
+
+## Testing a New Model
+
+To test if a model works for autonomous coding:
+
+1. Select it in the UI dropdown
+2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
+3. **Expected behavior:** Uses `write_file` tool and creates the file
+4. **Bad behavior:** Suggests code in markdown blocks or asks what you want to do
+
+If it suggests code instead of writing it, the model is not suitable for this application.
+
+## Context Window Management
+
+Current context usage (approximate):
+- System prompts: ~1,000 tokens
+- Tool definitions: ~300 tokens
+- Per message overhead: ~50-100 tokens
+- Average conversation: 2-5K tokens
+
+Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
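+
+The figures above can be turned into a rough budget check. The sketch below is illustrative only: the function names and the `reserve` parameter are assumptions, and the constants are the approximate numbers listed in this section, not measured values.
+
+```rust
+/// Approximate fixed costs, taken from the list above.
+const SYSTEM_PROMPT_TOKENS: u32 = 1_000;
+const TOOL_DEFINITION_TOKENS: u32 = 300;
+
+/// Rough token estimate for a conversation of `messages` messages.
+/// `avg_content_tokens` and `per_message_overhead` are caller-supplied guesses.
+fn estimate_context_tokens(
+    messages: u32,
+    avg_content_tokens: u32,
+    per_message_overhead: u32,
+) -> u32 {
+    SYSTEM_PROMPT_TOKENS
+        + TOOL_DEFINITION_TOKENS
+        + messages * (avg_content_tokens + per_message_overhead)
+}
+
+/// True when the estimate still fits in the context window while leaving
+/// `reserve` tokens free for the model's next reply (hypothetical parameter).
+fn fits_in_context(estimate: u32, context_window: u32, reserve: u32) -> bool {
+    estimate + reserve <= context_window
+}
+```
+
+For example, 60 messages averaging 100 content tokens with 75 tokens of overhead each come to 1,300 + 60 × 175 = 11,800 tokens, which fits a 32K window with room to spare.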
+
+## Performance Notes
+
+**Speed:** Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.
+
+**Hardware:**
+- 8B models: ~8 GB RAM
+- 20B models: ~16 GB RAM
+- 70B models: ~48 GB RAM (quantized)
+
+**Recommendation:** Start with `llama3.1:8b` for speed, upgrade to `gpt-oss:20b` for reliability.
+
+## Summary
+
+**For this application:**
+1. **Best overall:** `gpt-oss:20b` (proven autonomous behavior)
+2. **Best balance:** `llama3.1:8b` (fast, reliable, well-supported)
+3. **For coding:** `qwen2.5-coder:7b` (specialized, but smaller context)
+
+**Avoid:** Qwen3-Coder, DeepSeek-Coder-V2, any model without tool support.
--- a/.story_kit/specs/tech/STACK.md
+++ b/.story_kit/specs/tech/STACK.md
@@ -0,0 +1,111 @@
+# Tech Stack & Constraints
+
+## Overview
+This project is a standalone Rust **web server binary** that serves a Vite/React frontend and exposes a **WebSocket API**. The built frontend assets are packaged with the binary (in a `frontend` directory) and served as static files. It functions as an **Agentic Code Assistant** capable of safely executing tools on the host system.
+
+## Core Stack
+*   **Backend:** Rust (Web Server)
+    *   **MSRV:** Stable (latest)
+    *   **Framework:** Poem HTTP server with WebSocket support for streaming; HTTP APIs should use Poem OpenAPI (Swagger) for non-streaming endpoints.
+*   **Frontend:** TypeScript + React
+    *   **Build Tool:** Vite
+    *   **Styling:** CSS Modules or Tailwind (TBD - Defaulting to CSS Modules)
+    *   **State Management:** React Context / Hooks
+    *   **Chat UI:** Rendered Markdown with syntax highlighting.
+
+## Agent Architecture
+The application follows a **Tool-Use (Function Calling)** architecture:
+1.  **Frontend:** Collects user input and sends it over the WebSocket API to the backend, which forwards it to the LLM.
+2.  **LLM:** Decides to generate text OR request a **Tool Call** (e.g., `execute_shell`, `read_file`).
+3.  **Web Server Backend (The "Hand"):**
+    *   Intercepts Tool Calls.
+    *   Validates the request against the **Safety Policy**.
+    *   Executes the native code (File I/O, Shell Process, Search).
+    *   Returns the output (stdout/stderr/file content) to the LLM.
+    *   **Streaming:** The backend sends real-time updates over WebSocket to keep the UI responsive during long-running Agent tasks.
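+
+The loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `LlmReply`, `run_agent`, and the closure parameters `llm` and `run_tool` are hypothetical names, the transcript is simplified to plain strings, and a real `run_tool` would validate each request against the Safety Policy before executing anything.
+
+```rust
+/// What the model returns on each turn (hypothetical type for illustration).
+enum LlmReply {
+    /// Plain text: the final answer for the user.
+    Text(String),
+    /// A request to run a tool such as `read_file` or `execute_shell`.
+    ToolCall { name: String, args: String },
+}
+
+/// Drive the tool-use loop until the model answers in text or the turn
+/// limit is reached. `llm` sees the transcript so far; `run_tool` is the
+/// backend "hand" that validates and executes a tool call.
+fn run_agent<L, T>(
+    mut llm: L,
+    mut run_tool: T,
+    user_input: &str,
+    max_turns: usize,
+) -> Result<String, String>
+where
+    L: FnMut(&[String]) -> LlmReply,
+    T: FnMut(&str, &str) -> Result<String, String>,
+{
+    let mut transcript = vec![format!("user: {user_input}")];
+    for _ in 0..max_turns {
+        match llm(&transcript) {
+            LlmReply::Text(answer) => return Ok(answer),
+            LlmReply::ToolCall { name, args } => {
+                // Tool failures are reported back to the model instead of
+                // aborting, so it can try a different approach.
+                let output = match run_tool(&name, &args) {
+                    Ok(out) => out,
+                    Err(err) => format!("error: {err}"),
+                };
+                transcript.push(format!("tool {name}: {output}"));
+            }
+        }
+    }
+    Err(format!("no final answer after {max_turns} turns"))
+}
+```
+
+Taking `max_turns` as a parameter keeps the turn limit visible at the call site and makes the loop easy to exercise with stub closures in tests.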
+
+## LLM Provider Abstraction
+To support both Remote and Local models, the system implements a `ModelProvider` abstraction layer.
+
+*   **Strategy:**
+    *   Abstract the differences between API formats (OpenAI-compatible vs Anthropic vs Gemini).
+    *   Normalize "Tool Use" definitions, as each provider handles function calling schemas differently.
+*   **Supported Providers:**
+    *   **Ollama:** Local inference with tool-capable models (e.g., `llama3.1:8b`, `gpt-oss:20b`) for privacy and offline usage.
+    *   **Anthropic:** Claude 3.5 models (Sonnet, Haiku) via API for coding tasks (Story 12).
+*   **Provider Selection:**
+    *   Automatic detection based on model name prefix:
+        *   `claude-` → Anthropic API
+        *   Otherwise → Ollama
+    *   Single unified model dropdown with section headers ("Anthropic", "Ollama")
+*   **API Key Management:**
+    *   Anthropic API key stored server-side and persisted securely
+    *   On first use of Claude model, user prompted to enter API key
+    *   Key persists across sessions (no re-entry needed)
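+
+The prefix-based provider selection described above could look something like this. The `Provider` enum and `provider_for_model` function are hypothetical names chosen for illustration; only the `claude-` prefix rule comes from this spec.
+
+```rust
+/// Backends the application can route a chat request to.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+enum Provider {
+    Anthropic,
+    Ollama,
+}
+
+/// Pick a provider from the model name: `claude-` goes to the
+/// Anthropic API, everything else is treated as a local Ollama model.
+fn provider_for_model(model: &str) -> Provider {
+    if model.starts_with("claude-") {
+        Provider::Anthropic
+    } else {
+        Provider::Ollama
+    }
+}
+```
+
+Because Ollama is the fallback, an unknown or mistyped model name is routed to the local server, which fails without sending anything to a remote API.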
+
+## Tooling Capabilities
+
+### 1. Filesystem (Native)
+*   **Scope:** Strictly limited to the user-selected `project_root`.
+*   **Operations:** Read, Write, List, Delete.
+*   **Constraint:** Modifications to `.git/` are strictly forbidden via file APIs (use Git tools instead).
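+
+One way to enforce the scope and `.git/` constraints is to resolve every requested path lexically before touching the filesystem. The sketch below is an assumption about how this might be done, not the project's code: `resolve_in_project` is a hypothetical helper, it does not follow symlinks (a real implementation would likely also canonicalize existing paths), and it compares `.git` case-sensitively.
+
+```rust
+use std::ffi::OsStr;
+use std::path::{Component, Path, PathBuf};
+
+/// Resolve `requested` relative to `root` without touching the filesystem.
+/// Returns `None` for absolute paths, for `..` segments that would climb
+/// out of `root`, and for any path that passes through a `.git` directory.
+fn resolve_in_project(root: &Path, requested: &str) -> Option<PathBuf> {
+    let mut parts: Vec<&OsStr> = Vec::new();
+    for component in Path::new(requested).components() {
+        match component {
+            Component::Normal(name) => parts.push(name),
+            Component::CurDir => {}
+            // Popping from an empty stack means the path escapes the root.
+            Component::ParentDir => {
+                parts.pop()?;
+            }
+            Component::RootDir | Component::Prefix(_) => return None,
+        }
+    }
+    // File APIs must never modify Git metadata.
+    if parts.iter().any(|part| part.to_str() == Some(".git")) {
+        return None;
+    }
+    Some(parts.iter().fold(root.to_path_buf(), |path, part| path.join(part)))
+}
+```
+
+Returning the resolved `PathBuf` rather than a boolean means callers can only operate on a path that has already passed the check.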
+
+### 2. Shell Execution
+*   **Library:** `tokio::process` for async execution.
+*   **Constraint:** We do **not** run an interactive shell (REPL). We run discrete, stateless commands.
+*   **Allowlist:** The agent may only execute specific binaries:
+    *   `git`
+    *   `cargo`, `rustc`, `rustfmt`, `clippy` (normally invoked as `cargo clippy`)
+    *   `npm`, `node`, `yarn`, `pnpm`, `bun`
+    *   `ls`, `find`, `grep` (if not using internal search)
+    *   `mkdir`, `rm`, `touch`, `mv`, `cp`
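+
+A minimal sketch of the allowlist check, assuming commands arrive as an argument vector rather than a shell string. `ALLOWED_BINARIES` and `is_allowed_command` are hypothetical names; the list itself mirrors the binaries above.
+
+```rust
+/// Binaries the agent may execute, copied from the allowlist above.
+const ALLOWED_BINARIES: &[&str] = &[
+    "git", "cargo", "rustc", "rustfmt", "clippy",
+    "npm", "node", "yarn", "pnpm", "bun",
+    "ls", "find", "grep",
+    "mkdir", "rm", "touch", "mv", "cp",
+];
+
+/// Accept a command only if its first argument is a bare, allowlisted
+/// binary name. Paths such as `/bin/sh` or `./git` are rejected so a
+/// look-alike binary cannot be smuggled in.
+fn is_allowed_command(argv: &[&str]) -> bool {
+    match argv.first() {
+        Some(bin) => {
+            !bin.contains('/') && !bin.contains('\\') && ALLOWED_BINARIES.contains(bin)
+        }
+        None => false,
+    }
+}
+```
+
+This only gates which binary runs. Arguments still need their own checks: `rm` and `mv` are on the list, so paths passed to them have to be confined to `project_root` separately.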
+
+### 3. Search & Navigation
+*   **Library:** `ignore` (by BurntSushi) + `grep` logic.
+*   **Behavior:**
+    *   Must respect `.gitignore` files automatically.
+    *   Must be performant (parallel traversal).
+
+## Coding Standards
+
+### Rust
+*   **Style:** `rustfmt` standard.
+*   **Linter:** `clippy` - Must pass with 0 warnings before merging.
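+*   **Error Handling:** Custom `AppError` type deriving `thiserror::Error`. All handlers and tool functions return `Result<T, AppError>`.
+*   **Concurrency:** Heavy tools (Search, Shell) must run on `tokio` tasks (blocking work via `tokio::task::spawn_blocking`) so they never stall the async runtime or the WebSocket stream that feeds the UI.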
+*   **Quality Gates:**
+    *   `cargo clippy --all-targets --all-features` must show 0 errors, 0 warnings
+    *   `cargo check` must succeed
+    *   `cargo test` must pass all tests
+
+### TypeScript / React
+*   **Style:** Biome formatter (replaces Prettier/ESLint).
+*   **Linter:** Biome - Must pass with 0 errors, 0 warnings before merging.
+*   **Types:** Shared types with Rust are preferred to ensure type safety across the WebSocket/HTTP boundary. Use manual interface matching; `tauri-specta` is Tauri-specific and does not apply to this standalone web server.
+*   **Quality Gates:**
+    *   `npx @biomejs/biome check src/` must show 0 errors, 0 warnings
+    *   `npm run build` must succeed
+    *   No `any` types allowed (use proper types or `unknown`)
+    *   React keys must use stable IDs, not array indices
+    *   All buttons must have explicit `type` attribute
+
+## Libraries (Approved)
+*   **Rust:**
+    *   `serde`, `serde_json`: Serialization.
+    *   `ignore`: Fast recursive directory iteration respecting gitignore.
+    *   `walkdir`: Simple directory traversal.
+    *   `tokio`: Async runtime.
+    *   `reqwest`: For LLM API calls (Anthropic, Ollama).
+    *   `eventsource-stream`: For Server-Sent Events (Anthropic streaming).
+    *   `uuid`: For unique message IDs.
+    *   `chrono`: For timestamps.
+    *   `poem`: HTTP server framework.
+    *   `poem-openapi`: OpenAPI (Swagger) for non-streaming HTTP APIs.
+*   **JavaScript:**
+    *   `react-markdown`: For rendering chat responses.
+
+## Safety & Sandbox
+1.  **Project Scope:** The application must strictly enforce that it does not read/write outside the `project_root` selected by the user.
+2.  **Human in the Loop:**
+    *   Shell commands that modify state (non-readonly) should ideally require a UI confirmation (configurable).
+    *   File writes must be confirmed or revertible.