Refocus workflow on TDD and reorganize stories

2026-02-17 13:34:32 +00:00
parent 1f4f10930f
commit 4c887d93b5
42 changed files with 155 additions and 498 deletions
--- a/.story_kit/specs/tech/MODEL_SELECTION.md
+++ b/.story_kit/specs/tech/MODEL_SELECTION.md
@@ -1,139 +0,0 @@
-# Model Selection Guide
-
-## Overview
-This application requires LLM models that support **tool calling** (function calling) and are capable of **autonomous execution** rather than just code suggestion. Not all models are suitable for agentic workflows.
-
-## Recommended Models
-
-### Primary Recommendation: GPT-OSS
-
-**Model:** `gpt-oss:20b`
- **Size:** 13 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Excellent
- **Why:** OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses `write_file` to implement changes directly rather than suggesting code.
-
-```bash
-ollama pull gpt-oss:20b
-```
-
-### Alternative Options
-
-#### Llama 3.1 (Best Balance)
-**Model:** `llama3.1:8b`
- **Size:** 4.7 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Good
- **Why:** Industry standard for tool calling. Well-documented, reliable, and smaller than GPT-OSS.
-
-```bash
-ollama pull llama3.1:8b
-```
-
-#### Qwen 2.5 Coder (Coding Focused)
-**Model:** `qwen2.5-coder:7b` or `qwen2.5-coder:14b`
- **Size:** 4.5 GB / 9 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Specifically trained for coding tasks. Note: Use Qwen **2.5**, NOT Qwen 3.
-
-```bash
-ollama pull qwen2.5-coder:7b
-# or for more capability:
-ollama pull qwen2.5-coder:14b
-```
-
-#### Mistral (General Purpose)
-**Model:** `mistral:7b`
- **Size:** 4 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Fast, efficient, and good at following instructions.
-
-```bash
-ollama pull mistral:7b
-```
-
-## Models to Avoid
-
-### ❌ Qwen3-Coder
-**Problem:** Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using `write_file` to implement changes directly.
-
-**Status:** Works for reading files and analysis, but not recommended for autonomous coding.
-
-### ❌ DeepSeek-Coder-V2
-**Problem:** Does not support tool calling at all.
-
-**Error:** `"registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"`
-
-### ❌ StarCoder / CodeLlama (older versions)
-**Problem:** Most older coding models don't support tool calling or do it poorly.
-
-## How to Verify Tool Support
-
-Check if a model supports tools on the Ollama library page:
-```
-https://ollama.com/library/<model-name>
-```
-
-Look for the "Tools" tag in the model's capabilities.
-
-You can also check locally:
-```bash
-ollama show <model-name>
-```
-
-## Model Selection Criteria
-
-When choosing a model for autonomous coding, prioritize:
-
-1. **Tool Calling Support** - Must support function calling natively
-2. **Autonomous Behavior** - Trained to execute rather than suggest
-3. **Context Window** - Larger is better for complex projects (32K minimum, 128K ideal)
-4. **Size vs Performance** - Balance between model size and your hardware
-5. **Prompt Adherence** - Follows system instructions reliably
-
-## Testing a New Model
-
-To test if a model works for autonomous coding:
-
-1. Select it in the UI dropdown
-2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
-3. **Expected behavior:** Uses `write_file` tool and creates the file
-4. **Bad behavior:** Suggests code in markdown blocks or asks what you want to do
-
-If it suggests code instead of writing it, the model is not suitable for this application.
-
-## Context Window Management
-
-Current context usage (approximate):
- System prompts: ~1,000 tokens
- Tool definitions: ~300 tokens
- Per message overhead: ~50-100 tokens
- Average conversation: 2-5K tokens
-
-Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
-
-## Performance Notes
-
-**Speed:** Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.
-
-**Hardware:** 
- 8B models: ~8 GB RAM
- 20B models: ~16 GB RAM
- 70B models: ~48 GB RAM (quantized)
-
-**Recommendation:** Start with `llama3.1:8b` for speed, upgrade to `gpt-oss:20b` for reliability.
-
-## Summary
-
-**For this application:**
-1. **Best overall:** `gpt-oss:20b` (proven autonomous behavior)
-2. **Best balance:** `llama3.1:8b` (fast, reliable, well-supported)
-3. **For coding:** `qwen2.5-coder:7b` (specialized, but smaller context)
-
-**Avoid:** Qwen3-Coder, DeepSeek-Coder-V2, any model without tool support.
--- a/.story_kit/specs/tech/STACK.md
+++ b/.story_kit/specs/tech/STACK.md
@@ -76,15 +76,18 @@ To support both Remote and Local models, the system implements a `ModelProvider`
 *   **Quality Gates:**
    *   `cargo clippy --all-targets --all-features` must show 0 errors, 0 warnings
    *   `cargo check` must succeed
-    *   `cargo test` must pass all tests
+    *   `cargo nextest run` must pass all tests

 ### TypeScript / React
 *   **Style:** Biome formatter (replaces Prettier/ESLint).
 *   **Linter:** Biome - Must pass with 0 errors, 0 warnings before merging.
 *   **Types:** Shared types with Rust (via `tauri-specta` or manual interface matching) are preferred to ensure type safety across the bridge.
+*   **Testing:** Vitest for unit/component tests; Playwright for end-to-end tests.
 *   **Quality Gates:**
    *   `npx @biomejs/biome check src/` must show 0 errors, 0 warnings
    *   `npm run build` must succeed
+    *   `npx vitest run` must pass
+    *   `npx playwright test` must pass
    *   No `any` types allowed (use proper types or `unknown`)
    *   React keys must use stable IDs, not array indices
    *   All buttons must have explicit `type` attribute
@@ -103,6 +106,8 @@ To support both Remote and Local models, the system implements a `ModelProvider`
    *   `poem-openapi`: OpenAPI (Swagger) for non-streaming HTTP APIs.
 *   **JavaScript:**
    *   `react-markdown`: For rendering chat responses.
+    *   `vitest`: Unit/component testing.
+    *   `playwright`: End-to-end testing.

 ## Safety & Sandbox
 1.  **Project Scope:** The application must strictly enforce that it does not read/write outside the `project_root` selected by the user.