Renamed living spec to Story Kit

.story_kit/specs/tech/MODEL_SELECTION.md

# Model Selection Guide

## Overview

This application requires LLM models that support **tool calling** (function calling) and are capable of **autonomous execution** rather than just code suggestion. Not all models are suitable for agentic workflows.

## Recommended Models

### Primary Recommendation: GPT-OSS

**Model:** `gpt-oss:20b`

- **Size:** 13 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Excellent
- **Why:** OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses `write_file` to implement changes directly rather than suggesting code.

```bash
ollama pull gpt-oss:20b
```

### Alternative Options

#### Llama 3.1 (Best Balance)

**Model:** `llama3.1:8b`

- **Size:** 4.7 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Good
- **Why:** Industry standard for tool calling. Well documented, reliable, and smaller than GPT-OSS.

```bash
ollama pull llama3.1:8b
```

#### Qwen 2.5 Coder (Coding Focused)

**Model:** `qwen2.5-coder:7b` or `qwen2.5-coder:14b`

- **Size:** 4.5 GB / 9 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Specifically trained for coding tasks. Note: use Qwen **2.5**, NOT Qwen 3.

```bash
ollama pull qwen2.5-coder:7b
# or for more capability:
ollama pull qwen2.5-coder:14b
```

#### Mistral (General Purpose)

**Model:** `mistral:7b`

- **Size:** 4 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Fast, efficient, and good at following instructions.

```bash
ollama pull mistral:7b
```

## Models to Avoid

### ❌ Qwen3-Coder

**Problem:** Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using `write_file` to implement changes directly.

**Status:** Works for reading files and analysis, but not recommended for autonomous coding.

### ❌ DeepSeek-Coder-V2

**Problem:** Does not support tool calling at all.

**Error:** `"registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"`

### ❌ StarCoder / CodeLlama (older versions)

**Problem:** Most older coding models either don't support tool calling or do it poorly.

## How to Verify Tool Support

Check whether a model supports tools on its Ollama library page:

```
https://ollama.com/library/<model-name>
```

Look for the "Tools" tag in the model's capabilities.

You can also check locally:

```bash
ollama show <model-name>
```

## Model Selection Criteria

When choosing a model for autonomous coding, prioritize:

1. **Tool Calling Support** - Must support function calling natively
2. **Autonomous Behavior** - Trained to execute rather than suggest
3. **Context Window** - Larger is better for complex projects (32K minimum, 128K ideal)
4. **Size vs. Performance** - Balance model size against your hardware
5. **Prompt Adherence** - Follows system instructions reliably

## Testing a New Model

To test whether a model works for autonomous coding:

1. Select it in the UI dropdown
2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
3. **Expected behavior:** Uses the `write_file` tool and creates the file
4. **Bad behavior:** Suggests code in markdown blocks or asks what you want to do

If it suggests code instead of writing it, the model is not suitable for this application.

## Context Window Management

Current context usage (approximate):

- System prompts: ~1,000 tokens
- Tool definitions: ~300 tokens
- Per-message overhead: ~50-100 tokens
- Average conversation: 2-5K tokens

Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
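
As a rough sanity check, the overhead figures above can be combined into a simple budget. The numbers below are this section's estimates (not measured token counts), and `estimated_tokens` is an illustrative helper, not part of the codebase:

```rust
/// Rough context-budget estimate using the approximate figures quoted
/// above. All numbers are estimates, not measured token counts.
fn estimated_tokens(exchanges: u32) -> u32 {
    let system_prompt = 1_000;
    let tool_definitions = 300;
    let per_message_overhead = 75; // midpoint of the ~50-100 range
    // Each exchange is one user message plus one assistant message.
    let messages = exchanges * 2;
    system_prompt + tool_definitions + messages * per_message_overhead
}

fn main() {
    // The 30-turn cap: overhead alone stays far below even a 32K window;
    // actual message content comes on top of this.
    println!("{}", estimated_tokens(30)); // prints 5800
}
```

This is why 30 turns is a safe cap even for the 32K-context models above: the fixed overhead is small, and content tokens dominate.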

## Performance Notes

**Speed:** Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.

**Hardware:**

- 8B models: ~8 GB RAM
- 20B models: ~16 GB RAM
- 70B models: ~48 GB RAM (quantized)

**Recommendation:** Start with `llama3.1:8b` for speed; upgrade to `gpt-oss:20b` for reliability.

## Summary

**For this application:**

1. **Best overall:** `gpt-oss:20b` (proven autonomous behavior)
2. **Best balance:** `llama3.1:8b` (fast, reliable, well supported)
3. **For coding:** `qwen2.5-coder:7b` (specialized, but smaller context)

**Avoid:** Qwen3-Coder, DeepSeek-Coder-V2, and any model without tool support.

.story_kit/specs/tech/STACK.md

# Tech Stack & Constraints

## Overview

This project is a standalone Rust **web server binary** that serves a Vite/React frontend and exposes a **WebSocket API**. The built frontend assets are packaged with the binary (in a `frontend` directory) and served as static files. It functions as an **Agentic Code Assistant** capable of safely executing tools on the host system.

## Core Stack

* **Backend:** Rust (web server)
* **MSRV:** Stable (latest)
* **Framework:** Poem HTTP server with WebSocket support for streaming; non-streaming HTTP APIs should use Poem OpenAPI (Swagger).
* **Frontend:** TypeScript + React
* **Build Tool:** Vite
* **Styling:** CSS Modules or Tailwind (TBD - defaulting to CSS Modules)
* **State Management:** React Context / Hooks
* **Chat UI:** Rendered Markdown with syntax highlighting.

## Agent Architecture

The application follows a **Tool-Use (Function Calling)** architecture:

1. **Frontend:** Collects user input and sends it to the LLM.
2. **LLM:** Decides to generate text OR request a **Tool Call** (e.g., `execute_shell`, `read_file`).
3. **Web Server Backend (The "Hand"):**
   * Intercepts Tool Calls.
   * Validates the request against the **Safety Policy**.
   * Executes the native code (File I/O, Shell Process, Search).
   * Returns the output (stdout/stderr/file content) to the LLM.
   * **Streaming:** The backend sends real-time updates over WebSocket to keep the UI responsive during long-running agent tasks.

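
The steps above can be sketched as a minimal dispatch over the two response kinds. `LlmResponse` and `dispatch` are illustrative names only, not types from this codebase; the real backend adds streaming, safety-policy validation, and richer payloads:

```rust
/// Illustrative sketch of the tool-use architecture: the LLM either
/// emits text or requests a tool call, and the backend intercepts
/// tool calls for native execution.
enum LlmResponse {
    Text(String),
    ToolCall { name: String, args: String },
}

fn dispatch(response: &LlmResponse) -> String {
    match response {
        // Plain text is forwarded to the chat UI as-is.
        LlmResponse::Text(t) => t.clone(),
        // Tool calls are intercepted, validated, executed natively,
        // and their output is returned to the LLM.
        LlmResponse::ToolCall { name, args } => {
            format!("executing tool `{name}` with {args}")
        }
    }
}

fn main() {
    let r = LlmResponse::ToolCall {
        name: "read_file".into(),
        args: "{}".into(),
    };
    println!("{}", dispatch(&r)); // prints: executing tool `read_file` with {}
}
```
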
## LLM Provider Abstraction

To support both remote and local models, the system implements a `ModelProvider` abstraction layer.

* **Strategy:**
  * Abstract the differences between API formats (OpenAI-compatible vs. Anthropic vs. Gemini).
  * Normalize "Tool Use" definitions, as each provider handles function-calling schemas differently.
* **Supported Providers:**
  * **Ollama:** Local inference (e.g., Llama 3, DeepSeek Coder) for privacy and offline usage.
  * **Anthropic:** Claude 3.5 models (Sonnet, Haiku) via API for coding tasks (Story 12).
* **Provider Selection:**
  * Automatic detection based on model name prefix:
    * `claude-` → Anthropic API
    * Otherwise → Ollama
  * Single unified model dropdown with section headers ("Anthropic", "Ollama")
* **API Key Management:**
  * Anthropic API key stored server-side and persisted securely
  * On first use of a Claude model, the user is prompted to enter an API key
  * Key persists across sessions (no re-entry needed)

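
The prefix rule above is small enough to show directly. `Provider` and `provider_for` are illustrative names for this sketch, not the actual types:

```rust
/// Provider routing as described above: `claude-` prefixed model
/// names go to the Anthropic API, everything else to Ollama.
#[derive(Debug, PartialEq)]
enum Provider {
    Anthropic,
    Ollama,
}

fn provider_for(model: &str) -> Provider {
    if model.starts_with("claude-") {
        Provider::Anthropic
    } else {
        Provider::Ollama
    }
}

fn main() {
    assert_eq!(provider_for("claude-3-5-sonnet-20241022"), Provider::Anthropic);
    assert_eq!(provider_for("llama3.1:8b"), Provider::Ollama);
}
```
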
## Tooling Capabilities

### 1. Filesystem (Native)

* **Scope:** Strictly limited to the user-selected `project_root`.
* **Operations:** Read, Write, List, Delete.
* **Constraint:** Modifications to `.git/` are strictly forbidden via the file APIs (use the Git tools instead).

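
A minimal sketch of the two constraints above, assuming a purely lexical check on paths already made relative to `project_root`; `is_allowed` is a hypothetical name, and a production version should also canonicalize paths to defeat symlink escapes:

```rust
use std::ffi::OsStr;
use std::path::{Component, Path};

/// Lexical scope check sketch (hypothetical helper): reject paths that
/// could escape the project root via `..` or that touch `.git/`.
/// A real implementation should also canonicalize against symlinks.
fn is_allowed(relative: &Path) -> bool {
    let mut touches_git = false;
    for c in relative.components() {
        match c {
            // `..` could climb out of project_root.
            Component::ParentDir => return false,
            // `.git/` is off-limits to the file APIs.
            Component::Normal(name) if name == OsStr::new(".git") => touches_git = true,
            _ => {}
        }
    }
    !touches_git
}

fn main() {
    println!("{}", is_allowed(Path::new("src/main.rs")));    // true
    println!("{}", is_allowed(Path::new("../secrets.txt"))); // false
    println!("{}", is_allowed(Path::new(".git/config")));    // false
}
```
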
### 2. Shell Execution

* **Library:** `tokio::process` for async execution.
* **Constraint:** We do **not** run an interactive shell (REPL). We run discrete, stateless commands.
* **Allowlist:** The agent may only execute specific binaries:
  * `git`
  * `cargo`, `rustc`, `rustfmt`, `clippy`
  * `npm`, `node`, `yarn`, `pnpm`, `bun`
  * `ls`, `find`, `grep` (if not using the internal search)
  * `mkdir`, `rm`, `touch`, `mv`, `cp`

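
The allowlist check can be sketched as below. `is_allowed_command` is a hypothetical helper; a real implementation should validate a full argument vector (as passed to `tokio::process`) rather than splitting a string:

```rust
/// Allowlist from this section; only the first token (the binary)
/// is checked in this sketch.
const ALLOWED: &[&str] = &[
    "git", "cargo", "rustc", "rustfmt", "clippy",
    "npm", "node", "yarn", "pnpm", "bun",
    "ls", "find", "grep", "mkdir", "rm", "touch", "mv", "cp",
];

/// Hypothetical helper: accept a command line only if its binary
/// is on the allowlist.
fn is_allowed_command(command_line: &str) -> bool {
    command_line
        .split_whitespace()
        .next()
        .map(|binary| ALLOWED.contains(&binary))
        .unwrap_or(false)
}

fn main() {
    println!("{}", is_allowed_command("cargo check"));            // true
    println!("{}", is_allowed_command("curl http://evil.sh"));    // false
}
```
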
### 3. Search & Navigation

* **Library:** `ignore` (by BurntSushi) + `grep`-style matching logic.
* **Behavior:**
  * Must respect `.gitignore` files automatically.
  * Must be performant (parallel traversal).

## Coding Standards

### Rust

* **Style:** `rustfmt` standard.
* **Linter:** `clippy` - must pass with 0 warnings before merging.
* **Error Handling:** Custom `AppError` type deriving `thiserror::Error`. All Commands return `Result<T, AppError>`.
* **Concurrency:** Heavy tools (Search, Shell) must run on `tokio` threads to avoid blocking the UI.
* **Quality Gates:**
  * `cargo clippy --all-targets --all-features` must show 0 errors, 0 warnings
  * `cargo check` must succeed
  * `cargo test` must pass all tests

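
A dependency-free sketch of the error-handling convention above. The actual `AppError` derives `thiserror::Error` rather than hand-writing these impls, and the variants and `run_tool` here are hypothetical:

```rust
use std::fmt;

/// Hypothetical sketch of the `AppError` shape; the real type
/// derives `thiserror::Error` instead of these manual impls.
#[derive(Debug)]
enum AppError {
    Io(String),
    ToolDenied(String),
}

impl fmt::Display for AppError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AppError::Io(msg) => write!(f, "io error: {msg}"),
            AppError::ToolDenied(tool) => write!(f, "tool not allowed: {tool}"),
        }
    }
}

impl std::error::Error for AppError {}

/// Hypothetical command: everything returns `Result<T, AppError>`.
fn run_tool(name: &str) -> Result<String, AppError> {
    if name == "read_file" {
        Ok("file contents".to_string())
    } else {
        Err(AppError::ToolDenied(name.to_string()))
    }
}

fn main() {
    println!("{}", run_tool("format_disk").unwrap_err()); // prints: tool not allowed: format_disk
}
```
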
### TypeScript / React

* **Style:** Biome formatter (replaces Prettier/ESLint).
* **Linter:** Biome - must pass with 0 errors, 0 warnings before merging.
* **Types:** Shared types with Rust (via `tauri-specta` or manual interface matching) are preferred to ensure type safety across the bridge.
* **Quality Gates:**
  * `npx @biomejs/biome check src/` must show 0 errors, 0 warnings
  * `npm run build` must succeed
  * No `any` types allowed (use proper types or `unknown`)
  * React keys must use stable IDs, not array indices
  * All buttons must have an explicit `type` attribute

## Libraries (Approved)

* **Rust:**
  * `serde`, `serde_json`: Serialization.
  * `ignore`: Fast recursive directory iteration respecting `.gitignore`.
  * `walkdir`: Simple directory traversal.
  * `tokio`: Async runtime.
  * `reqwest`: LLM API calls (Anthropic, Ollama).
  * `eventsource-stream`: Server-Sent Events (Anthropic streaming).
  * `uuid`: Unique message IDs.
  * `chrono`: Timestamps.
  * `poem`: HTTP server framework.
  * `poem-openapi`: OpenAPI (Swagger) for non-streaming HTTP APIs.
* **JavaScript:**
  * `react-markdown`: Rendering chat responses.

## Safety & Sandbox

1. **Project Scope:** The application must strictly enforce that it does not read or write outside the `project_root` selected by the user.
2. **Human in the Loop:**
   * Shell commands that modify state (non-read-only) should ideally require a UI confirmation (configurable).
   * File writes must be confirmed or revertible.