
# Model Selection Guide

## Overview
This application requires LLMs that support **tool calling** (function calling) and can **execute changes autonomously** rather than only suggest code. Not all models are suitable for agentic workflows.

## Recommended Models

### Primary Recommendation: GPT-OSS

**Model:** `gpt-oss:20b`
- **Size:** 13 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Excellent
- **Why:** OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses `write_file` to implement changes directly rather than suggesting code.

```bash
ollama pull gpt-oss:20b
```

### Alternative Options

#### Llama 3.1 (Best Balance)
**Model:** `llama3.1:8b`
- **Size:** 4.7 GB
- **Context:** 128K tokens
- **Tool Support:** ✅ Excellent
- **Autonomous Behavior:** ✅ Good
- **Why:** Industry standard for tool calling. Well-documented, reliable, and smaller than GPT-OSS.

```bash
ollama pull llama3.1:8b
```

#### Qwen 2.5 Coder (Coding Focused)
**Model:** `qwen2.5-coder:7b` or `qwen2.5-coder:14b`
- **Size:** 4.5 GB / 9 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Trained specifically for coding tasks. Use Qwen **2.5**, not Qwen 3: Qwen3-Coder tends to suggest code instead of writing files (see "Models to Avoid").

```bash
ollama pull qwen2.5-coder:7b
# or for more capability:
ollama pull qwen2.5-coder:14b
```

#### Mistral (General Purpose)
**Model:** `mistral:7b`
- **Size:** 4 GB
- **Context:** 32K tokens
- **Tool Support:** ✅ Good
- **Autonomous Behavior:** ✅ Good
- **Why:** Fast, efficient, and good at following instructions.

```bash
ollama pull mistral:7b
```

## Models to Avoid

### ❌ Qwen3-Coder
**Problem:** Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using `write_file` to implement changes directly.

**Status:** Works for reading files and analysis, but not recommended for autonomous coding.

### ❌ DeepSeek-Coder-V2
**Problem:** Does not support tool calling at all.

**Error:** `"registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"`
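A wrapper script can detect this failure by matching the error text. The following is a minimal sketch, assuming the message reaches the script as plain text on stdin; the function name `is_tools_unsupported_error` is hypothetical and not part of Ollama or this application.

```bash
# Succeeds (exit 0) when the text on stdin contains the
# "does not support tools" error quoted above.
is_tools_unsupported_error() {
  grep -q 'does not support tools'
}

# Example using the error message from this section:
msg='registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools'
if printf '%s\n' "$msg" | is_tools_unsupported_error; then
  echo "model lacks tool support; choose another model"
fi
```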

### ❌ StarCoder / CodeLlama (older versions)
**Problem:** Most older coding models either lack tool calling or handle it unreliably.

## How to Verify Tool Support

Check if a model supports tools on the Ollama library page:
```
https://ollama.com/library/<model-name>
```

Look for the "Tools" tag in the model's capabilities.

You can also check locally:
```bash
ollama show <model-name>
```
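The local check can be scripted. This sketch assumes your installed Ollama version prints a "Capabilities" section in the `ollama show` output with `tools` on its own line; older releases may format this differently, so treat the pattern as an assumption to verify. The helper name `has_tools_capability` is hypothetical.

```bash
# Reads `ollama show` output on stdin and succeeds if some line
# consists only of the word "tools" (ignoring surrounding whitespace).
has_tools_capability() {
  grep -qiE '^[[:space:]]*tools[[:space:]]*$'
}

# Usage (requires a local Ollama install, so it is left commented out):
#   ollama show llama3.1:8b | has_tools_capability && echo "tools supported"
```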

## Model Selection Criteria

When choosing a model for autonomous coding, prioritize:

1. **Tool Calling Support** - Must support function calling natively
2. **Autonomous Behavior** - Trained to execute rather than suggest
3. **Context Window** - Larger is better for complex projects (32K minimum, 128K ideal)
4. **Size vs Performance** - Balance between model size and your hardware
5. **Prompt Adherence** - Follows system instructions reliably

## Testing a New Model

To test if a model works for autonomous coding:

1. Select it in the UI dropdown
2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
3. **Expected behavior:** Uses the `write_file` tool and creates the file
4. **Bad behavior:** Suggests code in markdown blocks or asks what you want to do

If it suggests code instead of writing it, the model is not suitable for this application.
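After running the prompt, you can confirm the result on disk instead of relying on the chat output. This is a minimal sketch, assuming the model writes `test.txt` into the directory you pass in (the project root by default); the function name `verify_test_file` is hypothetical.

```bash
# Prints PASS if <dir>/test.txt exists and contains "Hello World",
# otherwise prints FAIL and returns a non-zero status.
verify_test_file() {
  file="${1:-.}/test.txt"
  if [ -f "$file" ] && grep -q 'Hello World' "$file"; then
    echo "PASS"
  else
    echo "FAIL"
    return 1
  fi
}

# Usage, from the project root after the model has responded:
#   verify_test_file .
```

A FAIL here, combined with a markdown code block in the chat, indicates the "bad behavior" case described above.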

## Context Window Management

Current context usage (approximate):
- System prompts: ~1,000 tokens
- Tool definitions: ~300 tokens
- Per message overhead: ~50-100 tokens
- Average conversation: 2-5K tokens

Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
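The estimate above can be reproduced with simple arithmetic. The sketch below uses the fixed costs listed in this section (about 1,000 tokens of system prompts plus about 300 tokens of tool definitions); the per-exchange figure of 1,000 tokens is an illustrative assumption, not a measured value, and `max_exchanges` is a hypothetical helper.

```bash
# max_exchanges WINDOW_TOKENS TOKENS_PER_EXCHANGE
# Estimates how many exchanges fit in a context window after
# subtracting the fixed prompt and tool-definition overhead.
max_exchanges() {
  fixed=$((1000 + 300))   # system prompts + tool definitions
  echo $(( ($1 - fixed) / $2 ))
}

max_exchanges 32768 1000    # 32K window: prints 31
max_exchanges 131072 1000   # 128K window: prints 129
```

Under these assumptions a 32K model runs out of room at roughly the same point as the 30-turn limit, while a 128K model has considerable headroom.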

## Performance Notes

**Speed:** Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.

**Hardware (approximate memory needed):**
- 8B models: ~8 GB RAM
- 20B models: ~16 GB RAM
- 70B models: ~48 GB RAM (quantized)

**Recommendation:** Start with `llama3.1:8b` for speed, upgrade to `gpt-oss:20b` for reliability.
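The hardware guidance can be turned into a small lookup. This is a sketch only: the thresholds come from the approximate RAM figures above, the input is whole gigabytes of available RAM, and `recommend_model` is a hypothetical helper rather than part of the application.

```bash
# recommend_model RAM_GB
# Prints the model suggested in this guide for the given amount of RAM.
recommend_model() {
  ram_gb=$1
  if [ "$ram_gb" -ge 16 ]; then
    echo "gpt-oss:20b"      # enough memory for the 20B model
  elif [ "$ram_gb" -ge 8 ]; then
    echo "llama3.1:8b"      # enough memory for an 8B model
  else
    echo "none"             # below the smallest figure listed above
    return 1
  fi
}

recommend_model 16   # prints gpt-oss:20b
recommend_model 8    # prints llama3.1:8b
```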

## Summary

**For this application:**
1. **Best overall:** `gpt-oss:20b` (proven autonomous behavior)
2. **Best balance:** `llama3.1:8b` (fast, reliable, well-supported)
3. **For coding:** `qwen2.5-coder:7b` (specialized, but smaller context)

**Avoid:** Qwen3-Coder, DeepSeek-Coder-V2, and any model without tool support.