
Model Selection Guide

Overview

This application requires LLM models that support tool calling (function calling) and can execute changes autonomously rather than merely suggesting code. Not all models are suitable for agentic workflows.

Primary Recommendation: GPT-OSS

Model: gpt-oss:20b

  • Size: 13 GB
  • Context: 128K tokens
  • Tool Support: Excellent
  • Autonomous Behavior: Excellent
  • Why: OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses write_file to implement changes directly rather than suggesting code.
ollama pull gpt-oss:20b
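
To make "tool support" concrete, here is a minimal sketch of what a tools-enabled request to Ollama's /api/chat endpoint looks like. The write_file schema below is illustrative only; the application's actual tool definitions may differ in names and parameters.

```python
import json

# Hypothetical write_file tool definition, in the JSON-schema format
# Ollama's /api/chat endpoint accepts for tool calling.
WRITE_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Create or overwrite a file. Use this to implement changes directly.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "Path of the file to write"},
                "content": {"type": "string", "description": "Full file contents"},
            },
            "required": ["path", "content"],
        },
    },
}

def build_chat_request(model: str, user_message: str) -> dict:
    """Assemble a non-streaming chat request that advertises the tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WRITE_FILE_TOOL],
        "stream": False,
    }

payload = build_chat_request("gpt-oss:20b", "Create hello.txt containing 'Hello World'")
print(json.dumps(payload, indent=2))
```

A model with good tool support responds to such a request with a structured tool call rather than prose; a model without it either errors out or ignores the tools field.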

Alternative Options

Llama 3.1 (Best Balance)

Model: llama3.1:8b

  • Size: 4.7 GB
  • Context: 128K tokens
  • Tool Support: Excellent
  • Autonomous Behavior: Good
  • Why: Industry standard for tool calling. Well-documented, reliable, and smaller than GPT-OSS.
ollama pull llama3.1:8b

Qwen 2.5 Coder (Coding Focused)

Model: qwen2.5-coder:7b or qwen2.5-coder:14b

  • Size: 4.5 GB / 9 GB
  • Context: 32K tokens
  • Tool Support: Good
  • Autonomous Behavior: Good
  • Why: Specifically trained for coding tasks. Note: Use Qwen 2.5, NOT Qwen 3 (see Models to Avoid below).
ollama pull qwen2.5-coder:7b
# or for more capability:
ollama pull qwen2.5-coder:14b

Mistral (General Purpose)

Model: mistral:7b

  • Size: 4 GB
  • Context: 32K tokens
  • Tool Support: Good
  • Autonomous Behavior: Good
  • Why: Fast, efficient, and good at following instructions.
ollama pull mistral:7b

Models to Avoid

Qwen3-Coder

Problem: Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using write_file to implement changes directly.

Status: Works for reading files and analysis, but not recommended for autonomous coding.

DeepSeek-Coder-V2

Problem: Does not support tool calling at all.

Error: "registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"

StarCoder / CodeLlama (older versions)

Problem: Most older coding models don't support tool calling or do it poorly.

How to Verify Tool Support

Check if a model supports tools on the Ollama library page:

https://ollama.com/library/<model-name>

Look for the "Tools" tag in the model's capabilities.

You can also check locally:

ollama show <model-name>
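
The local check can be scripted. This sketch assumes the output format of recent Ollama releases, where `ollama show` prints a Capabilities section (e.g. a "Capabilities" header followed by indented entries like "completion" and "tools"); older versions may format this differently.

```python
import subprocess

def model_supports_tools(show_output: str) -> bool:
    """Scan `ollama show` output for a 'tools' entry under Capabilities.

    Assumes section headers (e.g. "  Capabilities") are indented less
    than the entries beneath them, as in recent Ollama releases.
    """
    in_capabilities = False
    for raw in show_output.splitlines():
        line = raw.strip().lower()
        if not line:
            continue
        indent = len(raw) - len(raw.lstrip())
        if indent <= 2:  # section header line
            in_capabilities = line == "capabilities"
        elif in_capabilities and line == "tools":
            return True
    return False

def check_model(name: str) -> bool:
    """Run `ollama show <name>` and inspect its output (requires Ollama installed)."""
    result = subprocess.run(["ollama", "show", name], capture_output=True, text=True)
    return model_supports_tools(result.stdout)
```

Note that this only confirms the model advertises tool support; whether it actually uses tools autonomously still has to be verified behaviorally (see Testing a New Model below).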

Model Selection Criteria

When choosing a model for autonomous coding, prioritize:

  1. Tool Calling Support - Must support function calling natively
  2. Autonomous Behavior - Trained to execute rather than suggest
  3. Context Window - Larger is better for complex projects (32K minimum, 128K ideal)
  4. Size vs Performance - Balance between model size and your hardware
  5. Prompt Adherence - Follows system instructions reliably

Testing a New Model

To test if a model works for autonomous coding:

  1. Select it in the UI dropdown
  2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
  3. Expected behavior: Uses write_file tool and creates the file
  4. Bad behavior: Suggests code in markdown blocks or asks what you want to do

If it suggests code instead of writing it, the model is not suitable for this application.
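
The manual check above can be automated. This sketch assumes the Ollama /api/chat response shape, where an assistant message carries tool invocations under a tool_calls field; the response dicts shown are illustrative, not captured from a real run.

```python
def classify_response(message: dict) -> str:
    """Classify an assistant message as autonomous tool use,
    a code suggestion, or plain chatter."""
    if message.get("tool_calls"):
        return "tool_use"        # good: the model acted
    if "```" in message.get("content", ""):
        return "suggestion"      # bad: markdown code block instead of a tool call
    return "chatter"             # bad: asked a question or narrated

# Illustrative responses (hypothetical, for demonstration):
acted = {"role": "assistant", "content": "", "tool_calls": [
    {"function": {"name": "write_file",
                  "arguments": {"path": "test.txt", "content": "Hello World"}}}]}
suggested = {"role": "assistant",
             "content": "Here you go:\n```\nHello World\n```"}

print(classify_response(acted))      # tool_use
print(classify_response(suggested))  # suggestion
```

Only a model that consistently lands in the "tool_use" bucket for file-creation prompts is suitable for this application.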

Context Window Management

Current context usage (approximate):

  • System prompts: ~1,000 tokens
  • Tool definitions: ~300 tokens
  • Per-message overhead: ~50-100 tokens
  • Average conversation: 2-5K tokens

Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
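
A back-of-the-envelope budget check using the estimates above. The 600-token average message body is an assumption (tool outputs such as file reads can dominate); all counts are approximate.

```python
# Rough context-budget check; all token counts are estimates from this guide.
SYSTEM_PROMPT_TOKENS = 1_000
TOOL_DEFS_TOKENS = 300
PER_MESSAGE_OVERHEAD = 100   # upper end of the 50-100 range
MAX_TURNS = 30               # the agent-loop cap

def fits_in_context(context_window: int, avg_message_tokens: int = 600) -> bool:
    """Estimate whether a full 30-turn agent loop fits in the context window.

    Each turn is assumed to add one request and one response message;
    avg_message_tokens is a guessed typical message body size.
    """
    per_turn = 2 * (PER_MESSAGE_OVERHEAD + avg_message_tokens)
    needed = SYSTEM_PROMPT_TOKENS + TOOL_DEFS_TOKENS + MAX_TURNS * per_turn
    return needed <= context_window

print(fits_in_context(32_000))   # False
print(fits_in_context(128_000))  # True
```

Under these assumptions a worst-case 30-turn loop overruns a 32K window but fits comfortably in 128K, which is why the guide lists 32K as the minimum and 128K as ideal.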

Performance Notes

Speed: Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.

Hardware:

  • 8B models: ~8 GB RAM
  • 20B models: ~16 GB RAM
  • 70B models: ~48 GB RAM (quantized)

Recommendation: Start with llama3.1:8b for speed, upgrade to gpt-oss:20b for reliability.

Summary

For this application:

  1. Best overall: gpt-oss:20b (proven autonomous behavior)
  2. Best balance: llama3.1:8b (fast, reliable, well-supported)
  3. For coding: qwen2.5-coder:7b (specialized, but smaller context)

Avoid: Qwen3-Coder, DeepSeek-Coder-V2, any model without tool support.