storkit/.story_kit/specs/tech/MODEL_SELECTION.md
2026-02-16 15:44:20 +00:00


Model Selection Guide

Overview

This application requires LLMs that support tool calling (function calling) and are capable of autonomous execution rather than just code suggestion. Not all models are suitable for agentic workflows.

Primary Recommendation: GPT-OSS

Model: gpt-oss:20b

  • Size: 13 GB
  • Context: 128K tokens
  • Tool Support: Excellent
  • Autonomous Behavior: Excellent
  • Why: OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses write_file to implement changes directly rather than suggesting code.
ollama pull gpt-oss:20b

Alternative Options

Llama 3.1 (Best Balance)

Model: llama3.1:8b

  • Size: 4.7 GB
  • Context: 128K tokens
  • Tool Support: Excellent
  • Autonomous Behavior: Good
  • Why: Industry standard for tool calling. Well-documented, reliable, and smaller than GPT-OSS.
ollama pull llama3.1:8b

Qwen 2.5 Coder (Coding Focused)

Model: qwen2.5-coder:7b or qwen2.5-coder:14b

  • Size: 4.5 GB / 9 GB
  • Context: 32K tokens
  • Tool Support: Good
  • Autonomous Behavior: Good
  • Why: Specifically trained for coding tasks. Note: Use Qwen 2.5, NOT Qwen 3.
ollama pull qwen2.5-coder:7b
# or for more capability:
ollama pull qwen2.5-coder:14b

Mistral (General Purpose)

Model: mistral:7b

  • Size: 4 GB
  • Context: 32K tokens
  • Tool Support: Good
  • Autonomous Behavior: Good
  • Why: Fast, efficient, and good at following instructions.
ollama pull mistral:7b

Models to Avoid

Qwen3-Coder

Problem: Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using write_file to implement changes directly.

Status: Works for reading files and analysis, but not recommended for autonomous coding.

DeepSeek-Coder-V2

Problem: Does not support tool calling at all.

Error: "registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"

StarCoder / CodeLlama (older versions)

Problem: Most older coding models don't support tool calling or do it poorly.

How to Verify Tool Support

Check if a model supports tools on the Ollama library page:

https://ollama.com/library/<model-name>

Look for the "Tools" tag in the model's capabilities.

You can also check locally:

ollama show <model-name>

The Capabilities section of the output lists "tools" for tool-capable models.
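
This check can also be scripted against the local Ollama server's REST API. A minimal sketch, assuming the POST /api/show endpoint and its capabilities field as present in recent Ollama releases:

```python
import json
import urllib.request

def supports_tools(show_response: dict) -> bool:
    """True if a model's /api/show metadata lists tool support."""
    return "tools" in show_response.get("capabilities", [])

def check_model(name: str, host: str = "http://localhost:11434") -> bool:
    """Ask a running Ollama server whether a model supports tools."""
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": name}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return supports_tools(json.load(resp))

# Example (requires a running Ollama server):
# check_model("llama3.1:8b")
```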

Model Selection Criteria

When choosing a model for autonomous coding, prioritize:

  1. Tool Calling Support - Must support function calling natively
  2. Autonomous Behavior - Trained to execute rather than suggest
  3. Context Window - Larger is better for complex projects (32K minimum, 128K ideal)
  4. Size vs Performance - Balance between model size and your hardware
  5. Prompt Adherence - Follows system instructions reliably
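
The hard requirements among these criteria can be encoded as a simple screening filter. A sketch — the model entries below are illustrative, taken from the tables above:

```python
RECOMMENDED_MIN_CONTEXT = 32_000  # tokens; 128K is ideal

MODELS = {  # illustrative entries from the tables above
    "gpt-oss:20b":       {"tools": True,  "context": 128_000, "size_gb": 13},
    "llama3.1:8b":       {"tools": True,  "context": 128_000, "size_gb": 4.7},
    "qwen2.5-coder:7b":  {"tools": True,  "context": 32_000,  "size_gb": 4.5},
    "deepseek-coder-v2": {"tools": False, "context": 128_000, "size_gb": 8.9},
}

def suitable(info: dict, max_size_gb: float) -> bool:
    """Hard requirements: native tool calling, minimum context window,
    and a size that fits the available hardware."""
    return (info["tools"]
            and info["context"] >= RECOMMENDED_MIN_CONTEXT
            and info["size_gb"] <= max_size_gb)

# Models that pass on a machine with room for ~14 GB of weights:
picks = [name for name, info in MODELS.items() if suitable(info, 14)]
```

The filter only captures criteria 1, 3, and 4; autonomous behavior and prompt adherence still need the hands-on test described below.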

Testing a New Model

To test if a model works for autonomous coding:

  1. Select it in the UI dropdown
  2. Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
  3. Expected behavior: Uses write_file tool and creates the file
  4. Bad behavior: Suggests code in markdown blocks or asks what you want to do

If it suggests code instead of writing it, the model is not suitable for this application.
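
The pass/fail judgment in steps 3-4 can be automated. The sketch below assumes the message shape returned by Ollama's /api/chat endpoint (message.tool_calls) and the write_file tool name used by this application:

```python
def used_write_file(message: dict) -> bool:
    """True if the model actually invoked the write_file tool
    instead of replying with prose."""
    calls = message.get("tool_calls") or []
    return any(c.get("function", {}).get("name") == "write_file" for c in calls)

def looks_like_suggestion(message: dict) -> bool:
    """Heuristic: a fenced code block in the plain content means the
    model is suggesting code rather than executing."""
    return "```" in message.get("content", "")

# Sample responses in the assumed /api/chat message shape:
good = {"role": "assistant", "content": "",
        "tool_calls": [{"function": {"name": "write_file",
                                     "arguments": {"path": "test.txt",
                                                   "content": "Hello World"}}}]}
bad = {"role": "assistant", "tool_calls": [],
       "content": "Here you go:\n```\nHello World\n```"}
```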

Context Window Management

Current context usage (approximate):

  • System prompts: ~1,000 tokens
  • Tool definitions: ~300 tokens
  • Per-message overhead: ~50-100 tokens
  • Average conversation: 2-5K tokens

Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
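
A back-of-envelope budget using the figures above (the per-exchange average is an assumption, not a measured value):

```python
def exchanges_before_full(context_tokens: int,
                          system_prompt: int = 1_000,
                          tool_defs: int = 300,
                          tokens_per_exchange: int = 1_000) -> int:
    """Rough count of user/assistant exchanges that fit in the context
    window after fixed overhead (system prompt + tool definitions)."""
    usable = context_tokens - system_prompt - tool_defs
    return usable // tokens_per_exchange

# A 32K-context model fits roughly 31 exchanges under these assumptions;
# 128K comfortably exceeds the 30-turn agent-loop cap.
```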

Performance Notes

Speed: Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.

Hardware:

  • 8B models: ~8 GB RAM
  • 20B models: ~16 GB RAM
  • 70B models: ~48 GB RAM (quantized)

Recommendation: Start with llama3.1:8b for speed, upgrade to gpt-oss:20b for reliability.

Summary

For this application:

  1. Best overall: gpt-oss:20b (proven autonomous behavior)
  2. Best balance: llama3.1:8b (fast, reliable, well-supported)
  3. For coding: qwen2.5-coder:7b (specialized, but smaller context)

Avoid: Qwen3-Coder, DeepSeek-Coder-V2, any model without tool support.