Model Selection Guide
Overview
This application requires LLMs that support tool calling (function calling) and that are capable of autonomous execution rather than just code suggestion. Not all models are suitable for agentic workflows.
Recommended Models
Primary Recommendation: GPT-OSS
Model: gpt-oss:20b
- Size: 13 GB
- Context: 128K tokens
- Tool Support: ✅ Excellent
- Autonomous Behavior: ✅ Excellent
- Why: OpenAI's open-weight model specifically designed for "agentic tasks". Reliably uses write_file to implement changes directly rather than suggesting code.
ollama pull gpt-oss:20b
Alternative Options
Llama 3.1 (Best Balance)
Model: llama3.1:8b
- Size: 4.7 GB
- Context: 128K tokens
- Tool Support: ✅ Excellent
- Autonomous Behavior: ✅ Good
- Why: Industry standard for tool calling. Well-documented, reliable, and smaller than GPT-OSS.
ollama pull llama3.1:8b
Qwen 2.5 Coder (Coding Focused)
Model: qwen2.5-coder:7b or qwen2.5-coder:14b
- Size: 4.5 GB / 9 GB
- Context: 32K tokens
- Tool Support: ✅ Good
- Autonomous Behavior: ✅ Good
- Why: Specifically trained for coding tasks. Note: Use Qwen 2.5, NOT Qwen 3.
ollama pull qwen2.5-coder:7b
# or for more capability:
ollama pull qwen2.5-coder:14b
Mistral (General Purpose)
Model: mistral:7b
- Size: 4 GB
- Context: 32K tokens
- Tool Support: ✅ Good
- Autonomous Behavior: ✅ Good
- Why: Fast, efficient, and good at following instructions.
ollama pull mistral:7b
Models to Avoid
❌ Qwen3-Coder
Problem: Despite supporting tool calling, Qwen3-Coder is trained more as a "helpful assistant" and tends to suggest code in markdown blocks rather than using write_file to implement changes directly.
Status: Works for reading files and analysis, but not recommended for autonomous coding.
❌ DeepSeek-Coder-V2
Problem: Does not support tool calling at all.
Error: "registry.ollama.ai/library/deepseek-coder-v2:latest does not support tools"
❌ StarCoder / CodeLlama (older versions)
Problem: Most older coding models don't support tool calling or do it poorly.
How to Verify Tool Support
Check if a model supports tools on the Ollama library page:
https://ollama.com/library/<model-name>
Look for the "Tools" tag in the model's capabilities.
You can also check locally:
ollama show <model-name>
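You can also check programmatically: recent Ollama server versions expose the same metadata through the /api/show REST endpoint, whose response includes a capabilities list (e.g. "completion", "tools"). Below is a minimal sketch; the has_tools helper name and the default localhost host are illustrative assumptions, and the capabilities field requires a reasonably recent Ollama version.

```python
import json
import urllib.request

def has_tools(show_response: dict) -> bool:
    """Check the capabilities list in an /api/show response for tool support."""
    return "tools" in show_response.get("capabilities", [])

def model_supports_tools(name: str, host: str = "http://localhost:11434") -> bool:
    """Fetch model metadata from a local Ollama server and test for tool support."""
    req = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": name}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return has_tools(json.load(resp))
```

If the capabilities field is absent (older servers), fall back to the Ollama library page or the "does not support tools" error at request time.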
Model Selection Criteria
When choosing a model for autonomous coding, prioritize:
- Tool Calling Support - Must support function calling natively
- Autonomous Behavior - Trained to execute rather than suggest
- Context Window - Larger is better for complex projects (32K minimum, 128K ideal)
- Size vs Performance - Balance between model size and your hardware
- Prompt Adherence - Follows system instructions reliably
Testing a New Model
To test if a model works for autonomous coding:
- Select it in the UI dropdown
- Ask it to create a simple file: "Create a new file called test.txt with 'Hello World' in it"
- Expected behavior: Uses the write_file tool and creates the file
- Bad behavior: Suggests code in markdown blocks or asks what you want to do
If it suggests code instead of writing it, the model is not suitable for this application.
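The smoke test above can also be scripted against Ollama's /api/chat endpoint, which accepts a tools array and returns any tool calls in the response message. This is a sketch, not the application's actual code: the write_file tool schema shown here is an illustrative guess at what the app registers, and only the called_write_file check matters for the verdict.

```python
import json
import urllib.request

# Illustrative tool schema; the real application's write_file definition may differ.
WRITE_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write content to a file",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}

def called_write_file(message: dict) -> bool:
    """True if the assistant message contains a write_file tool call."""
    return any(
        call.get("function", {}).get("name") == "write_file"
        for call in message.get("tool_calls", [])
    )

def probe_model(name: str, host: str = "http://localhost:11434") -> bool:
    """Send the smoke-test prompt; report whether the model acted autonomously."""
    body = {
        "model": name,
        "messages": [{
            "role": "user",
            "content": "Create a new file called test.txt with 'Hello World' in it",
        }],
        "tools": [WRITE_FILE_TOOL],
        "stream": False,
    }
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return called_write_file(json.load(resp)["message"])
```

A True result means the model emitted a tool call; False means it answered in prose or markdown, i.e. the "bad behavior" described above.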
Context Window Management
Current context usage (approximate):
- System prompts: ~1,000 tokens
- Tool definitions: ~300 tokens
- Per message overhead: ~50-100 tokens
- Average conversation: 2-5K tokens
Most models will handle 20-30 exchanges before context becomes an issue. The agent loop is limited to 30 turns to prevent context exhaustion.
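The bookkeeping above can be sketched as a simple budget check. The constants come straight from the approximate figures listed; the 75-token overhead (midpoint of 50-100), the 90% headroom threshold, and the function names are illustrative assumptions, not the application's actual implementation.

```python
# Approximate figures from the context-usage breakdown above.
SYSTEM_PROMPT_TOKENS = 1_000
TOOL_DEF_TOKENS = 300
PER_MESSAGE_OVERHEAD = 75  # midpoint of the 50-100 range
MAX_TURNS = 30             # agent loop cap to prevent context exhaustion

def estimated_usage(message_tokens: list[int]) -> int:
    """Estimate total context usage; the list holds per-message content token counts."""
    fixed = SYSTEM_PROMPT_TOKENS + TOOL_DEF_TOKENS
    return fixed + sum(n + PER_MESSAGE_OVERHEAD for n in message_tokens)

def should_stop(turn: int, message_tokens: list[int],
                context_window: int = 32_768) -> bool:
    """Stop the loop at the turn cap or when usage exceeds 90% of the window."""
    return turn >= MAX_TURNS or estimated_usage(message_tokens) > 0.9 * context_window
```

With a 32K window this leaves room for roughly 20-30 exchanges of typical size, matching the estimate above; a 128K model pushes the limit to the 30-turn cap instead.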
Performance Notes
Speed: Smaller models (3B-8B) are faster but less capable. Larger models (20B-70B) are more reliable but slower.
Hardware:
- 8B models: ~8 GB RAM
- 20B models: ~16 GB RAM
- 70B models: ~48 GB RAM (quantized)
Recommendation: Start with llama3.1:8b for speed, upgrade to gpt-oss:20b for reliability.
Summary
For this application:
- Best overall: gpt-oss:20b (proven autonomous behavior)
- Best balance: llama3.1:8b (fast, reliable, well-supported)
- For coding: qwen2.5-coder:7b (specialized, but smaller context)
Avoid: Qwen3-Coder, DeepSeek-Coder-V2, any model without tool support.