🔥 Verdict LLM Evaluation Results

Comprehensive local model benchmarking • March 29, 2026

Test Runs: 9 • Total Inferences: 62 • Models Tested: 4 • Total Cost: $0.00 • Annual Savings: $1,800

🏆 Overall Leaderboard

| Rank | Model | Score (/10) | Cases | Avg Latency | Status |
|------|----------------|-------------|-------|-------------|-----------------|
| 1 | Qwen 2.5 7B | 9.53 | 18 | 16.4s | ✓ Production |
| 2 | Qwen 7B (alt) | 9.44 | 18 | 3.9s | ✓ Production |
| 3 | DeepSeek R1 7B | 5.83 | 13 | 32.8s | ⚠ Judge errors |
| 4 | TinyLlama 1.1B | 3.43 | 13 | 20.9s | ✗ Not ready |
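
The Score and Avg Latency columns are per-model means over the graded cases. Verdict's internal record format isn't shown in this report; below is a minimal sketch of that kind of aggregation, assuming hypothetical per-case records with `model`, `score`, and `latency_s` fields.

```python
from collections import defaultdict

# Hypothetical per-case judge results; field names are illustrative,
# not Verdict's actual schema.
cases = [
    {"model": "qwen2.5:7b", "score": 9.8, "latency_s": 14.2},
    {"model": "qwen2.5:7b", "score": 9.3, "latency_s": 18.6},
    {"model": "tinyllama:1.1b", "score": 3.1, "latency_s": 21.4},
]

by_model = defaultdict(list)
for case in cases:
    by_model[case["model"]].append(case)

def mean(xs):
    return sum(xs) / len(xs)

# Rank models by mean judge score, descending; report mean latency alongside.
for model, rows in sorted(
    by_model.items(), key=lambda kv: mean([r["score"] for r in kv[1]]), reverse=True
):
    print(
        f"{model}: {mean([r['score'] for r in rows]):.2f}/10 "
        f"over {len(rows)} cases, {mean([r['latency_s'] for r in rows]):.1f}s avg"
    )
```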

⚡ Key Finding

Qwen 2.5 7B is production-ready at 9.53/10 — achieving near-perfect scores on structured tasks, multilingual comprehension, and code generation, all at zero cost.

📊 Test Suite Results

⚡ Speed Test

Winner: Qwen 2.5 7B • Score: 9.2/10 • Cases: 5 • Avg Latency: 1.2s • Perfect Scores: 4/5

🌍 Multilingual Test

Winner: Qwen 2.5 7B • Score: 10.0/10 ⭐ • Languages: 4 • Perfect Scores: 5/5 • Status: FLAWLESS

📝 Context Stress

Winner: Qwen 2.5 7B • Score: 9.73/10 • Cases: 3 • Avg Latency: 85.8s • 2× faster than DeepSeek

💻 Coding Benchmark

Winner: Qwen 2.5 7B • Score: 9.38/10 • Cases: 8 • Perfect Scores: 4/8 • Avg Latency: 4.2s

💡 Key Findings

  • Qwen 2.5 7B scored a perfect 10/10 on all 5 multilingual cases across four languages (Spanish, French, German, Japanese)
  • Production-ready for code generation — 9.38/10 average, perfect scores on TypeScript, Python, refactoring
  • 2× faster than DeepSeek with higher quality (9.53 vs 5.83, 16.4s vs 32.8s)
  • TinyLlama is unusable — 3.43/10, fails multilingual, context, and reasoning tasks
  • Cost savings: $1,800/year vs cloud for continuous evaluation workloads
  • Local models excel at structured tasks such as JSON, code, schemas, and fact extraction (see the JSON-constrained sketch after this list)
  • Verdict enables data-driven model selection at zero cost with unlimited testing
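
On the structured-task point above, here is a minimal sketch of forcing valid JSON from a local model using Ollama's documented `format: "json"` option on `/api/generate`. The prompt and extracted keys are illustrative, not taken from the actual test suite.

```python
import json

import requests

# Illustrative structured-extraction call. Ollama's `format: "json"` option
# constrains the reply to valid JSON; the prompt and keys are made up for
# this sketch, not drawn from the real benchmark cases.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b",
        "prompt": (
            "Return a JSON object with keys 'name' and 'year' for this fact: "
            "the Eiffel Tower opened in 1889."
        ),
        "format": "json",
        "stream": False,
    },
    timeout=120,
)
data = json.loads(resp.json()["response"])  # e.g. {"name": "Eiffel Tower", "year": 1889}
print(data)
```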

🎯 Recommendations

✅ Deploy: Qwen 2.5 7B

Use for: Code generation, structured output (JSON/schemas), multilingual content, documentation, fact extraction, math calculations

Avoid for: Novel creative writing, ambiguous edge cases, very long context (>4K tokens)

Config: ollama run qwen2.5:7b • 4.7 GB VRAM • 4096 token context
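
For programmatic use, a minimal sketch of querying the same model through Ollama's local REST API, assuming the default port 11434; `num_ctx` mirrors the 4096-token context recommended above.

```python
import requests

# Minimal sketch: query locally served qwen2.5:7b through Ollama's REST API.
# Assumes Ollama is running on its default port (11434).
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:7b",
        "prompt": "Summarize in one sentence: local LLMs cut evaluation costs to zero.",
        "stream": False,
        "options": {"num_ctx": 4096},  # matches the 4096-token context above
    },
    timeout=120,
)
print(resp.json()["response"])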

⚠️ Investigate: DeepSeek R1 7B

Shows promise, but its judge errors need debugging first; it could be a strong alternative once they are resolved.
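
One common source of judge errors is a judge reply that can't be parsed into a numeric score. A defensive parsing sketch, assuming (hypothetically) that the judge is asked to reply with JSON like `{"score": 8.5}`:

```python
import json

def parse_judge_score(raw: str) -> float | None:
    """Parse a judge reply expected to be JSON like {"score": 8.5}.

    Returns None on malformed output so the harness can retry or flag the
    case instead of crashing mid-run (the failure mode behind "judge errors").
    """
    try:
        score = float(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    # Reject out-of-range values rather than silently clamping them.
    return score if 0.0 <= score <= 10.0 else None

# Reasoning models often wrap the verdict in extra text, which breaks strict
# parsing and should surface as a retry/flag rather than a crash.
print(parse_judge_score('{"score": 8.5}'))            # 8.5
print(parse_judge_score('Thinking... {"score": 8}'))  # None -> retry or flag
```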

❌ Avoid: TinyLlama 1.1B

Not production-ready. Only suitable for trivial tasks or experimentation.