Compare | Lilith AI

Top picks

Three quick choices without scrolling: one for the hardest work, one for cheap volume and one for teams that do not want to depend only on a closed API.
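As a rough illustration of what "cheap volume" means for the three picks, the sketch below multiplies each card's listed input price by a workload size. It assumes "$/M" means US dollars per million input tokens, which the page does not state outright. The 500M-token workload and the `input_cost` helper are hypothetical, and output-token pricing is ignored.

```python
# Input prices as listed on the three top-pick cards.
# Assumption: "/M" means USD per million input tokens.
INPUT_PRICE_PER_M = {
    "Claude Fable 5": 10.00,
    "GPT-5.5": 5.00,
    "DeepSeek V4 Flash": 0.14,
}


def input_cost(model: str, input_tokens: int) -> float:
    """Estimate the input-side cost in USD for a given number of input tokens."""
    return INPUT_PRICE_PER_M[model] * input_tokens / 1_000_000


if __name__ == "__main__":
    monthly_tokens = 500_000_000  # hypothetical workload, not from the page
    for model in INPUT_PRICE_PER_M:
        print(f"{model}: ${input_cost(model, monthly_tokens):,.2f}")
```

This covers input cost only. It says nothing about quality or latency, which the cards treat separately.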

Claude Fable 5

top pick

Anthropic

My read

coding pick backed by current AA.

✓ Use for: coding · AI agents · company knowledge

✗ Skip for: mass volume · strict latency

The sources list Claude Fable 5 at IQ 64.9 and input $10/M. Consider it for coding and AI agents; for mass volume or strict latency, run a second benchmark before rollout.

premium, when mistakes hurt · market frontier · verified by external data

Signals: market frontier · premium, when mistakes hurt · coding · AI agents


GPT-5.5

top pick

OpenAI

My read

coding pick backed by current AA.

✓ Use for: coding · AI agents · document extraction

✗ Skip for: self-hosted stack · strict latency

The sources list GPT-5.5 at IQ 60.2, input $5/M and DeepSWE pass@1 70.0%. Consider it for coding and AI agents; for a self-hosted stack or strict latency, run a second benchmark before rollout.

mid-budget · market frontier · verified by external data

Signals: market frontier · mid-budget · coding · AI agents


DeepSeek V4 Flash

top pick

DeepSeek

My read

batch pick backed by current AA.

✓ Use for: large batches · fast replies · self-hosted stack

✗ Skip for: deep frontier reasoning · top coding

The sources list DeepSeek V4 Flash at IQ 46.5 and input $0.14/M. Consider it for large batches and fast replies; for deep frontier reasoning or top coding, run a second benchmark before rollout.

open / self-run · specialist · verified by external data

Signals: specialist · open / self-run · large batches · fast replies


When the shortlist is not enough

This is the broader catalogue. Filters narrow it by situation; details keep hard numbers and sources out of the first read.
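A minimal sketch of how filtering by situation could work, using the "Use for" and "Skip for" tags from a few of the cards below. The `Card` type, the `shortlist` function and its matching rule are assumptions for illustration, not the page's actual filter logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Card:
    name: str
    use_for: frozenset
    skip_for: frozenset


# Tags copied from five catalogue cards; the data structure itself is hypothetical.
CATALOGUE = [
    Card("Claude Fable 5",
         frozenset({"coding", "AI agents", "company knowledge"}),
         frozenset({"mass volume", "strict latency"})),
    Card("GPT-5.5",
         frozenset({"coding", "AI agents", "document extraction"}),
         frozenset({"self-hosted stack", "strict latency"})),
    Card("DeepSeek V4 Flash",
         frozenset({"large batches", "fast replies", "self-hosted stack"}),
         frozenset({"deep frontier reasoning", "top coding"})),
    Card("DeepSeek V4 Pro",
         frozenset({"large batches", "coding", "self-hosted stack"}),
         frozenset({"company controls and audit", "premium agents"})),
    Card("GLM-5.1",
         frozenset({"self-hosted stack", "sensitive deployments", "large batches"}),
         frozenset({"premium agents", "top coding"})),
]


def shortlist(cards, need, constraints=()):
    """Keep cards whose "Use for" tags cover every needed tag and whose
    "Skip for" tags contain none of the stated constraints."""
    need, constraints = set(need), set(constraints)
    return [
        card.name
        for card in cards
        if need <= card.use_for and not (constraints & card.skip_for)
    ]


if __name__ == "__main__":
    print(shortlist(CATALOGUE, {"coding"}, {"strict latency"}))
```

Requiring every needed tag is a deliberately strict rule. A real filter might rank partial matches instead of excluding them.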

Claude Fable 5

Anthropic

My read

coding pick backed by current AA.

✓ Use for: coding · AI agents · company knowledge

✗ Skip for: mass volume · strict latency

The sources list Claude Fable 5 at IQ 64.9 and input $10/M. Consider it for coding and AI agents; for mass volume or strict latency, run a second benchmark before rollout.

premium, when mistakes hurt · market frontier · verified by external data

Signals: market frontier · premium, when mistakes hurt · coding · AI agents


Claude Opus 4.8

Anthropic

My read

coding pick backed by current AA.

✓ Use for: coding · AI agents · company knowledge

✗ Skip for: mass volume · strict latency

The sources list Claude Opus 4.8 at IQ 61.4, input $5/M and DeepSWE pass@1 58.2%. Consider it for coding and AI agents; for mass volume or strict latency, run a second benchmark before rollout.

mid-budget · market frontier · verified by external data

Signals: market frontier · mid-budget · coding · AI agents


GPT-5.5

OpenAI

My read

coding pick backed by current AA.

✓ Use for: coding · AI agents · document extraction

✗ Skip for: self-hosted stack · strict latency

The sources list GPT-5.5 at IQ 60.2, input $5/M and DeepSWE pass@1 70.0%. Consider it for coding and AI agents; for a self-hosted stack or strict latency, run a second benchmark before rollout.

mid-budget · market frontier · verified by external data

Signals: market frontier · mid-budget · coding · AI agents


Gemini 3.1 Pro Preview

Google

My read

RAG pick backed by current AA.

✓ Use for: company knowledge · vision and multimodal · multilingual content

✗ Skip for: strict latency · self-hosted stack

The sources list Gemini 3.1 Pro Preview at IQ 57.2, input $2/M and DeepSWE pass@1 9.7%. Consider it for company knowledge and for vision and multimodal work; for strict latency or a self-hosted stack, run a second benchmark before rollout.

mid-budget · market frontier · verified by external data

Signals: market frontier · mid-budget · company knowledge · vision and multimodal


Qwen3.7 Max

Alibaba

My read

multilingual pick backed by current AA.

✓ Use for: multilingual content · coding · large batches

✗ Skip for: self-hosted stack · premium agents

The sources list Qwen3.7 Max at IQ 56.6, input $2.5/M and DeepSWE pass@1 17.7%. Consider it for multilingual content and coding; for a self-hosted stack or premium agents, run a second benchmark before rollout.

mid-budget · market frontier · verified by external data

Signals: market frontier · mid-budget · multilingual content · coding


Gemini 3.5 Flash

Google

My read

batch pick backed by current AA.

✓ Use for: large batches · company knowledge · fast replies

✗ Skip for: hard coding work · strict latency

The sources list Gemini 3.5 Flash at IQ 55.3, input $1.5/M and DeepSWE pass@1 28.3%. Consider it for large batches and company knowledge; for hard coding work or strict latency, run a second benchmark before rollout.

mid-budget · specialist · verified by external data

Signals: specialist · mid-budget · large batches · company knowledge


Kimi K2.6

Moonshot

My read

batch pick backed by current AA.

✓ Use for: large batches · coding · company knowledge

✗ Skip for: company controls and audit · reliable tool use

The sources list Kimi K2.6 at IQ 53.9, input $0.95/M and DeepSWE pass@1 23.9%. Consider it for large batches and coding; for company controls and audit or reliable tool use, run a second benchmark before rollout.

cheap at volume · specialist · verified by external data

Signals: specialist · cheap at volume · large batches · coding


Claude Sonnet 4.6

Anthropic

My read

coding pick backed by current AA.

✓ Use for: coding · AI agents · company knowledge

✗ Skip for: self-hosted stack · mass volume

The sources list Claude Sonnet 4.6 at IQ 51.7, input $3/M and DeepSWE pass@1 31.8%. Consider it for coding and AI agents; for a self-hosted stack or mass volume, run a second benchmark before rollout.

mid-budget · specialist · verified by external data

Signals: specialist · mid-budget · coding · AI agents


GLM-5.1

Z.AI/Zhipu

My read

self-hosted pick backed by current AA.

✓ Use for: self-hosted stack · sensitive deployments · large batches

✗ Skip for: premium agents · top coding

The sources list GLM-5.1 at IQ 51.4, input $1.4/M and DeepSWE pass@1 17.5%. Consider it for a self-hosted stack and sensitive deployments; for premium agents or top coding, run a second benchmark before rollout.

open / self-run · specialist · verified by external data

Signals: specialist · open / self-run · self-hosted stack · sensitive deployments


DeepSeek V4 Flash

DeepSeek

My read

batch pick backed by current AA.

✓ Use for: large batches · fast replies · self-hosted stack

✗ Skip for: deep frontier reasoning · top coding

The sources list DeepSeek V4 Flash at IQ 46.5 and input $0.14/M. Consider it for large batches and fast replies; for deep frontier reasoning or top coding, run a second benchmark before rollout.

open / self-run · specialist · verified by external data

Signals: specialist · open / self-run · large batches · fast replies


DeepSeek V4 Pro

DeepSeek

My read

batch pick backed by current AA.

✓ Use for: large batches · coding · self-hosted stack

✗ Skip for: company controls and audit · premium agents

The sources list DeepSeek V4 Pro at IQ 51.5, input $0.435/M and DeepSWE pass@1 7.5%. Consider it for large batches and coding; for company controls and audit or premium agents, run a second benchmark before rollout.

open / self-run · specialist · verified by external data

Signals: specialist · open / self-run · large batches · coding


Command A+

Cohere

My read

RAG pick backed by current AA.

✓ Use for: company knowledge · document extraction · sensitive deployments

✗ Skip for: top coding · deep frontier reasoning

The sources list Command A+ at IQ 37.2 and input $0/M. Consider it for company knowledge and document extraction; for top coding or deep frontier reasoning, run a second benchmark before rollout.

cheap at volume · specialist · verified by external data

Signals: specialist · cheap at volume · company knowledge · document extraction


Grok 4.3

xAI

My read

coding pick backed by current AA.

✓ Use for: coding · fast replies · document extraction

✗ Skip for: self-hosted stack · sensitive deployments

The sources list Grok 4.3 at IQ 53.2 and input $1.25/M. Consider it for coding and fast replies; for a self-hosted stack or sensitive deployments, run a second benchmark before rollout.

mid-budget · specialist · verified by external data

Signals: specialist · mid-budget · coding · fast replies


Llama 4 Maverick

Mistral Medium 3.5

Mistral AI

My read

compliance pick backed by current AA.

✓ Use for: sensitive deployments · company knowledge · document extraction

✗ Skip for: deep frontier reasoning · top coding

The sources list Mistral Medium 3.5 at IQ 39.2 and input $1.5/M. Consider it for sensitive deployments and company knowledge; for deep frontier reasoning or top coding, run a second benchmark before rollout.

mid-budget · specialist · verified by external data

Signals: specialist · mid-budget · sensitive deployments · company knowledge


Mistral Large 3

Mistral AI

My read

compliance pick backed by current AA.

✓ Use for: sensitive deployments · document extraction · large batches

✗ Skip for: deep frontier reasoning · AI agents

The sources list Mistral Large 3 at IQ 22.8 and input $0.5/M. Consider it for sensitive deployments and document extraction; for deep frontier reasoning or AI agents, run a second benchmark before rollout.

cheap at volume · specialist · verified by external data

Signals: specialist · cheap at volume · sensitive deployments · document extraction


How much to trust this

Curated decision snapshot, not a live leaderboard. Primary data comes from Artificial Analysis (AA), with LMArena, LLM Stats, Aider, SWE-Bench, DeepSWE and the HF Open LLM Leaderboard used when available during the run. The page is reviewed roughly every two weeks, and cards without data show a loading state instead of invented numbers.