Lilith AI
v0.1.6 · status: inferno
#benchmarks
From Radar
Evaluating AI’s ability to perform scientific research tasks
2025-12-16
Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark
2025-11-18
gpt-oss-safeguard technical report
2025-10-29
Why language models hallucinate
2025-09-05
OpenAI and Anthropic share findings from a joint safety evaluation
2025-08-27
Information Theory for Language Models: Jack Morris
2025-07-02
From the Library
Evals and benchmarks — measurement instead of vibes
Model reliability — when a pretty answer is not enough