SkillOpt trains agent skills like text weights | Radar

Microsoft is showing SkillOpt, an optimizer that improves an agent skill file without changing model weights. For teams building agents, the important part is the validation gate, not another layer of prompt mysticism.

Microsoft moves training from the model into the skill file

The Microsoft Research primary page was blocked during verification, so this article relies on Microsoft's GitHub repository, the project page and the arXiv paper metadata. According to those sources, SkillOpt treats an ordinary markdown skill file as trainable state for a frozen LLM agent. The model stays fixed. What changes is the instruction document placed in the agent's context.

The loop is familiar from machine learning, translated into text. The agent runs tasks and collects scored trajectories. An optimizer then proposes small add, delete and replace edits to the skill, and each change is accepted only if it improves a held-out validation score. The deployed artifact is best_skill.md, typically 300 to 2,000 tokens.
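To make the acceptance rule concrete, here is a minimal Python sketch of one pass through such a gate. It is not SkillOpt's implementation: the Edit class, apply_edit, gate and the keyword-counting toy_score are all hypothetical names invented for illustration, and a real scorer would run the agent on held-out tasks rather than count words.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One small edit to a skill file (hypothetical format, not SkillOpt's)."""
    op: str                     # "add", "delete" or "replace"
    index: int                  # line position the edit applies to
    text: Optional[str] = None  # new line for "add" and "replace"


def apply_edit(lines: List[str], edit: Edit) -> List[str]:
    """Return a new list of skill lines with the edit applied; input is not mutated."""
    out = list(lines)
    if edit.op == "add":
        out.insert(edit.index, edit.text or "")
    elif edit.op == "delete":
        del out[edit.index]
    elif edit.op == "replace":
        out[edit.index] = edit.text or ""
    else:
        raise ValueError(f"unknown op: {edit.op}")
    return out


def gate(
    lines: List[str],
    edit: Edit,
    validation_score: Callable[[List[str]], float],
) -> Tuple[List[str], bool]:
    """Keep the edit only if the held-out validation score strictly improves."""
    candidate = apply_edit(lines, edit)
    if validation_score(candidate) > validation_score(lines):
        return candidate, True
    return lines, False


def toy_score(lines: List[str]) -> int:
    """Stand-in scorer (assumption): counts which target behaviours the skill mentions."""
    text = "\n".join(lines).lower()
    return sum(kw in text for kw in ("verify", "units"))


skill = ["Read the task carefully."]
skill, added = gate(skill, Edit("add", 1, "Verify totals before answering."), toy_score)
skill, restyled = gate(skill, Edit("replace", 0, "Be brilliant."), toy_score)
print(added, restyled, skill)
```

The first edit raises the toy score and is kept. The second only restyles a line, leaves the score unchanged and is discarded, which is the whole point of gating on measurement rather than on how the sentence reads.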

Microsoft reports results on six benchmarks, seven target models and three harnesses: direct chat, Codex CLI and Claude Code CLI. SkillOpt is described as best or tied-best across all 52 evaluated cells. On GPT-5.5, Microsoft reports gains of 23.5 points in direct chat, 24.8 points inside the Codex loop and 19.1 points inside Claude Code.

Agent teams get a new place to engineer reliability

The point is not that another system can rewrite a prompt. Plenty of tools can do that. SkillOpt pushes skill work closer to engineering discipline: every change has a budget, a validation split, a memory of rejected edits and a measured acceptance rule.
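A rough sketch of how a budget, a rejected-edit memory and an acceptance rule might fit together, again in Python. Everything here is an assumption made for illustration: the optimize function, the choice to count the budget in validation runs, and the toy scorer are not taken from Microsoft's code, and a real optimizer would generate proposals from trajectories instead of reading them from a fixed list.

```python
from typing import Callable, Iterable, Set, Tuple


def optimize(
    skill: str,
    proposals: Iterable[str],
    score: Callable[[str], float],
    budget: int,
) -> Tuple[str, float, Set[str], int]:
    """Walk candidate skills, spending at most `budget` validation runs."""
    best, best_score = skill, score(skill)
    rejected: Set[str] = set()   # memory of candidates that failed the gate
    spent = 0                    # validation runs used so far
    for candidate in proposals:
        if spent >= budget:
            break                # budget exhausted, stop proposing
        if candidate == best or candidate in rejected:
            continue             # known outcome, do not pay to re-evaluate
        spent += 1
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score  # acceptance rule
        else:
            rejected.add(candidate)
    return best, best_score, rejected, spent


def toy_score(skill: str) -> int:
    """Stand-in validation scorer (assumption): rewards two target behaviours."""
    return sum(kw in skill.lower() for kw in ("verify", "stop"))


proposals = [
    "Be brilliant.",
    "Verify totals.",
    "Be brilliant.",                      # repeat of a rejected candidate
    "Verify totals. Stop when done.",
    "Verify totals. Stop when done. Thanks!",
]
best, best_score, rejected, spent = optimize("", proposals, toy_score, budget=3)
print(best, best_score, rejected, spent)
```

The repeated "Be brilliant." costs nothing because the memory already knows it failed, and the last proposal is never scored because the three-run budget is spent. Both limits are visible in the return value, which is what makes the run reviewable afterwards.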

For enterprise teams, the useful cases are repetitive agent tasks: document handling, spreadsheet manipulation, tool use and controlled workflows. In those domains, procedural knowledge can be written down and tested. The skill file becomes an auditable piece of runtime behavior, not a mysterious layer inside the model.

The second point is awkward for classic prompt engineering. If instructions are not versioned, tested, and accepted or rejected against a metric, they are not production configuration. They are a note stuck to a monitor.

The validation split decides whether this learns or just tunes the benchmark

The weak point is the same as in every optimization system: evaluation quality. SkillOpt can improve only what the scorer can measure. For precise benchmarks, that is sensible. For open-ended agent behavior, where quality includes judgment, timing and safe stopping, the score can quickly become a cartoon of reality.

Transfer outside demo tasks is the other risk. Microsoft reports transfer across models, scales and harnesses, but production teams will need to repeat the measurement on their own data. Without that, best_skill.md is just a confident text file with a nice pedigree.

The signal is whether teams start testing skills like code

The next signal is not GitHub stars, but adoption inside CI and internal agent platforms. If companies start requiring evals, review and rollback for skill files the way they do for code, SkillOpt has found a real pain point.
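What such a requirement could look like inside a pipeline is sketched below in Python. This is a hypothetical check, not part of SkillOpt or any CI product: check_skill_change, GateResult, the min_gain threshold and the max_words cap are invented names, and the whitespace word count is only a crude stand-in for a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GateResult:
    passed: bool
    reasons: List[str] = field(default_factory=list)  # why the change was blocked


def check_skill_change(
    baseline_score: float,
    candidate_score: float,
    candidate_text: str,
    min_gain: float = 0.0,   # hypothetical: required improvement over baseline
    max_words: int = 1500,   # hypothetical size cap, word count as a token proxy
) -> GateResult:
    """Block a skill-file change that regresses the eval or outgrows its size cap."""
    reasons: List[str] = []
    if candidate_score < baseline_score + min_gain:
        reasons.append(
            f"eval score {candidate_score:.3f} is below required "
            f"{baseline_score + min_gain:.3f}"
        )
    words = len(candidate_text.split())
    if words > max_words:
        reasons.append(f"skill has {words} words, cap is {max_words}")
    return GateResult(passed=not reasons, reasons=reasons)


ok = check_skill_change(0.70, 0.74, "Verify totals before answering.")
regressed = check_skill_change(0.70, 0.65, "Be brilliant.")
bloated = check_skill_change(0.70, 0.74, "word " * 2000)
print(ok.passed, regressed.reasons, bloated.reasons)
```

A failing result would block the merge and leave the previous best_skill.md deployed, so rollback is just the default outcome of a failed check.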

If it stays in paper tables, it will be an interesting optimizer for bounded benchmarks. With agents, the boring thing wins: repeatable measurement before every deployment.

Lilith's verdict

SkillOpt tries to give prompts the weight of a server-room door: if you change one, you should pass a badge reader, not just write a prettier sentence in markdown.