Your Agent Skill Is Trainable State
AI agents have two layers.
One layer is frozen. That is the model. For most teams, the weights sit behind an API. You do not change them.
The other layer is yours. That is the skill, the system prompt, the instruction file, the agent memory, the procedural guide that tells the agent how to work.
For a while, most of us treated that second layer like documentation.
Write it once.
Paste it into the agent.
Patch it when something breaks.
That made sense when agents were mostly assistants. But once agents start doing real work, that approach starts to feel too loose.
A skill is not just prose.
It is the part of the agent system you can actually improve.
And if you can improve it, you should probably train it.
The shift
A May 2026 paper called SkillOpt makes a simple but important argument: agent skills should be optimized like external state for a frozen model.
That sounds academic, but the idea is practical.
Instead of hand-writing a skill and hoping it works, you run the agent against real tasks. You look at what failed. You propose small edits. You test those edits against a held-out set. You keep the edit only if it improves performance.
That is the difference between prompt tinkering and training.
Prompt tinkering says:
“This sounds better.”
Skill training says:
“This scored better.”
That distinction matters.
Key takeaways
- A skill is not just a prompt. It is trainable external state for a frozen agent.
- Good skill optimization needs a few basic controls: small edits, validation, memory of failed edits, and separation between fast changes and slower lessons.
- In SkillOpt, most of the gains came from only 1 to 4 accepted edits.
- The final skills stayed compact, under 2,000 tokens.
- The optimization happens offline, so the deployed agent does not get slower at runtime.
A skill should learn from failure
Most agent instruction files never see their own failures.
A team writes a CLAUDE.md, AGENTS.md, or skill file. It explains the repo, names a few rules, gives some preferences, and maybe includes examples.
That is useful. But it is also static.
The agent can fail the same way again and again because the skill has no feedback loop. Somebody has to notice the failure, remember it, and update the file manually.
That does not scale well.
The better pattern looks more like this:
- Run the agent on a batch of real tasks.
- Score the result.
- Separate what worked from what failed.
- Propose a small skill edit.
- Test the edited skill on a held-out set.
- Keep the edit only if it wins.
- Remember the edits that lost.
That is not magic. It is just discipline.
But that discipline is what turns a skill from a static document into something that can improve.
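Here is a minimal sketch of that loop in Python. It is not SkillOpt's implementation. The names `run_and_score` and `propose_edit` are hypothetical callables you would supply yourself, and the sketch assumes scores fall between 0 and 1, with 1.0 meaning the task passed.

```python
from statistics import mean
from typing import Callable, Sequence

Task = dict  # hypothetical task record, e.g. {"input": ..., "expected": ...}


def evaluate(skill: str, tasks: Sequence[Task],
             run_and_score: Callable[[str, Task], float]) -> float:
    """Mean score of the agent using `skill` across `tasks`."""
    return mean(run_and_score(skill, task) for task in tasks)


def optimize_skill(skill, train_tasks, heldout_tasks, run_and_score,
                   propose_edit, rounds=5):
    """Run the propose / test / keep-or-reject loop for a fixed number of rounds."""
    rejected = []  # edits that lost on the held-out set
    best = evaluate(skill, heldout_tasks, run_and_score)
    for _ in range(rounds):
        # Run and score the training batch, then separate wins from failures.
        results = [(task, run_and_score(skill, task)) for task in train_tasks]
        failures = [task for task, score in results if score < 1.0]
        if not failures:
            break
        # Ask the optimizer for one small edit, showing it what already lost.
        candidate = propose_edit(skill, failures, rejected)
        if candidate == skill or candidate in rejected:
            continue
        # Keep the edit only if it wins on tasks it was not derived from.
        score = evaluate(candidate, heldout_tasks, run_and_score)
        if score > best:
            skill, best = candidate, score
        else:
            rejected.append(candidate)
    return skill, best, rejected
```

In practice `run_and_score` would call your agent and grade the result, and `propose_edit` would call an optimizer model. Both are left abstract here on purpose.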
The edit budget matters
One of the strongest ideas in SkillOpt is the “textual learning rate.”
In normal training, you do not let every update swing wildly. You control the step size.
The same idea applies to agent skills.
Do not rewrite the whole skill because one task failed. That is how you lose good instructions, introduce contradictions, and overfit to one weird example.
Make one or two focused edits.
Then test.
Small changes preserve what already works while giving the skill room to improve.
That feels obvious once you say it, but a lot of agent workflows do the opposite. They ask a model to rewrite the whole instruction file after every batch of feedback. That can sound cleaner, but cleaner is not the same as better.
A skill should improve because it performs better, not because it reads better.
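One way to enforce an edit budget is to measure the diff before the edit is even evaluated. The sketch below uses Python's standard `difflib` and counts added and removed lines. The function names and the default budget of four changed lines are assumptions for illustration, not values from the paper.

```python
import difflib


def changed_lines(old: str, new: str) -> int:
    """Count lines added or removed between two versions of a skill.

    A rewritten line counts twice: once as removed, once as added.
    """
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


def within_budget(old: str, new: str, max_changed_lines: int = 4) -> bool:
    """Reject proposals that touch more of the skill than the budget allows."""
    return changed_lines(old, new) <= max_changed_lines
```

A proposal that fails `within_budget` can be discarded before you spend any evaluation runs on it.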
The validation gate is the real unlock
The most important part is the gate.
A proposed edit should not ship just because it sounds smart.
It should ship because it improves the score on tasks the optimizer did not train against.
That is what prevents the skill from chasing noise.
It also changes the role of the model. The optimizer model can suggest edits, but it does not get final authority. Measurement does.
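A gate like this can be very small. The sketch below splits tasks once with a fixed seed, so the held-out set stays the same across runs, and accepts an edit only when its held-out mean beats the baseline. The 30 percent split and the `min_gain` margin are assumptions you would tune for your own eval set.

```python
import random
from statistics import mean


def split_tasks(tasks, heldout_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed and return (train, heldout)."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * heldout_fraction))
    return shuffled[cut:], shuffled[:cut]


def passes_gate(baseline_scores, candidate_scores, min_gain=0.0):
    """Accept only if the candidate's held-out mean beats the baseline by more than min_gain."""
    return mean(candidate_scores) - mean(baseline_scores) > min_gain
```

Setting `min_gain` above zero is one way to avoid accepting edits whose improvement is within the noise of a small eval set.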
That is the right shape for agent systems more broadly.
Agents can propose.
Tools can check.
Policy can enforce.
Humans can approve.
The future is not agents running wild, and it is not humans manually reviewing every tiny step.
It is measured autonomy.
Failed edits are useful
One of my favorite details in SkillOpt is that rejected edits are kept.
That matters.
A failed edit is not wasted work. It is evidence.
If an edit lowered the score, the system remembers it and uses that memory to avoid making the same mistake again. That is how the skill avoids looping on the same bad idea.
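A rejected-edit memory does not need to be elaborate. The sketch below is one possible shape, not the structure SkillOpt uses: it records each rejected edit with its held-out score change, checks whether a proposal has been tried before, and renders recent rejections as text you could include in the optimizer's prompt. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial rewordings still match."""
    return " ".join(text.lower().split())


@dataclass
class RejectedEdit:
    description: str    # what the edit tried to change
    score_delta: float  # held-out score change (zero or negative)


@dataclass
class EditMemory:
    rejected: list = field(default_factory=list)

    def record(self, description: str, score_delta: float) -> None:
        self.rejected.append(RejectedEdit(description, score_delta))

    def seen(self, description: str) -> bool:
        """True if an equivalent edit was already rejected."""
        key = _normalise(description)
        return any(_normalise(r.description) == key for r in self.rejected)

    def as_prompt_section(self, limit: int = 5) -> str:
        """Render the most recent rejections for the optimizer's prompt."""
        recent = self.rejected[-limit:]
        lines = [f"- {r.description} (held-out delta {r.score_delta:+.2f})"
                 for r in recent]
        return "Previously rejected edits:\n" + "\n".join(lines)
```

The exact-match check only catches near-identical proposals. Showing the rejected list to the optimizer model is what lets it steer away from the same idea phrased differently.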
This maps well onto how real engineering teams already work.
We already do this informally. We remember bad migrations, bad abstractions, bad security exceptions, bad deployment patterns. The problem is that this memory usually lives in people’s heads, Slack threads, and old PR comments.
Agent systems need that memory in the loop.
What teams can do now
You do not need a full research implementation to use the idea.
Start small.
Pick one important agent workflow. Maybe it is writing tests. Maybe it is fixing lint failures. Maybe it is updating Terraform. Maybe it is reviewing dependency changes.
Then build a tiny evaluation set.
Real tasks.
Known-good outputs.
A simple scoring method.
A skill file you can edit.
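Those four pieces fit in a few lines of Python. This is a sketch under assumptions: `EvalTask`, `exact_match`, and `run_eval` are hypothetical names, and `agent` stands in for whatever function calls your agent with a skill and a prompt and returns its output.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalTask:
    prompt: str    # a real task taken from your workflow
    expected: str  # the known-good output


def exact_match(output: str, expected: str) -> float:
    """Simplest possible scorer: 1.0 on a match, ignoring surrounding whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(skill: str, tasks: Sequence[EvalTask],
             agent: Callable[[str, str], str]) -> float:
    """Mean score of the agent across the eval set when using `skill`."""
    return mean(exact_match(agent(skill, t.prompt), t.expected) for t in tasks)
```

Exact match is too strict for many workflows. For something like writing tests or fixing lint failures, a scorer that runs the test suite or the linter and returns pass or fail would likely be a better fit.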
From there, treat the skill like something you train.
- Change one or two instructions at a time.
- Run the eval.
- Keep only changes that improve the score.
- Roll back changes that do not.
- Keep a short list of failed edits so you do not repeat them.
- Keep the skill compact and procedural.
This is especially important for files like AGENTS.md and CLAUDE.md.
Those files should not become giant manuals. They should be maps. They should route the agent to the right context, the right commands, the right constraints, and the right project-specific rules.
If the skill is too long, the agent drowns in it.
If the skill is measured, it can stay small and still get better.
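A size check can sit next to the eval as a second gate. The sketch below estimates tokens with a rough four-characters-per-token heuristic, which is an assumption: real counts depend on the model's tokenizer, so swap in the actual tokenizer if you need precision. The 2,000-token default mirrors the size reported for the final SkillOpt skills.

```python
def approx_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. Real tokenizers will give different counts."""
    return int(len(text) / chars_per_token)


def within_size_budget(skill_text: str, max_tokens: int = 2000) -> bool:
    """True if the skill's estimated size stays within the token budget."""
    return approx_tokens(skill_text) <= max_tokens
```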