Summary — 2026-05-28
Fourth pass — 2026-05-28 (SkillOpt — text-space skill optimiser)
One item this pass. Surfaced by Anthony from a viral social post claiming "you can now train agent skills the exact same way you train AI models." Verified upstream as Microsoft Research, arxiv 2605.23904 — SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Text-space optimiser that treats a single SKILL.md as the trainable artefact, runs a base/optimiser loop with held-out benchmark gating on accept/reject. Same family as DSPy / GEPA / TextGrad; better-engineered for the SKILL.md substrate Anthony already runs.
| Item | Source | P | Q | S | Total | Move |
|---|---|---|---|---|---|---|
| SkillOpt — text-space SKILL.md optimiser with benchmark-gated edits (Microsoft Research, MIT-licensed) | Anthony viral-post forward 2026-05-28; verified via firecrawl_search → arxiv 2605.23904 + microsoft.github.io/SkillOpt | 2 | 3 | 2 | 7/9 | Act on. Scaffold pilot playbook at SuperSebastian/playbooks/skillopt-eval.md. Pair with Taste Pipeline (cold-start) as the warm-loop counterpart. Pilot target: the SKILL.md with the most labelled negatives in failures.md (current candidate: ai-image brand-anchor wrapper). |
What the paper actually does
- Trainable artefact: a single natural-language SKILL.md. Model weights frozen. Edits live in text-space only: transferable across model providers verbatim.
- Loop: base model executes a task with the current SKILL.md. Optimiser model reads the trace + the held-out benchmark score. Optimiser proposes a rewrite. Rewrite is rolled forward on the eval set. Accepted only if score lifts. Rejected edits are discarded; no regression ships.
- Meta-guidance: the optimiser maintains its own meta-prompt about which kinds of edits work for this skill. Prepended to its own future calls. Not shipped with the target model.
- Beats hand-crafted prompts + TextGrad on the paper's benchmarks. Marketing-copy claim — but it's an arxiv-grade benchmark, not a vibe claim.
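A minimal sketch of the accept/reject gate described in the Loop bullet, not SkillOpt's actual code. `gated_optimise`, `propose_edit` and `score_skill` are hypothetical names; the toy scorer and toy optimiser stand in for real model calls. Simplification: the optimiser here sees only the current text and score, not the execution trace.

```python
from typing import Callable, Sequence

# Hypothetical signatures: SkillOpt's real interfaces are not shown in these notes.
ProposeEdit = Callable[[str, float], str]           # (skill text, current score) -> rewrite
ScoreSkill = Callable[[str, Sequence[str]], float]  # (skill text, held-out cases) -> score


def gated_optimise(
    skill_text: str,
    propose_edit: ProposeEdit,
    score_skill: ScoreSkill,
    held_out: Sequence[str],
    iterations: int = 10,
) -> tuple[str, float, int]:
    """Return (best skill text, its held-out score, number of accepted edits)."""
    best_text = skill_text
    best_score = score_skill(best_text, held_out)
    accepted = 0
    for _ in range(iterations):
        candidate = propose_edit(best_text, best_score)
        candidate_score = score_skill(candidate, held_out)
        # Gate: keep the rewrite only if the held-out score strictly lifts.
        if candidate_score > best_score:
            best_text, best_score = candidate, candidate_score
            accepted += 1
        # Otherwise the candidate is discarded and best_text is unchanged.
    return best_text, best_score, accepted


# Toy stand-ins so the sketch runs without any model calls.
def toy_score(text: str, cases: Sequence[str]) -> float:
    """Fraction of held-out phrases that appear in the skill text."""
    return sum(case in text for case in cases) / len(cases)


def make_toy_optimiser(rewrites):
    """Replay a fixed list of rewrites, then propose no further change."""
    queue = iter(rewrites)

    def propose(text: str, current_score: float) -> str:
        rewrite = next(queue, None)
        return text if rewrite is None else rewrite(text)

    return propose


held_out_cases = ["check brand anchor", "cite source", "use house palette"]
start_text = "# toy skill\n- use house palette"
proposals = [
    lambda t: t + "\n- check brand anchor",  # improves the score
    lambda t: "# toy skill",                 # regresses, must be rejected
    lambda t: t + "\n- cite source",         # improves the score
]

final_text, final_score, accepted_count = gated_optimise(
    start_text, make_toy_optimiser(proposals), toy_score, held_out_cases, iterations=5
)
print(accepted_count, final_score)
```

The strict `>` is the design choice that matters: a rewrite that merely ties the current score is discarded, so the text only changes when the benchmark says it helped.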
Why P=2 (not 3)
Anthony hand-edits SKILL.md files constantly. Real time recovery available. But: SkillOpt needs (a) a base+optimiser wiring per skill, (b) a held-out eval set with labelled ground-truth, (c) compute budget on the optimiser side. Setup tax is real. Not every SKILL.md has scoreable output — pm:pm outputs a status report, not a benchmarkable answer; deploy outputs a live URL, hard to grade. The skills that score well are the ones where wrong output has shape: ai-image (brand-anchor mismatch), de-Claudification (Will's slop scorecard), Sebastian four-line gate (rule-class miss). Promotes to P=3 after first pilot run shows real time recovery on at least one of these.
Why Q=3
Three structural quality moves we don't currently run anywhere:
- Benchmark-gated rejection. Optimiser cannot ship an edit that regresses the eval score. This is what the four-line SEB-gate (Rule 17, locked 2026-05-24) is trying to do via discipline — SkillOpt does it via compute. Discipline drifts; benchmarks don't.
- Held-out eval set per skill. Forces the skill author to specify, in advance, what "this skill is working" means. Currently implicit (and breakable).
- Meta-guidance accumulation. The optimiser learns which edit shapes work for this skill. Generalises beyond SkillOpt — any skill we tune ad-hoc gets a record of what worked.
Direct mitigation of the same failure class the Taste Pipeline 8/9 addresses — single-pass, vibe-based skill curation drifts. SkillOpt is the warm-loop counterpart: Taste seeds the skill from references; SkillOpt iterates it from execution telemetry.
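A rough sketch of the meta-guidance accumulation idea from the list above: keep a tally of which edit shapes were accepted or rejected for one skill, and render it as a prefix for the optimiser's next call. `MetaGuidance` and the edit-shape labels are hypothetical, chosen for illustration; the paper's actual meta-prompt format is not described in these notes.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MetaGuidance:
    """Per-skill record of which kinds of edit survived the benchmark gate."""

    accepted: Counter = field(default_factory=Counter)
    rejected: Counter = field(default_factory=Counter)

    def record(self, edit_shape: str, was_accepted: bool) -> None:
        """Tally one gated edit under its (hypothetical) shape label."""
        bucket = self.accepted if was_accepted else self.rejected
        bucket[edit_shape] += 1

    def as_prompt_prefix(self) -> str:
        """Render the tally as text to prepend to the optimiser's next call."""
        lines = ["Edit shapes tried so far on this skill:"]
        for shape in sorted(set(self.accepted) | set(self.rejected)):
            lines.append(
                f"- {shape}: {self.accepted[shape]} accepted, "
                f"{self.rejected[shape]} rejected"
            )
        return "\n".join(lines)


guidance = MetaGuidance()
guidance.record("add explicit rule", True)
guidance.record("add explicit rule", True)
guidance.record("delete section", False)
prefix = guidance.as_prompt_prefix()
print(prefix)
```

The same record works for skills tuned by hand: log each manual edit and whether it held up, and the history of what worked survives the session.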
Why S=2 (not 3)
Real Articulate product surface: "your brand-voice agent / mentor-in-pocket gets measurably better every week from your own corrections." Slots inside MEP brand-voice work and the Hermes Agent deployments — BossCouple pilot's Curator could be replaced or augmented with SkillOpt's benchmark-gated edits. Marketing Psychology Toolbox skills, Roche-Debbie mentor SKILL.md files, the per-client brand-voice skills — all candidates.
But: open-source, so the moat is not the optimiser. Moat is (a) the curated skill, (b) the labelled-failures ground-truth, (c) Anthony's taste defining what "better" means. S=3 only when a client signs a contract scoping "self-improving skill" as a discrete line item.
Class-match against failures.md
- 2026-05-24 SEB-gate four-strike (Strikes 1-4 in one day: Hermes provenance, SSH user anthony vs anthonybooth twice, computer-use mouse-driving when post-to-x SKILL.md was live). Each was a labelled negative against the Sebastian skill itself. SkillOpt's benchmark-gated loop is the structural version of the four-line gate. Direct hit.
- 2026-05-24 FAIL-VISUAL-MISREAD-OGILVY-MARK + FAIL-WORDMARK-PROXY-MISCALIBRATED: single-model misreads of brand references propagated downstream. ai-image SKILL.md's brand-anchor lookup is the wrapper that should have caught them. Now we have labels.
- 2026-05-18 Takis four-brand-cycle (brand-spec drift). Labelled negative against any brand-voice SKILL.md.
- 2026-05-20 mymentor carousel (wrong-tool hero-vs-editorial). Labelled negative against ai-image skill's tool-selection logic.
- Hermes Curator (live, BossCouple pilot) does an adjacent thing heuristically: cheap aux model auto-grades + consolidates ~/.hermes/skills/. SkillOpt is the benchmark-gated engineering of what Curator does on vibes. Decision needed: Curator + SkillOpt parallel, or SkillOpt replaces Curator. Filed for a SuperSebastian/decisions.md entry.
Pair with Taste Pipeline — not duplicate
Taste Pipeline (8/9, third pass) = cold-start skill curation from references. Six steps: refs → dual-model analysis → anonymised fusion → chunked synthesis → rule-set → skill. Produces v0 SKILL.md.
SkillOpt = warm-loop skill optimisation from execution telemetry + held-out eval set. Reads v0, runs it, gates edits on benchmark. Produces vN SKILL.md.
Combined: Taste creates the skill from your refs once; SkillOpt iterates it forever from your corrections. The two playbooks should ship paired.
Pilot path
Scaffold ~/ClaudeWork/Agents/services/SuperSebastian/playbooks/skillopt-eval.md (this pass). Pilot recipe:
- Subject selection. Rank candidates by labelled-failure count in failures.md. Current ranking: (a) ai-image brand-anchor wrapper, 3-4 labelled negatives in the visual-misread axis; (b) Sebastian four-line gate, 4 strikes on 2026-05-24 alone; (c) de-Claudification (Will), where every "AI slop" failures.md entry is a label.
- Eval set construction. Pull labelled negatives from failures.md matching the subject skill. Add positive examples (clean shipped artefacts) from the project deliverables/ folders. Minimum N=10 labelled examples per skill, 70/30 train/held-out split.
- Substrate. Claude Sonnet as base, OpenAI GPT-5.5 as optimiser. Same dual-frontier debias shape the Taste Pipeline uses. Allowlist api.openai.com already live. OpenAI key still pending capture in credentials.md: same blocker as the Taste Pipeline pilot (third pass §Pilot status). Unblock once, both pilots clear.
- Cost envelope. Optimiser loop runs roughly $1-5 per iteration; converging in 5-15 iterations per pilot skill = $5-75. Inside the $10/day gate on a single-skill run only at the low end; a run near the top of the range, like a multi-skill batch, needs day-budgeting.
- Acceptance criteria. Pilot succeeds if SkillOpt-optimised SKILL.md beats the current hand-edited version by ≥10pp on the held-out eval set. Lower than 10pp means the discipline already in place is doing most of the work — keep SkillOpt as a quarterly tune, not a continuous loop.
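The numbers in the recipe above can be sanity-checked with a short sketch. All function names are hypothetical; the figures (N=10 minimum, 70/30 split, $1-5 per iteration, 5-15 iterations, $10/day gate, ≥10pp lift) are taken from the recipe, and the example identifiers are placeholders, not real failures.md entries.

```python
import math
import random


def split_eval_set(examples, train_fraction=0.7, seed=0):
    """Shuffle labelled examples and split them into (train, held_out)."""
    if len(examples) < 10:
        raise ValueError("recipe asks for at least 10 labelled examples per skill")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cost_envelope(cost_per_iter=(1.0, 5.0), iterations=(5, 15), daily_gate=10.0):
    """Return (low, high, days_at_low, days_at_high) for one pilot skill."""
    low = cost_per_iter[0] * iterations[0]
    high = cost_per_iter[1] * iterations[1]
    return low, high, math.ceil(low / daily_gate), math.ceil(high / daily_gate)


def lift_pp(baseline_correct, optimised_correct, n_held_out):
    """Lift in percentage points on the held-out set, computed from raw counts."""
    return 100.0 * (optimised_correct - baseline_correct) / n_held_out


def pilot_passes(baseline_correct, optimised_correct, n_held_out, min_lift_pp=10.0):
    """Acceptance criterion: optimised skill beats the baseline by >= min_lift_pp."""
    return lift_pp(baseline_correct, optimised_correct, n_held_out) >= min_lift_pp


# Placeholder identifiers standing in for labelled examples.
labelled = [f"example-{i}" for i in range(10)]
train, held_out = split_eval_set(labelled)
low, high, days_low, days_high = cost_envelope()

print(len(train), len(held_out))        # 7 3
print(low, high, days_low, days_high)   # 5.0 75.0 1 8
print(pilot_passes(12, 14, 20))         # True: 2 more correct out of 20 is 10pp
```

One thing the arithmetic surfaces: at the minimum N=10 the held-out slice is 3 examples, so a single example moves the score by roughly 33pp. A 10pp threshold only becomes a meaningful resolution once the held-out slice has 10 or more examples, so a larger N may be worth the labelling effort.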
Re-score triggers
- (a) First pilot run on a single SKILL.md (recommended subject: ai-image) lifts brand-anchor correctness ≥10pp vs hand-curated baseline → P promotes to 3, total 8/9.
- (b) First paid client engagement scopes "self-improving skill" as a discrete line item (Marketing Engine Pilot add-on, Hermes deployment with continuous-tune) → S promotes to 3, total 8-9/9.
- (c) Hermes Curator gets replaced by SkillOpt in the BossCouple pilot loop → confirms substrate fit. No score change, but locks the pair (Taste + SkillOpt + Curator-as-SkillOpt) as the canonical skill-lifecycle stack.
- (d) DSPy or GEPA ships an equivalent benchmark-gated text-space loop with looser setup tax → re-evaluate which optimiser becomes Articulate default.
- (e) Failure mode discovered — SkillOpt-optimised SKILL.md beats benchmark but ships subtly off-brand voice or violates a hard rule the eval set didn't capture. Belinda + Will become the standing gate alongside the benchmark. Same shape as the four-line SEB-gate: compute is necessary, not sufficient.
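The score arithmetic behind triggers (a) and (b) as a small worked example. It assumes the total is a plain sum of P, Q and S with each capped at 3, which matches the 7/9 in the table; `total_score` and `apply_trigger` are hypothetical helper names.

```python
MAX_PER_AXIS = 3  # assumption: each of P, Q, S is scored out of 3, total out of 9


def total_score(scores):
    """Sum the three axis scores and format as 'N/9'."""
    for axis, value in scores.items():
        if value > MAX_PER_AXIS:
            raise ValueError(f"{axis}={value} exceeds the {MAX_PER_AXIS}-point cap")
    return f"{sum(scores.values())}/{MAX_PER_AXIS * len(scores)}"


def apply_trigger(scores, axis):
    """Return a copy of the scores with one axis promoted to the cap."""
    return {**scores, axis: MAX_PER_AXIS}


current = {"P": 2, "Q": 3, "S": 2}
after_a = apply_trigger(current, "P")        # trigger (a): pilot lift promotes P
after_b = apply_trigger(current, "S")        # trigger (b): paid scope promotes S
after_a_and_b = apply_trigger(after_a, "S")  # both triggers fired

print(total_score(current))        # 7/9
print(total_score(after_a))        # 8/9
print(total_score(after_b))        # 8/9
print(total_score(after_a_and_b))  # 9/9
```

This is why trigger (b) reads "8-9/9": S alone lifts the total to 8, and it reaches 9 only if trigger (a) has already promoted P.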