Summary — 2026-05-28

Fourth pass — 2026-05-28 (SkillOpt — text-space skill optimiser)

One item this pass. Surfaced by Anthony from a viral social post claiming "you can now train agent skills the exact same way you train AI models." Verified upstream as Microsoft Research, arxiv 2605.23904 — SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Text-space optimiser that treats a single SKILL.md as the trainable artefact, runs a base/optimiser loop with held-out benchmark gating on accept/reject. Same family as DSPy / GEPA / TextGrad; better-engineered for the SKILL.md substrate Anthony already runs.

Item	Source	P	Q	S	Total	Move
SkillOpt — text-space SKILL.md optimiser with benchmark-gated edits (Microsoft Research, MIT-licensed)	Anthony viral-post forward 2026-05-28; verified via firecrawl_search → arxiv 2605.23904 + microsoft.github.io/SkillOpt	2	3	2	7/9	Act on. Scaffold pilot playbook at `SuperSebastian/playbooks/skillopt-eval.md`. Pair with Taste Pipeline (cold-start) as the warm-loop counterpart. Pilot target: the SKILL.md with the most labelled negatives in `failures.md` (current candidate: `ai-image` brand-anchor wrapper).

What the paper actually does

Trainable artefact: a single natural-language SKILL.md. Model weights frozen. Edits live in text-space only — transferable across model providers verbatim.
Loop: base model executes a task with the current SKILL.md. Optimiser model reads the trace + the held-out benchmark score. Optimiser proposes a rewrite. Rewrite is rolled forward on the eval set. Accepted only if score lifts. Rejected edits are discarded — no regression ships.
Meta-guidance: the optimiser maintains its own meta-prompt about which kinds of edits work for this skill. Prepended to its own future calls. Not shipped with the target model.
Results: reported to beat hand-crafted prompts and TextGrad on the paper's own benchmarks. The headline reads like marketing copy, but it rests on an arxiv-grade benchmark, not a vibe claim.
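A minimal sketch of the accept/reject gate and meta-guidance accumulation described above, not SkillOpt's actual API. Everything here is an assumption for illustration: `optimise_skill`, `propose`, `score`, the toy rule list and the toy edits are hypothetical names, the execution trace is omitted, and the "benchmark" is a keyword check standing in for a real held-out eval.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures (assumptions, not the SkillOpt interface):
#   propose(skill, current_score, meta_guidance) -> (candidate_skill, note)
#   score(skill) -> benchmark score, higher is better
Proposer = Callable[[str, float, List[str]], Tuple[str, str]]
Scorer = Callable[[str], float]


def optimise_skill(skill: str, propose: Proposer, score: Scorer, max_iters: int = 10):
    """Keep a rewrite only if it lifts the benchmark score; log every attempt."""
    best_score = score(skill)
    meta_guidance: List[str] = []  # fed back to the optimiser, never shipped with the skill
    for _ in range(max_iters):
        candidate, note = propose(skill, best_score, meta_guidance)
        candidate_score = score(candidate)
        accepted = candidate_score > best_score  # strict lift, so no regression ships
        if accepted:
            skill, best_score = candidate, candidate_score
        meta_guidance.append(("kept: " if accepted else "rejected: ") + note)
    return skill, best_score, meta_guidance


# Toy stand-ins so the sketch runs end to end.
RULES = ["use brand anchor", "check wordmark"]


def toy_score(skill: str) -> float:
    """Fraction of required rules present in the skill text."""
    return sum(rule in skill for rule in RULES) / len(RULES)


def make_toy_proposer() -> Proposer:
    edits = [
        ("append brand-anchor rule", lambda s: s + "\n- use brand anchor"),
        ("strip all rules", lambda s: "# skill"),  # a regression the gate should reject
        ("append wordmark rule", lambda s: s + "\n- check wordmark"),
    ]
    state = {"i": 0}

    def propose(skill: str, current_score: float, meta_guidance: List[str]):
        note, edit = edits[state["i"] % len(edits)]
        state["i"] += 1
        return edit(skill), note

    return propose


final_skill, final_score, guidance = optimise_skill(
    "# skill", make_toy_proposer(), toy_score, max_iters=3
)
print(final_score)
print(guidance)
```

In the toy run the second proposal regresses the score and is discarded, while the note about it still lands in the guidance log. That is the property the memo cares about: rejected edits leave a record but never reach the shipped SKILL.md.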

Why P=2 (not 3)

Anthony hand-edits SKILL.md files constantly. Real time recovery available. But: SkillOpt needs (a) a base+optimiser wiring per skill, (b) a held-out eval set with labelled ground-truth, (c) compute budget on the optimiser side. Setup tax is real. Not every SKILL.md has scoreable output — pm:pm outputs a status report, not a benchmarkable answer; deploy outputs a live URL, hard to grade. The skills that score well are the ones where wrong output has shape: ai-image (brand-anchor mismatch), de-Claudification (Will's slop scorecard), Sebastian four-line gate (rule-class miss). Promotes to P=3 after first pilot run shows real time recovery on at least one of these.

Why Q=3

Three structural quality moves we don't currently run anywhere:

Benchmark-gated rejection. Optimiser cannot ship an edit that regresses the eval score. This is what the four-line SEB-gate (Rule 17, locked 2026-05-24) is trying to do via discipline — SkillOpt does it via compute. Discipline drifts; benchmarks don't.
Held-out eval set per skill. Forces the skill author to specify, in advance, what "this skill is working" means. Currently implicit (and breakable).
Meta-guidance accumulation. The optimiser learns which edit shapes work for this skill. Generalises beyond SkillOpt — any skill we tune ad-hoc gets a record of what worked.

Direct mitigation of the same failure class the Taste Pipeline 8/9 addresses — single-pass, vibe-based skill curation drifts. SkillOpt is the warm-loop counterpart: Taste seeds the skill from references; SkillOpt iterates it from execution telemetry.

Why S=2 (not 3)

Real Articulate product surface: "your brand-voice agent / mentor-in-pocket gets measurably better every week from your own corrections." Slots inside MEP brand-voice work and the Hermes Agent deployments — BossCouple pilot's Curator could be replaced or augmented with SkillOpt's benchmark-gated edits. Marketing Psychology Toolbox skills, Roche-Debbie mentor SKILL.md files, the per-client brand-voice skills — all candidates.

But: open-source, so the moat is not the optimiser. Moat is (a) the curated skill, (b) the labelled-failures ground-truth, (c) Anthony's taste defining what "better" means. S=3 only when a client signs a contract scoping "self-improving skill" as a discrete line item.

Class-match against `failures.md`

2026-05-24 SEB-gate four-strike (Strikes 1-4 in one day — Hermes provenance, SSH user anthony vs anthonybooth twice, computer-use mouse-driving when post-to-x SKILL.md was live). Each was a labelled negative against the Sebastian skill itself. SkillOpt's benchmark-gated loop is the structural version of the four-line gate. Direct hit.
2026-05-24 FAIL-VISUAL-MISREAD-OGILVY-MARK + FAIL-WORDMARK-PROXY-MISCALIBRATED — single-model misreads of brand references propagated downstream. ai-image SKILL.md's brand-anchor lookup is the wrapper that should have caught them. Now we have labels.
2026-05-18 Takis four-brand-cycle (brand-spec drift). Labelled negative against any brand-voice SKILL.md.
2026-05-20 mymentor carousel (wrong-tool hero-vs-editorial). Labelled negative against ai-image skill's tool-selection logic.
Hermes Curator (live, BossCouple pilot) does an adjacent thing heuristically — cheap aux model auto-grades + consolidates ~/.hermes/skills/. SkillOpt is the benchmark-gated engineering of what Curator does on vibes. Decision needed: Curator + SkillOpt parallel, or SkillOpt replaces Curator. Filed for a SuperSebastian/decisions.md entry.

Pair with Taste Pipeline — not duplicate

Taste Pipeline (8/9, third pass) = cold-start skill curation from references. Six steps: refs → dual-model analysis → anonymised fusion → chunked synthesis → rule-set → skill. Produces v0 SKILL.md.

SkillOpt = warm-loop skill optimisation from execution telemetry + held-out eval set. Reads v0, runs it, gates edits on benchmark. Produces vN SKILL.md.

Combined: Taste creates the skill from your refs once; SkillOpt iterates it forever from your corrections. The two playbooks should ship paired.

Pilot path

Scaffold ~/ClaudeWork/Agents/services/SuperSebastian/playbooks/skillopt-eval.md (this pass). Pilot recipe:

Subject selection. Rank candidates by labelled-failure count in failures.md. Current ranking: (a) ai-image brand-anchor wrapper: 3-4 labelled negatives on the visual-misread axis; (b) Sebastian four-line gate: 4 strikes on 2026-05-24 alone; (c) de-Claudification (Will): every "AI slop" failures.md entry is a label.
Eval set construction. Pull labelled negatives from failures.md matching the subject skill. Add positive examples (clean shipped artefacts) from the project deliverables/ folders. Minimum N=10 labelled examples per skill, 70/30 train/held-out split.
Substrate. Claude Sonnet as base, OpenAI GPT-5.5 as optimiser. Same dual-frontier debias shape the Taste Pipeline uses. Allowlist api.openai.com already live. OpenAI key still pending capture in credentials.md — same blocker as the Taste Pipeline pilot (third pass §Pilot status). Unblock once, both pilots clear.
Cost envelope. Optimiser loop runs roughly $1-5 per iteration; convergence in 5-15 iterations per pilot skill puts a run at $5-75. That stays inside the $10/day gate on a single-skill run only if iterations are paced across days (the top of the range is $75). Multi-skill batch needs day-budgeting.
Acceptance criteria. Pilot succeeds if SkillOpt-optimised SKILL.md beats the current hand-edited version by ≥10pp on the held-out eval set. Lower than 10pp means the discipline already in place is doing most of the work — keep SkillOpt as a quarterly tune, not a continuous loop.
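The numbers in the recipe above can be checked with a short sketch. It assumes a plain shuffled 70/30 split, a flat per-iteration cost, and held-out accuracy counted as correct examples out of the held-out total. The function names (`split_eval`, `cost_envelope`, `days_at_gate`, `pilot_verdict`) are hypothetical, not part of SkillOpt or any existing playbook.

```python
import math
import random
from typing import List, Sequence, Tuple


def split_eval(examples: Sequence[str], held_out_frac: float = 0.3,
               seed: int = 0, min_n: int = 10) -> Tuple[List[str], List[str]]:
    """Shuffle labelled examples and return (train, held_out)."""
    if len(examples) < min_n:
        raise ValueError(f"need at least {min_n} labelled examples, got {len(examples)}")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_held = max(1, round(len(shuffled) * held_out_frac))
    return shuffled[n_held:], shuffled[:n_held]


def cost_envelope(cost_per_iter=(1.0, 5.0), iters=(5, 15)) -> Tuple[float, float]:
    """(low, high) total cost in dollars for one pilot skill."""
    return cost_per_iter[0] * iters[0], cost_per_iter[1] * iters[1]


def days_at_gate(total_cost: float, daily_gate: float = 10.0) -> int:
    """Days needed to spend total_cost without exceeding the daily gate."""
    return math.ceil(total_cost / daily_gate)


def pilot_verdict(baseline_correct: int, optimised_correct: int,
                  n_held_out: int, threshold_pp: float = 10.0) -> Tuple[float, str]:
    """Lift in percentage points on the held-out set, plus the resulting move."""
    lift_pp = 100 * (optimised_correct - baseline_correct) / n_held_out
    move = "pilot succeeds" if lift_pp >= threshold_pp else "quarterly tune"
    return lift_pp, move


examples = [f"case-{i}" for i in range(10)]  # placeholder labels, not real failures.md entries
train, held_out = split_eval(examples)
low, high = cost_envelope()
print(len(train), len(held_out))                 # 7 3
print(low, high)                                 # 5.0 75.0
print(days_at_gate(low), days_at_gate(high))     # 1 8
print(pilot_verdict(2, 3, len(held_out))[1])     # pilot succeeds
```

At the minimum N=10 the held-out set holds three examples, so one flipped example moves the score by about 33pp. A larger eval set would be needed before the ≥10pp threshold distinguishes a real lift from a single lucky case.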

Re-score triggers

(a) First pilot run on a single SKILL.md (recommended subject: ai-image) lifts brand-anchor correctness ≥10pp vs hand-curated baseline → P promotes to 3, total 8/9.
(b) First paid client engagement scopes "self-improving skill" as a discrete line item (Marketing Engine Pilot add-on, Hermes deployment with continuous-tune) → S promotes to 3, total 8/9, or 9/9 if trigger (a) has already promoted P.
(c) Hermes Curator gets replaced by SkillOpt in the BossCouple pilot loop → confirms substrate fit. No score change, but locks Taste + SkillOpt + Curator-as-SkillOpt as the canonical skill-lifecycle stack.
(d) DSPy or GEPA ships an equivalent benchmark-gated text-space loop with looser setup tax → re-evaluate which optimiser becomes Articulate default.
(e) Failure mode discovered — SkillOpt-optimised SKILL.md beats benchmark but ships subtly off-brand voice or violates a hard rule the eval set didn't capture. Belinda + Will become the standing gate alongside the benchmark. Same shape as the four-line SEB-gate: compute is necessary, not sufficient.
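The score arithmetic behind triggers (a) and (b) can be sketched as below, assuming the total is simply P + Q + S on a 9-point scale, as the table row implies. The `Score` class is a hypothetical helper for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Score:
    p: int
    q: int
    s: int

    @property
    def total(self) -> int:
        return self.p + self.q + self.s


current = Score(p=2, q=3, s=2)            # this pass: 7/9
after_a = replace(current, p=3)           # trigger (a): P promotes to 3
after_b = replace(current, s=3)           # trigger (b): S promotes to 3
after_both = replace(current, p=3, s=3)   # both triggers fired

for label, score in [("now", current), ("(a)", after_a),
                     ("(b)", after_b), ("(a)+(b)", after_both)]:
    print(f"{label}: {score.total}/9")
```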