
Summary — 2026-05-28

Fourth pass — 2026-05-28 (SkillOpt — text-space skill optimiser)

One item this pass. Surfaced by Anthony from a viral social post claiming "you can now train agent skills the exact same way you train AI models." Verified upstream as Microsoft Research, arxiv 2605.23904 — SkillOpt: Executive Strategy for Self-Evolving Agent Skills. Text-space optimiser that treats a single SKILL.md as the trainable artefact, runs a base/optimiser loop with held-out benchmark gating on accept/reject. Same family as DSPy / GEPA / TextGrad; better-engineered for the SKILL.md substrate Anthony already runs.

| Item | Source | P | Q | S | Total | Move |
|---|---|---|---|---|---|---|
| SkillOpt: text-space SKILL.md optimiser with benchmark-gated edits (Microsoft Research, MIT-licensed) | Anthony viral-post forward 2026-05-28; verified via firecrawl_search → arxiv 2605.23904 + microsoft.github.io/SkillOpt | 2 | 3 | 2 | 7/9 | Act on. Scaffold pilot playbook at SuperSebastian/playbooks/skillopt-eval.md. Pair with Taste Pipeline (cold-start) as the warm-loop counterpart. Pilot target: the SKILL.md with the most labelled negatives in failures.md (current candidate: ai-image brand-anchor wrapper). |

What the paper actually does

Why P=2 (not 3)

Anthony hand-edits SKILL.md files constantly, so the time savings on offer are real. But: SkillOpt needs (a) a base+optimiser wiring per skill, (b) a held-out eval set with labelled ground-truth, (c) compute budget on the optimiser side. Setup tax is real. Not every SKILL.md has scoreable output: pm:pm outputs a status report, not a benchmarkable answer; deploy outputs a live URL, which is hard to grade. The skills that score well are the ones where wrong output has shape: ai-image (brand-anchor mismatch), de-Claudification (Will's slop scorecard), Sebastian four-line gate (rule-class miss). Promotes to P=3 once the first pilot run shows measurable time savings on at least one of these.

Why Q=3

Three structural quality moves we don't currently run anywhere:

Direct mitigation of the same failure class the Taste Pipeline 8/9 addresses — single-pass, vibe-based skill curation drifts. SkillOpt is the warm-loop counterpart: Taste seeds the skill from references; SkillOpt iterates it from execution telemetry.

Why S=2 (not 3)

Real Articulate product surface: "your brand-voice agent / mentor-in-pocket gets measurably better every week from your own corrections." Slots inside MEP brand-voice work and the Hermes Agent deployments — BossCouple pilot's Curator could be replaced or augmented with SkillOpt's benchmark-gated edits. Marketing Psychology Toolbox skills, Roche-Debbie mentor SKILL.md files, the per-client brand-voice skills — all candidates.

But: open-source, so the moat is not the optimiser. Moat is (a) the curated skill, (b) the labelled-failures ground-truth, (c) Anthony's taste defining what "better" means. S=3 only when a client signs a contract scoping "self-improving skill" as a discrete line item.

Class-match against failures.md

Pair with Taste Pipeline — not duplicate

Taste Pipeline (8/9, third pass) = cold-start skill curation from references. Six steps: refs → dual-model analysis → anonymised fusion → chunked synthesis → rule-set → skill. Produces v0 SKILL.md.

SkillOpt = warm-loop skill optimisation from execution telemetry + held-out eval set. Reads v0, runs it, gates edits on benchmark. Produces vN SKILL.md.

Combined: Taste creates the skill from your refs once; SkillOpt iterates it forever from your corrections. The two playbooks should ship paired.
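
A minimal sketch of the warm loop described above, assuming the optimiser can be modelled as a function that proposes an edited skill text and the benchmark as a function that scores a skill text against labelled examples. Every name here (`gated_optimise`, `toy_propose`, `toy_score`, the sample rules) is hypothetical and chosen for illustration; none of it is SkillOpt's actual API. It only shows the accept/reject gate: an edit is kept when it strictly improves the held-out score.

```python
from typing import Callable, Sequence

# Hypothetical shape for a labelled example: (input id, rule the output must satisfy).
Example = tuple[str, str]


def gated_optimise(
    skill_v0: str,
    train: Sequence[Example],
    held_out: Sequence[Example],
    propose: Callable[[str, Sequence[Example]], str],
    score: Callable[[str, Sequence[Example]], float],
    iterations: int = 10,
) -> tuple[str, list[float]]:
    """Keep a proposed edit only if it strictly beats the held-out score."""
    best_text = skill_v0
    best_score = score(best_text, held_out)
    history = [best_score]
    for _ in range(iterations):
        candidate = propose(best_text, train)      # optimiser sees train data only
        cand_score = score(candidate, held_out)    # gate uses held-out data only
        if cand_score > best_score:                # accept
            best_text, best_score = candidate, cand_score
        history.append(best_score)                 # rejected edits leave the score flat
    return best_text, history


def toy_score(skill_text: str, examples: Sequence[Example]) -> float:
    # Toy benchmark: fraction of examples whose rule is mentioned in the skill.
    return sum(1 for _, rule in examples if rule in skill_text) / len(examples)


def toy_propose(skill_text: str, train: Sequence[Example]) -> str:
    # Toy optimiser: append the first training rule the skill does not mention yet.
    for _, rule in train:
        if rule not in skill_text:
            return skill_text + "\n- " + rule
    return skill_text


train = [("img1", "use brand palette"), ("img2", "no stock faces")]
held_out = [("img3", "use brand palette"), ("img4", "logo top-left")]

final_text, history = gated_optimise(
    "# ai-image skill", train, held_out, toy_propose, toy_score, iterations=3
)
print(history)  # [0.0, 0.5, 0.5, 0.5]
```

In the toy run the "use brand palette" edit is accepted because it lifts the held-out score, while the "no stock faces" edit fits the training examples but is rejected because the held-out score does not move.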

Pilot path

Scaffold ~/ClaudeWork/Agents/services/SuperSebastian/playbooks/skillopt-eval.md (this pass). Pilot recipe:

  1. Subject selection. Rank candidates by labelled-failure count in failures.md. Current ranking: (a) ai-image brand-anchor wrapper: 3-4 labelled negatives in the visual-misread axis; (b) Sebastian four-line gate: 4 strikes on 2026-05-24 alone; (c) de-Claudification (Will): every "AI slop" failures.md entry is a label.
  2. Eval set construction. Pull labelled negatives from failures.md matching the subject skill. Add positive examples (clean shipped artefacts) from the project deliverables/ folders. Minimum N=10 labelled examples per skill, 70/30 train/held-out split.
  3. Substrate. Claude Sonnet as base, OpenAI GPT-5.5 as optimiser. Same dual-frontier debias shape the Taste Pipeline uses. Allowlist api.openai.com already live. OpenAI key still pending capture in credentials.md — same blocker as the Taste Pipeline pilot (third pass §Pilot status). Unblock once, both pilots clear.
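  4. Cost envelope. Optimiser loop runs roughly $1-5 per iteration; convergence in 5-15 iterations per pilot skill gives $5-75 per skill. A single-skill run fits the $10/day gate in one day only at the low end; at $5 per iteration the gate allows two iterations a day, so a full run has to be spread over several days. Multi-skill batch needs day-budgeting.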
  5. Acceptance criteria. Pilot succeeds if SkillOpt-optimised SKILL.md beats the current hand-edited version by ≥10pp on the held-out eval set. Lower than 10pp means the discipline already in place is doing most of the work — keep SkillOpt as a quarterly tune, not a continuous loop.
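
The numbers in steps 2, 4 and 5 can be checked with a short sketch. The function names and default arguments below are assumptions made for illustration; only the figures (N=10 minimum, 70/30 split, $1-5 per iteration, 5-15 iterations, $10/day gate, 10pp threshold) come from the recipe above.

```python
import math
import random


def split_eval_set(examples, train_fraction=0.7, min_n=10, seed=0):
    """Shuffle labelled examples and split them into train and held-out sets."""
    if len(examples) < min_n:
        raise ValueError(f"need at least {min_n} labelled examples, got {len(examples)}")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def cost_envelope(cost_per_iter=(1.0, 5.0), iterations=(5, 15), daily_gate=10.0):
    """Best and worst case spend, and the days needed to stay under the daily gate."""
    def days(cost, n_iters):
        iters_per_day = int(daily_gate // cost)  # whole iterations that fit in one day
        return math.ceil(n_iters / iters_per_day)

    low = cost_per_iter[0] * iterations[0]
    high = cost_per_iter[1] * iterations[1]
    return {
        "low": low,
        "high": high,
        "days_low": days(cost_per_iter[0], iterations[0]),
        "days_high": days(cost_per_iter[1], iterations[1]),
    }


def pilot_verdict(baseline_acc, optimised_acc, threshold_pp=10.0):
    """Apply the acceptance rule to two held-out accuracies given as fractions."""
    gain_pp = round((optimised_acc - baseline_acc) * 100, 6)  # rounding absorbs float noise
    return "pilot succeeds" if gain_pp >= threshold_pp else "quarterly tune"


train, held_out = split_eval_set(list(range(10)))
print(len(train), len(held_out))   # 7 3
print(cost_envelope())             # {'low': 5.0, 'high': 75.0, 'days_low': 1, 'days_high': 8}
print(pilot_verdict(0.50, 0.75))   # pilot succeeds
print(pilot_verdict(0.60, 0.65))   # quarterly tune
```

At the minimum N=10 the held-out set holds only 3 examples, so a single example moves the held-out score by about 33pp; a larger eval set makes the 10pp threshold easier to read.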

Re-score triggers