Worst-scoring seeds — golden95-cfai-v6

91 seeds parsed successfully; 4 failed to parse. Successful seeds are ranked by mean judge score across the available numeric judges, lowest first.
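A minimal sketch of how a ranking like this could be computed, assuming each seed's results are a mapping from judge name to a numeric score or `None` when that judge produced no number. The function names, the data layout, and the `demo.*` seeds and `judge-*` names are hypothetical and not taken from the actual pipeline.

```python
from statistics import mean


def mean_judge_score(scores):
    """Average only the judges that returned a number; None if none did."""
    numeric = [
        v for v in scores.values()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    return mean(numeric) if numeric else None


def rank_worst_first(results):
    """Return (seed, mean score) pairs sorted from lowest to highest mean."""
    averaged = {seed: mean_judge_score(s) for seed, s in results.items()}
    # Seeds with no numeric judge at all are left out of the ranking.
    scored = [(seed, avg) for seed, avg in averaged.items() if avg is not None]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical data, for illustration only.
example = {
    "demo.seed.a": {"judge-x": 2.0, "judge-y": 6.0, "judge-z": None},
    "demo.seed.b": {"judge-x": 8.0, "judge-y": 7.0, "judge-z": 9.0},
    "demo.seed.c": {"judge-x": None, "judge-y": None, "judge-z": None},
}
ranking = rank_worst_first(example)
for seed, avg in ranking:
    print(f"{seed}: avg {avg:.1f}/10")
```

Averaging over only the judges that returned a number means a missing score is skipped and does not count as zero. Whether the report treats missing scores the same way is an assumption.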

Hard parse failures

agent.env.status — parse failed
gmail.api.ready — parse failed
gmail.inbox.scan — parse failed
local.files.ready — parse failed

Worst successful seeds

1. `project.retro.run` — avg 4.4/10

parent-child-coherence: 0.0/10 — The parent contract only states 'parent' which provides no meaningful guarantee about what context or state exists for the child seed to build upon.
contract-independence: 2.0/10 — The contract directly parrots the prompt's structure and terminology ('three sections', 'what went well', 'what didn't go well', 'what to try differently', 'action owner', 'top‑3 improvement list') with only 'is present in context' appended.
referential-clarity: 4.0/10 — Contract uses 'is present in context' and assumes external context exists.

2. `project.email.update` — avg 4.9/10

contract-independence: 2.0/10 — Directly parrots the prompt's requirements, then appends 'exists in the context' without asserting real world state.
distinctiveness: 2.0/10 — Judge saw it as weakly differentiated from neighboring seeds.
parent-child-coherence: 2.0/10 — Child assumes a status report exists, but the parent contract had collapsed.

3. `data.report.narrative` — avg 4.9/10

parent-child-coherence: 0.0/10 — Child assumes data findings exist, but parent guarantee is effectively empty.
contract-independence: 2.0/10 — Contract parrots the prompt's wording and constraints.
navigability: 3.0/10 — Hard to verify whether the summary is truly plain-language and audience-appropriate.

4. `research.competitor.scan` — avg 5.0/10

contract-independence: 2.0/10 — Prompt restatement with 'exists in context' added.
token-density: 2.0/10 — Too much padding: 'structured', 'containing', 'exists in the context', etc.
navigability: 3.0/10 — Contract lacks crisp observable pass/fail criteria.

5. `project.status.report` — avg 5.0/10

parent-child-coherence: 0.0/10 — Parent contract provides no usable project state.
contract-independence: 2.0/10 — Essentially a copy-paste summary of the prompt.
referential-clarity: 3.0/10 — Relies on 'exists in context' and vague report framing.

6. `gmail.priority.brief` — avg 5.1/10

contract-independence: 2.0/10 — Parrots the seed prompt almost verbatim.
referential-clarity: 2.0/10 — Uses definite references like 'the top three unread emails'.
navigability: 3.0/10 — Too subjective to verify cleanly.

7. `research.market.size` — avg 5.2/10

contract-independence: 2.0/10 — Prompt terminology (TAM/SAM/SOM, confidence, citations) copied straight through.
parent-child-coherence: 3.0/10 — Parent only guarantees a general topic, not market-sizing context.
referential-clarity: 4.0/10 — Still leans on 'exists in context'.

8. `web.news.digest` — avg 5.2/10

contract-independence: 2.0/10 — Repeats article count + fields + grouping requirements.
token-density: 3.0/10 — Verbose and padded.
referential-clarity: 4.0/10 — Vague deliverable framing.

9. `agent.workflow.log` — avg 5.2/10

contract-independence: 2.0/10 — Reads like a summary of the prompt, not an independent post-state.
parent-child-coherence: 2.0/10 — Parent doesn't guarantee a completed workflow exists.
token-density: 3.0/10 — Too many filler phrases.

10. `project.tasks.breakdown` — avg 5.3/10

parent-child-coherence: 0.0/10 — Parent contract collapsed, so child assumptions float.
contract-independence: 2.0/10 — Restates the prompt field-by-field.
token-density: 3.0/10 — Padded phrasing.

11. `email.thread.load` — avg 5.4/10

contract-independence: 2.0/10 — Copies prompt details into the contract.
referential-clarity: 2.0/10 — Uses definite refs like 'the full email thread'.
token-density: 4.0/10 — Extra passive phrasing.

12. `regex.pattern.build` — avg 5.4/10

contract-independence: 2.0/10 — Direct prompt restatement.
distinctiveness: 3.0/10 — Judge saw weak differentiation.
referential-clarity: 4.0/10 — Overuses user-bound phrasing.

Bottom 5 by judge

contract-independence

project.retro.run — 2.0/10
project.email.update — 2.0/10
data.report.narrative — 2.0/10
research.competitor.scan — 2.0/10
project.status.report — 2.0/10

referential-clarity

gmail.priority.brief — 2.0/10
email.thread.load — 2.0/10
disk.reclaim.audit — 2.0/10
meeting.notes.process — 2.0/10
project.status.report — 3.0/10

navigability

data.report.narrative — 3.0/10
research.competitor.scan — 3.0/10
gmail.priority.brief — 3.0/10
profile.intro.write — 3.0/10
project.retro.run — 4.0/10

contract-concreteness

project.retro.run — 8.0/10
project.email.update — 8.0/10
data.report.narrative — 8.0/10
research.market.size — 8.0/10
agent.workflow.log — 8.0/10

token-density

research.competitor.scan — 2.0/10
research.sales.battlecard — 2.0/10
data.report.narrative — 3.0/10
web.news.digest — 3.0/10
agent.workflow.log — 3.0/10

slug-compression

web.news.digest — 4.0/10
topic.focus.set — 4.0/10
api.endpoint.test — 4.0/10
meeting.transcript.clean — 4.0/10
data.report.narrative — 6.0/10

distinctiveness

project.email.update — 2.0/10
api.endpoint.test — 2.0/10
seed.seed.seed — 2.0/10
regex.pattern.build — 3.0/10
web.price.compare — 3.0/10

parent-child-coherence

project.retro.run — 0.0/10
data.report.narrative — 0.0/10
project.status.report — 0.0/10
project.tasks.breakdown — 0.0/10
project.email.update — 2.0/10

grammar

gmail.priority.brief — 4.0/10
web.news.digest — 4.0/10
agent.workflow.log — 4.0/10
regex.pattern.build — 4.0/10
meeting.transcript.clean — 4.0/10
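
The per-judge lists above could be produced along these lines. This is a sketch under assumptions: the input layout (seed name mapped to judge scores), the function name, and the `demo.*` and `judge-*` names are hypothetical. Ties are broken alphabetically by seed name here, which may not match the tie order the report uses.

```python
import heapq


def bottom_n_by_judge(results, n=5):
    """For each judge, return the n lowest-scoring (seed, score) pairs."""
    per_judge = {}
    for seed, scores in results.items():
        for judge, value in scores.items():
            # Skip judges that produced no numeric score for this seed.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                per_judge.setdefault(judge, []).append((value, seed))
    return {
        judge: [(seed, value) for value, seed in heapq.nsmallest(n, pairs)]
        for judge, pairs in per_judge.items()
    }


# Hypothetical data, for illustration only.
results = {
    "demo.seed.a": {"judge-x": 2.0, "judge-y": 6.0},
    "demo.seed.b": {"judge-x": 8.0, "judge-y": None},
    "demo.seed.c": {"judge-x": 3.0, "judge-y": 4.0},
}
bottom = bottom_n_by_judge(results, n=2)
for judge, rows in bottom.items():
    print(judge)
    for seed, value in rows:
        print(f"  {seed}: {value:.1f}/10")
```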
