Guess Selection & Evaluation
Input: $ARGUMENTS
Interpretations
Before executing, identify which interpretation matches the user’s input:
Interpretation 1 — Filter guesses from /gg output: The user has a large set of guesses (from /gg or similar) and wants to systematically evaluate which are worth pursuing, which are wrong, and which are critical.

Interpretation 2 — Select best options from a list: The user has enumerated options and wants to select the best one(s) based on criteria. Default: select the top 20 unless the user specifies a different number.

Interpretation 3 — Triage a backlog: The user has many items and wants to sort them into act-on / defer / eliminate buckets. Default: surface the top 20 unless the user specifies a different number.
If ambiguous, ask: “I can help with filtering guesses from an analysis, selecting the best options from a list, or triaging a backlog — which fits?” If clear from context, proceed with the matching interpretation.
Depth Scaling
Default: 2x. Parse depth from $ARGUMENTS if specified (e.g., “/selection 4x [input]”).
| Depth | Min Guesses Evaluated | Min ARAW Depth per Critical | Min Selection Criteria | Min Elimination Justifications | Pairwise Finals |
|---|---|---|---|---|---|
| 1x | 20 | Quick (1 AR + 1 AW) | 3 | 1-line | No |
| 2x | 50 | Standard (2 AR + 2 AW) | 5 | 2-3 lines | No |
| 4x | 100 | Deep (3 AR + 3 AW) | 7 | Paragraph | No |
| 8x | 200 | Full ARAW | 10 | Full argument | Yes (top 15) |
| 16x | All | Full ARAW + subagent | 12 | Full argument + evidence | Yes (top 15) |
| 32x | All | Full ARAW + partial right | 12 | Full argument + evidence | Yes (top 15) |
The Process
Step 1: Inventory
List all guesses/options being evaluated. Group by source dimension if from /gg output.
INVENTORY: [N] guesses to evaluate
Source: [/gg output, brainstorm, enumeration, etc.]
Groups:
- [Group 1]: [N] guesses
- [Group 2]: [N] guesses
...
Step 2: Knockout Filter
Before any evaluation, apply two knockout questions to every guess:
- Action divergence: “If this guess is wrong, does anyone’s action change?” If NO → knockout strike.
- Reachability: “Can this be investigated or acted on within the relevant timeframe?” If NO → knockout strike.
Guesses that fail BOTH knockouts are moved to a DEFERRED list (not eliminated — they may become relevant later) and excluded from further evaluation. Guesses that fail one knockout are flagged but proceed.
KNOCKOUT FILTER:
Passed: [N] guesses proceed
Deferred: [N] guesses (fail both knockouts)
Flagged: [N] guesses (fail one knockout — proceed with flag)
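The knockout logic can be sketched in Python. This is a minimal illustration; the `Guess` fields and function names are assumptions for the sketch, not part of any real API:

```python
from dataclasses import dataclass, field

@dataclass
class Guess:
    text: str
    action_diverges: bool   # would anyone's action change if this guess is wrong?
    reachable: bool         # can it be investigated within the relevant timeframe?
    flags: list = field(default_factory=list)

def knockout_filter(guesses):
    """Defer guesses that fail BOTH knockouts; flag single failures but keep them."""
    passed, deferred = [], []
    for g in guesses:
        fails = [not g.action_diverges, not g.reachable]
        if all(fails):
            deferred.append(g)              # fail both -> DEFER
        else:
            if any(fails):
                g.flags.append("knockout")  # fail one -> flagged, still proceeds
            passed.append(g)
    return passed, deferred
```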
Step 3: Cluster by Derivation
If guesses have derivation tags (e.g., [D: AGENT], [D: SCAMPER-S]), group them by source. If no tags, cluster by theme.
CLUSTERS:
- [Cluster 1]: [N] guesses — [description]
- [Cluster 2]: [N] guesses — [description]
...
The final selection MUST include at least 1 item from each significant cluster (clusters with 3+ guesses). This prevents selecting 10 items from the same theme while ignoring others.
Step 4: Define Selection Criteria
Before evaluating, establish what “good” means:
| Criterion | Weight | Description |
|---|---|---|
| Actionability | HIGH | Can this be acted on? Is there a concrete next step? |
| Impact | HIGH | If true/chosen, how much does it change the outcome? |
| Testability | MED | Can this be verified or falsified? |
| Novelty | MED | Does this add information beyond what’s already known? |
| Independence | MED | Is this distinct from other guesses, or redundant? |
| Downstream dependencies | MED | Does resolving this guess unblock or clarify other guesses? |
| Confidence | LOW | How likely is this to be correct? (Low weight because low-confidence high-impact items are valuable) |
Add domain-specific criteria as needed. After rapid triage, check which criteria show low variance across CRITICAL/STRONG guesses (e.g., everything scores 4-5). Report non-differentiating criteria: they are included but did not affect rankings.
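The low-variance check can be sketched as follows. The score layout (`{guess: {criterion: 1-5}}`) is an assumption for illustration:

```python
def non_differentiating(scores, spread=1):
    """Return criteria whose scores barely vary across the evaluated guesses.

    scores: {guess_name: {criterion: score 1-5}}. A criterion whose
    max - min is within `spread` did not meaningfully affect the ranking.
    """
    criteria = next(iter(scores.values())).keys()
    flat = []
    for c in criteria:
        vals = [per_guess[c] for per_guess in scores.values()]
        if max(vals) - min(vals) <= spread:
            flat.append(c)
    return flat
```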
Polarity Balance Check (if guesses have [DIR:], [USE:], [ORIENT:] tags)
Before proceeding to triage, count the polarity distribution of all guesses entering selection. Then enforce these balance rules during triage and final selection:
Direction balance: Skeptical guesses naturally score higher on impact-if-wrong (because “maybe everything is wrong” always has HIGH divergence). Correct for this by evaluating CONSTRUCTIVE and SKEPTICAL guesses in SEPARATE pools, then merging. Final selection must be ≥40% CONSTRUCTIVE when the input is Interpretation 2 (possibilities for an unknown).
Purpose balance: Final selection MUST include ≥3 OUTPUT or BOTH guesses. A selection of all THINKING guesses produces insight without action. If your top 20 is all “should we even be doing this?” and no “here’s what to build,” the selection failed.
Orientation balance: Final selection must include at least 1 EXTEND, 1 REDIRECT, and 1 NEUTRAL guess. Over-indexing any orientation is a sign of bias, not rigor.
The core failure to avoid: Skeptical/meta guesses feel intellectually superior. “This whole project might be overengineered” feels like a deeper insight than “consolidate these 5 files into one.” But when someone asks “what should I do next,” the constructive guess IS the answer and the skeptical guess is context. Select for answers first, context second.
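The separate-pool merge with a constructive floor can be sketched like this. The 40% floor comes from the rule above; the pool representation is an assumption:

```python
import math

def merge_with_floor(constructive, skeptical, k, floor=0.4):
    """Pick k items so at least ceil(floor * k) come from the constructive pool.

    Both pools are lists of (score, guess), highest score first.
    """
    min_c = math.ceil(k * floor)
    picks = constructive[:min_c]                  # guaranteed constructive slots
    rest = sorted(constructive[min_c:] + skeptical,
                  key=lambda pair: pair[0], reverse=True)
    picks += rest[: k - len(picks)]               # fill the remainder on merit
    return picks
```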
Step 5: Rapid Triage (All Guesses)
Sort every guess into one of four buckets:
| Bucket | Symbol | Meaning | Action |
|---|---|---|---|
| CRITICAL | ★ | High impact, must evaluate deeply | Full ARAW in Step 6 |
| STRONG | ✓ | Likely true/useful, worth keeping | Brief justification |
| WEAK | ~ | Low impact or likely wrong | Note why, set aside |
| ELIMINATE | ✗ | Redundant, contradicted, or irrelevant | Justify elimination |
RAPID TRIAGE:
★ CRITICAL ([N]):
- [Guess]: [1-line reason it's critical]
...
✓ STRONG ([N]):
- [Guess]: [1-line reason it's strong]
...
~ WEAK ([N]):
- [Guess]: [1-line reason it's weak]
...
✗ ELIMINATE ([N]):
- [Guess]: [1-line reason to eliminate]
...
Step 6: Deep Evaluation (CRITICAL Guesses Only)
For each CRITICAL guess, run a compressed ARAW:
GUESS: [statement]
ASSUME RIGHT (what follows if this is true/correct):
- AR1: [implication]
- AR2: [implication]
- AR3: [what you'd build/do differently]
ASSUME WRONG (what follows if this is false/incorrect):
- AW1: [implication]
- AW2: [implication]
- AW3: [what you'd build/do differently]
DIVERGENCE: [How different are the AR vs AW paths?]
- HIGH: Completely different strategies → This is a true crux, must resolve
- MED: Different approaches, same general direction → Important but not blocking
- LOW: Minor adjustments → Demote from CRITICAL to STRONG
RESOLUTION PATH: [How to determine which is true]
- [Test, experiment, question to ask, evidence to gather]
Step 7: Dependency Analysis
Check if any CRITICAL guesses depend on others:
DEPENDENCIES:
- [Guess A] depends on [Guess B]: [relationship]
- [Guess C] and [Guess D] are mutually exclusive
- [Guess E] is prerequisite for [Guess F, G, H]
RESOLUTION ORDER:
1. Resolve [Guess B] first (most dependencies downstream)
2. Then [Guess A]
3. [Guess C vs D] can be resolved independently
...
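Ordering cruxes by downstream count can be sketched as a one-liner. The dependency mapping here is hypothetical:

```python
def resolution_order(downstream):
    """Resolve the guess with the most downstream dependents first.

    downstream: {guess: [guesses that depend on it]}.
    """
    return sorted(downstream, key=lambda g: len(downstream[g]), reverse=True)
```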
Step 8: Selection Matrix
For CRITICAL and STRONG guesses, score against criteria.
At depth ≤ 4x: Use the standard scoring matrix (1-5 per criterion, sum for total).
At depth ≥ 8x: After matrix scoring, use pairwise comparison for items within 3 points of the cutoff. For each borderline pair, ask: “Given the specific purpose of this selection, is A or B more important?” Pairwise comparison is more reliable than absolute scoring for close items.
SELECTION MATRIX:
| Guess | Actionability | Impact | Testability | Novelty | Independence | Deps | TOTAL | RANK |
|-------|---------------|--------|-------------|---------|--------------|------|-------|------|
| [G1] | 5 | 5 | 3 | 4 | 5 | 3 | 25 | 1 |
| [G2] | 4 | 5 | 4 | 3 | 4 | 2 | 22 | 2 |
...
Stability check: After scoring, identify items where a 1-point change on ANY criterion would move them in/out of the selection. Mark these as BORDERLINE.
STABILITY:
Stable selections (rank holds under ±1 perturbation): [list]
Borderline (effectively tied — rank is fragile): [list]
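The stability check can be sketched as a perturbation test on totals: since a ±1 change on one criterion shifts a total by at most 1, any item within 1 point of the cutoff score is rank-fragile. The data layout is an assumption for illustration:

```python
def stability_check(totals, k, margin=1):
    """Split items into stable vs borderline around the selection cutoff.

    totals: {guess: total score}; k: number of items selected.
    "Stable" means in/out status holds under a one-point perturbation,
    whether the item sits inside or outside the selection.
    """
    ranked = sorted(totals, key=totals.get, reverse=True)
    cutoff = totals[ranked[k - 1]]
    stable, borderline = [], []
    for g in ranked:
        (borderline if abs(totals[g] - cutoff) <= margin else stable).append(g)
    return stable, borderline
```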
Step 9: Final Selection
Default to selecting the top 20 items across all tiers unless the user specifies a different number (e.g., “best 10”, “top 5”). Distribute the selected items across tiers by quality; do not force an even split.
Cluster coverage check: Before finalizing, verify that each significant cluster (from Step 3) has at least 1 representative in the selection. If a cluster is missing, swap in its highest-scoring member for the lowest-scoring redundant item from an over-represented cluster.
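The coverage swap can be sketched as follows. The 3-guess significance threshold comes from Step 3; the cluster mapping and scoring function are illustrative, and the sketch assumes some cluster is over-represented whenever a swap is needed:

```python
from collections import Counter

def ensure_cluster_coverage(selected, pool, cluster_of, score, min_size=3):
    """Swap in the best member of each missing significant cluster."""
    sizes = Counter(cluster_of[g] for g in pool)
    significant = {c for c, n in sizes.items() if n >= min_size}
    for missing in significant - {cluster_of[g] for g in selected}:
        best = max((g for g in pool if cluster_of[g] == missing),
                   key=score)                       # highest-scoring member
        counts = Counter(cluster_of[g] for g in selected)
        worst = min((g for g in selected if counts[cluster_of[g]] > 1),
                    key=score)                      # weakest redundant pick
        selected[selected.index(worst)] = best
    return selected
```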
Polarity balance check (if tags present): Count the [DIR:], [USE:], and [ORIENT:] distribution in your selected items. Enforce:
- ≥40% CONSTRUCTIVE (for Interpretation 2 inputs). If under 40%, swap in the highest-scoring constructive guesses for the lowest-scoring skeptical ones.
- ≥3 OUTPUT or BOTH items. If under 3, swap in the highest-scoring output guesses for the lowest-scoring thinking-only ones.
- ≥1 each of EXTEND, REDIRECT, NEUTRAL. If any missing, swap in.
- Report the final polarity balance in the output.
Framing: Detect whether the selected guesses are primarily hypotheses/unknowns or actions/changes. If hypotheses: frame tiers as “Test first / Test after / Monitor.” If actions: frame as “Act on immediately / Act on after / Revisit later.”
SELECTED ([N]):
TIER 1 — [Test first / Act on immediately] (max 5):
1. [Guess]: [why selected, what to do next]
2. [Guess]: [why selected, what to do next]
TIER 2 — [Test after Tier 1 / Act on after Tier 1 resolved]:
3. [Guess]: [why selected, what depends on]
4. [Guess]: [why selected, what depends on]
TIER 3 — [Monitor / Revisit later]:
5. [Guess]: [why kept, when to revisit]
ELIMINATED ([N]):
- [Guess]: [final elimination reason]
...
DEFERRED ([N]) (includes knockout-deferred from Step 2):
- [Guess]: [why deferred, trigger to revisit]
...
Output Format
## SELECTION SUMMARY
Input: [what was evaluated]
Total evaluated: [N] | Knockout-deferred: [N] | Proceeded to triage: [N]
Clusters: [N] clusters identified
Critical: [N] | Strong: [N] | Weak: [N] | Eliminated: [N]
## TIER 1 SELECTIONS [Test first / Act immediately]
[Ranked list with justifications and next actions]
## TIER 2 SELECTIONS
[Ranked list with dependencies]
## TIER 3 (MONITOR)
[Items to revisit]
## RANKING STABILITY
Stable: [items whose rank holds under ±1 perturbation]
Borderline: [items that are effectively tied — rank is fragile]
Non-differentiating criteria: [criteria that didn't vary across selections]
## CLUSTER COVERAGE
[Which clusters are represented in the selection, which are not, and why]
## POLARITY BALANCE
Direction: Constructive [N] / Skeptical [N] / Diagnostic [N]
Purpose: Output [N] / Thinking [N] / Both [N]
Orientation: Extend [N] / Redirect [N] / Neutral [N]
[Note any swaps made to meet balance requirements]
## KEY CRUXES TO RESOLVE
[CRITICAL guesses with highest divergence, in resolution order]
## ELIMINATED WITH JUSTIFICATION
[What was cut and why]
Quality Checklist
Before completing:
- Knockout filter applied — deferred guesses listed with triggers
- Guesses clustered by derivation/theme
- All guesses triaged into buckets
- CRITICAL guesses received ARAW evaluation
- Dependencies identified (with downstream count per guess)
- Selection criteria defined and applied; non-differentiating criteria noted
- Stability check performed — stable vs borderline items identified
- Cluster coverage verified — every significant cluster represented
- Tiers assigned with next actions (framed as test or act)
- Eliminations justified
- Resolution order for cruxes specified
- Polarity balance checked — ≥40% constructive, ≥3 output, all orientations represented
- If imbalanced, swaps made and documented
Next Steps
After selection:
- Use /dcp to create a decision procedure for top selections
- Use /tot to sequence actions from Tier 1
- Use /araw for deeper analysis of unresolved cruxes