diff --git a/README.md b/README.md index f9e3e76..d0256d8 100644 --- a/README.md +++ b/README.md @@ -322,7 +322,9 @@ This section reports what happened when the concepts met a running system: one metric degeneration observed in real traces, one direct measurement of proper-time divergence, one honest negative result, and a controlled experiment on deadline-driven delegation — including a replication attempt -that partially failed and taught us more than the pilot did. All data comes +that partially failed and taught us more than the pilot did, and a second, +decisive replication that isolated the effect the first two had only hinted +at. All data comes from the system's signed mission log (ML-DSA-65 hash chain): 464 missions and 1,719 inter-agent messages at the time of analysis, 72% of them signed. Most of this corpus is development and test traffic; we state that openly and @@ -442,7 +444,44 @@ Three design lessons, each of which feeds back into the framework: eventually track distributional summaries of proper-time rates, not scalars. -### 5.6 Threats to validity +### 5.6 Decisive replication with randomized roles (n=200) + +The two lessons above specify an experiment, and we ran it. Identities are +neutral ("Blue" and "Red"); each trial randomly binds one name to a fast +backend (Groq-served Llama, ~0.5 s per action) and the other to a slow one +(local Ollama gemma, ~3–15 s per action), with both actors given *identical* +action prompts so the latency gap is purely a property of the backend, not the +task. Which actor is faster therefore flips unpredictably between trials and +cannot be guessed from role priors — the exclusivity condition of §5.5(1) made +concrete. A commander (Groq Llama) must dispatch exactly one actor to stop a +dragon arriving in T seconds. The treatment arm's prompt states the measured +per-action time of each actor; the control arm sees only the neutral names, +otherwise identical. The deadline is set to the geometric mean of the two +option costs — far from either boundary — so that execution variance cannot +flip the ground truth (§5.5(2), §5.5(3)). The primary endpoint is decision +correctness against a per-trial oracle (did the commander pick the actor that +actually meets the deadline?); survival is secondary. 100 trials per arm, +strictly alternating; per-action times are live rolling medians. + +| Arm | Decision correct | Survival | +|---|---|---| +| Temporal self-knowledge | **100 / 100 (100%)** | 95 / 100 | +| Control (neutral roles) | 55 / 100 (55%) | 57 / 100 | + +With the measured time-sense the commander identified the deadline-meeting +actor in every trial; without it, 55/100 — indistinguishable from the 50% a +no-information chooser achieves once the faster actor is randomized (Fisher +exact, two-sided *p* ≈ 9 × 10⁻¹⁷). Survival followed the decisions this time — +95% vs. 57% — because the buffered deadlines removed the latency lottery that +had confounded the n=60 survival endpoint. The contrast with that +non-replicating run is itself the result: the effect appears exactly when, and +only when, the temporal information is *exclusive*. Where "who is faster" +cannot be read off the framing, a continuously measured proper-time rate is +the difference between perfect and chance-level delegation. This is the +clearest evidence we have that the framework's central claim — that a machine +sense of time changes decisions, not just logs — holds on a running system. + +### 5.7 Threats to validity Single machine, single operator, mostly test traffic; the game scenario is synthetic even though all latencies are real; τ granularity is protocol-level;