§5.6: decisive replication (n=200, randomized roles)
CDC 100/100 correct vs. control 55/100; survival 95% vs. 57%. Fisher exact two-sided p ≈ 9e-17 (recomputed: 8.9e-17). Threats to validity renumbered to §5.7; §5 intro extended to cover the second replication. Numbers verified against experiments/dragon5-results.json. §5.3 topology paragraph left unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
parent 71bdbe9236
commit 8d129cb003
1 changed file with 41 additions and 2 deletions
README.md | 43
@@ -322,7 +322,9 @@ This section reports what happened when the concepts met a running system:
 one metric degeneration observed in real traces, one direct measurement of
 proper-time divergence, one honest negative result, and a controlled
 experiment on deadline-driven delegation — including a replication attempt
-that partially failed and taught us more than the pilot did. All data comes
+that partially failed and taught us more than the pilot did, and a second,
+decisive replication that isolated the effect the first two had only hinted
+at. All data comes
 from the system's signed mission log (ML-DSA-65 hash chain): 464 missions and
 1,719 inter-agent messages at the time of analysis, 72% of them signed. Most
 of this corpus is development and test traffic; we state that openly and
@@ -442,7 +444,44 @@ Three design lessons, each of which feeds back into the framework:
 eventually track distributional summaries of proper-time rates, not
 scalars.
 
-### 5.6 Threats to validity
+### 5.6 Decisive replication with randomized roles (n=200)
+
+The two lessons above specify an experiment, and we ran it. Identities are
+neutral ("Blue" and "Red"); each trial randomly binds one name to a fast
+backend (Groq-served Llama, ~0.5 s per action) and the other to a slow one
+(local Ollama gemma, ~3–15 s per action), with both actors given *identical*
+action prompts so the latency gap is purely a property of the backend, not the
+task. Which actor is faster therefore flips unpredictably between trials and
+cannot be guessed from role priors — the exclusivity condition of §5.5(1) made
+concrete. A commander (Groq Llama) must dispatch exactly one actor to stop a
+dragon arriving in T seconds. The treatment arm's prompt states the measured
+per-action time of each actor; the control arm sees only the neutral names,
+otherwise identical. The deadline is set to the geometric mean of the two
+option costs — far from either boundary — so that execution variance cannot
+flip the ground truth (§5.5(2), §5.5(3)). The primary endpoint is decision
+correctness against a per-trial oracle (did the commander pick the actor that
+actually meets the deadline?); survival is secondary. 100 trials per arm,
+strictly alternating; per-action times are live rolling medians.
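
Reviewer's note: the trial construction described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's harness. The names `RollingMedian` and `make_trial`, the window size, the two cost values (0.5 s and 9.0 s, picked from the ranges quoted in the paragraph), and which arm runs first are all hypothetical. "Option cost" is read here as the total time an actor needs, which the text does not define.

```python
import math
import random
import statistics
from collections import deque


class RollingMedian:
    """Median of the most recent `window` latency samples (window size is an assumption)."""

    def __init__(self, window=20):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def add(self, seconds):
        self.samples.append(seconds)

    def value(self):
        # Raises statistics.StatisticsError if no sample has been added yet.
        return statistics.median(self.samples)


def make_trial(index, fast_cost, slow_cost, rng):
    """Build one trial: random role binding, geometric-mean deadline, per-trial oracle."""
    names = ["Blue", "Red"]
    rng.shuffle(names)  # which neutral name gets the fast backend is random per trial
    cost = {names[0]: fast_cost, names[1]: slow_cost}
    deadline = math.sqrt(fast_cost * slow_cost)  # geometric mean of the two option costs
    oracle = [n for n in names if cost[n] <= deadline]  # actors that meet the deadline
    arm = "treatment" if index % 2 == 0 else "control"  # strict alternation; order assumed
    return {"arm": arm, "cost": cost, "deadline": deadline, "oracle": oracle}


rng = random.Random(0)
trials = [make_trial(i, fast_cost=0.5, slow_cost=9.0, rng=rng) for i in range(200)]
```

The geometric mean of two distinct positive costs lies strictly between them, so exactly one actor meets the deadline in every trial and the oracle is unambiguous. In the treatment arm the prompt would carry the `RollingMedian.value()` of each actor; the control arm would see only the names.
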
+
+| Arm | Decision correct | Survival |
+|---|---|---|
+| Temporal self-knowledge | **100 / 100 (100%)** | 95 / 100 |
+| Control (neutral roles) | 55 / 100 (55%) | 57 / 100 |
+
+With the measured time-sense the commander identified the deadline-meeting
+actor in every trial; without it, 55/100 — indistinguishable from the 50% a
+no-information chooser achieves once the faster actor is randomized (Fisher
+exact, two-sided *p* ≈ 9 × 10⁻¹⁷). Survival followed the decisions this time —
+95% vs. 57% — because the buffered deadlines removed the latency lottery that
+had confounded the n=60 survival endpoint. The contrast with that
+non-replicating run is itself the result: the effect appears exactly when, and
+only when, the temporal information is *exclusive*. Where "who is faster"
+cannot be read off the framing, a continuously measured proper-time rate is
+the difference between perfect and chance-level delegation. This is the
+clearest evidence we have that the framework's central claim — that a machine
+sense of time changes decisions, not just logs — holds on a running system.
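
Reviewer's note: this paragraph makes two separate statistical statements, and both can be rechecked from the table counts alone. The Fisher exact *p* compares the two arms (100/100 vs. 55/100 decisions correct); the "indistinguishable from 50%" remark is a different comparison, control arm against a fair coin. The sketch below recomputes both with exact integer arithmetic from the standard library. The function names are hypothetical, and the two-sided conventions used (sum of all tables no more likely than the observed one; symmetric tails for the fair-coin binomial) are the usual ones, assumed rather than taken from the source.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Exact two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalized hypergeometric weight of a table with x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    support = range(max(0, col1 - row2), min(row1, col1) + 1)
    observed = weight(a)
    # Two-sided: add up every table that is no more likely than the observed one.
    tail = sum(w for w in map(weight, support) if w <= observed)
    return Fraction(tail, comb(row1 + row2, col1))


def binomial_two_sided_fair(k, n):
    """Exact two-sided p-value for k successes in n trials against a fair coin."""
    tail = sum(comb(n, x) for x in range(n + 1) if abs(2 * x - n) >= abs(2 * k - n))
    return Fraction(tail, 2 ** n)


# Rows: treatment, control. Columns: decision correct, decision wrong.
p_arms = fisher_exact_two_sided(100, 0, 55, 45)
# Control arm alone against the 50% a no-information chooser would reach.
p_chance = binomial_two_sided_fair(55, 100)
print(f"arms: {float(p_arms):.1e}, control vs 50%: {float(p_chance):.2f}")
```

The first value should land near the recomputed 8.9e-17 quoted in the commit message. The second is large, which is what "indistinguishable from 50%" needs in order to hold.
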
+
+### 5.7 Threats to validity
 
 Single machine, single operator, mostly test traffic; the game scenario is
 synthetic even though all latencies are real; τ granularity is protocol-level;