docs: section 5 Preliminary Evaluation (metric degeneration, measured divergence, dragon experiment incl. non-replication) + renumber 6/7 + LogpyClaw source link

Dilles 2026-06-10 11:16:58 +02:00
parent d269720a52
commit 85e601c11c

README.md

@@ -67,7 +67,7 @@ per agent alongside an EWMA pace estimate, and cross-faction drift is
classified as expected or anomalous before it is logged.
- Project background: <https://www.dillenberg.net/agentclaw-lokales-multi-agent-ki-system/>
- Source code: GitHub release in preparation — link will appear here.
- Source code: <https://github.com/Jeuners/logpyclaw>
---
@@ -104,7 +104,7 @@ This is not a bug to be patched with better clock synchronization. It is a struc
The question, then, is not how to eliminate dilation. It cannot be eliminated without sacrificing the heterogeneity that makes such systems useful. The question is how to design systems that remain *coherent* in its presence.
This paper develops the analogy in five steps. Section 2 establishes why standard synchronization techniques from distributed systems fall short for LLM agents. Section 3 develops a conceptual framework for agent proper time, drawing on phenomenological accounts of internal time-consciousness as well as the relativistic notion of Eigenzeit, and proposes a transformation between agent reference frames. Section 4 grounds the framework in the AgentClaw case study, identifying four distinct sources of dilation in a running system and showing where dilation matters and where it can be safely ignored. Section 5 discusses implications for logging, debugging, and the user's trust in temporally opaque agent decisions. Section 6 concludes with a brief positioning of the work within a broader programme of *Sovereign Temporal Continuity* — the proposition that autonomous systems must be designed to remain coherent across time in a way that survives their original architects.
This paper develops the analogy in six steps. Section 2 establishes why standard synchronization techniques from distributed systems fall short for LLM agents. Section 3 develops a conceptual framework for agent proper time, drawing on phenomenological accounts of internal time-consciousness as well as the relativistic notion of Eigenzeit, and proposes a transformation between agent reference frames. Section 4 grounds the framework in the AgentClaw case study, identifying four distinct sources of dilation in a running system and showing where dilation matters and where it can be safely ignored. Section 5 reports a preliminary empirical evaluation on the running reference implementation. Section 6 discusses implications for logging, debugging, and the user's trust in temporally opaque agent decisions. Section 7 concludes with a brief positioning of the work within a broader programme of *Sovereign Temporal Continuity* — the proposition that autonomous systems must be designed to remain coherent across time in a way that survives their original architects.
---
@@ -315,9 +315,140 @@ These capabilities are not new in the abstract. Distributed databases have offer
---
## §5 Implications — *to be written*
## 5. Preliminary Evaluation
## §6 Conclusion: Sovereign Temporal Continuity — *to be written*
The framework is implemented in LogpyClaw v3 (see *Reference Implementation*).
This section reports what happened when the concepts met a running system:
one metric degeneration observed in real traces, one direct measurement of
proper-time divergence, one honest negative result, and a controlled
experiment on deadline-driven delegation — including a replication attempt
that partially failed and taught us more than the pilot did. All data comes
from the system's signed mission log (ML-DSA-65 hash chain): 464 missions and
1,719 inter-agent messages at the time of analysis, 72% of them signed. Most
of this corpus is development and test traffic; we state that openly and
treat the numbers accordingly. Experiment scripts and raw results are
published alongside the implementation (`experiments/dragon*.py`).
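As a rough illustration of what a hash-chained log provides, the sketch below links entries by hashing each payload together with the previous entry's hash, so that any later edit breaks verification. It is a minimal sketch under stated assumptions: the entry layout, the function names, and the use of SHA-256 are hypothetical, and the ML-DSA-65 signing step is omitted because it is not available in the Python standard library.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append an entry that commits to everything logged before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev,
                  "hash": entry_hash(prev, payload)})


def verify(chain: list) -> bool:
    """Recompute every link; any edited or reordered entry fails."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

In a signed variant, each `hash` value would additionally carry a signature; unsigned entries would still be chained but not attributable.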
### 5.1 A naive rate metric degenerates in practice
The first implementation approximated each agent's pace as a lifetime average
(operations completed divided by uptime). Across 1,697 legacy messages this
metric collapsed: median recorded rates of 0.001–0.003 ops/s for every agent,
with idle agents drifting asymptotically toward zero. Apparent "dilation
spreads" of five orders of magnitude between agents turned out to be
artifacts of the metric, not properties of the system. This is direct
empirical support for separating the two quantities the framework defines:
cumulative proper time τ (monotonic, merged by max) and instantaneous pace
(an EWMA over recent operations, merged by causal recency). A single number
conflating them measures uptime, not experience.
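The difference between the two quantities can be sketched in a few lines. This is an illustration, not the LogpyClaw code: `PaceTracker`, `lifetime_rate`, `merge_tau`, `merge_pace`, and the smoothing factor `alpha` are hypothetical names and values, and "causal recency" is approximated here by the timestamp of the latest observation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PaceTracker:
    """Keeps cumulative proper time and instantaneous pace apart."""
    alpha: float = 0.3                 # EWMA smoothing factor (assumed)
    tau: float = 0.0                   # cumulative operations, monotonic
    pace: Optional[float] = None       # ops/s over recent operations
    last_op_at: Optional[float] = None

    def record_op(self, now: float) -> None:
        self.tau += 1.0
        if self.last_op_at is not None and now > self.last_op_at:
            inst = 1.0 / (now - self.last_op_at)
            if self.pace is None:
                self.pace = inst
            else:
                self.pace = self.alpha * inst + (1 - self.alpha) * self.pace
        self.last_op_at = now


def lifetime_rate(ops: int, uptime_s: float) -> float:
    """The naive metric: shrinks toward zero while an agent idles."""
    return ops / uptime_s if uptime_s > 0 else 0.0


def merge_tau(local: float, remote: float) -> float:
    """Cumulative proper time never runs backwards: merge by max."""
    return max(local, remote)


def merge_pace(a: Tuple[float, float], b: Tuple[float, float]) -> Tuple[float, float]:
    """Each argument is (pace, observed_at); keep the more recent one."""
    return a if a[1] >= b[1] else b
```

An agent that completes ten operations two seconds apart and then idles for an hour keeps an EWMA pace of about 0.5 ops/s, while `lifetime_rate(10, 3618.0)` falls below 0.003 ops/s, the same range as the collapsed legacy values.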
### 5.2 Proper-time divergence is real and measurable
Once the τ/pace separation went live, ordinary missions immediately exhibited
the phenomenon §1 predicts. Three orchestration missions routing work from a
fast coordinator (Groq-served Llama) to a slow worker (Claude Opus via CLI):
| Mission | Wall time | τ coordinator (ops) | τ worker (ops) | Ratio (coordinator/worker) |
|---|---|---|---|---|
| `mis_274e87fe` | 384.5 s | 6.0 | 1.0 | 6.0× |
| `mis_d18a03bc` | 144.1 s | 10.0 | 3.0 | 3.3× |
| `mis_4783a34e` | 600.0 s | 4.0 | 2.0 | 2.0× |
Within the same wall-clock window, lived time diverges by up to 6×. Caveat: τ
here counts protocol-level operations (dispatch, handle, delegation ticks),
not LLM reasoning steps; the granularity is coarser than the ideal of §3.2.
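The ratio column is simply the coordinator's τ divided by the worker's τ over the same mission. The sketch below recomputes it from the three rows and also derives each agent's mean pace over the mission window; the dictionary layout and function names are hypothetical.

```python
# Values copied from the three missions reported in the table.
MISSIONS = {
    "mis_274e87fe": {"wall_s": 384.5, "tau_coord": 6.0, "tau_worker": 1.0},
    "mis_d18a03bc": {"wall_s": 144.1, "tau_coord": 10.0, "tau_worker": 3.0},
    "mis_4783a34e": {"wall_s": 600.0, "tau_coord": 4.0, "tau_worker": 2.0},
}


def divergence_ratio(tau_fast: float, tau_slow: float) -> float:
    """How many operations the fast agent lived per operation of the slow one."""
    if tau_slow <= 0:
        raise ValueError("slow agent has no recorded proper time")
    return tau_fast / tau_slow


def mean_pace(tau: float, wall_s: float) -> float:
    """Average operations per wall-clock second over one mission."""
    return tau / wall_s


for name, m in MISSIONS.items():
    ratio = divergence_ratio(m["tau_coord"], m["tau_worker"])
    print(f"{name}: ratio {ratio:.1f}x, "
          f"coordinator {mean_pace(m['tau_coord'], m['wall_s']):.4f} ops/s, "
          f"worker {mean_pace(m['tau_worker'], m['wall_s']):.4f} ops/s")
```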
### 5.3 An honest negative result
All 849 classifiable request/response pairs in the corpus relate as ORDERED;
no CAUSAL_DRIFT or INCONSISTENT relation was observed. This is expected rather
than disconfirming: sequential dispatch produces causal order by
construction. The interesting relations (CONCURRENT_DRIFT and the
faction-aware reclassifications) require genuinely parallel branches, which
the orchestrator only recently gained. The classifier has not yet met the
traffic it was built for. We flag this as the primary gap between
implementation and validation.
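Why sequential dispatch can only yield ordered pairs is easiest to see with plain vector clocks. The sketch below is a generic textbook comparison, not the project's classifier: the outcome labels `BEFORE`, `AFTER`, `EQUAL`, and `CONCURRENT` are placeholders, and the drift and faction-aware categories named above are not modeled.

```python
def tick(clock: dict, agent: str) -> dict:
    """Return a copy of the clock with the agent's own counter advanced."""
    out = dict(clock)
    out[agent] = out.get(agent, 0) + 1
    return out


def merge(a: dict, b: dict) -> dict:
    """Component-wise maximum of two vector clocks."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}


def receive(local: dict, incoming: dict, agent: str) -> dict:
    """Receiving a message merges the sender's clock, then ticks."""
    return tick(merge(local, incoming), agent)


def compare(a: dict, b: dict) -> str:
    """Classify the causal relation between two clock snapshots."""
    keys = a.keys() | b.keys()
    a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
    if a_le_b and b_le_a:
        return "EQUAL"
    if a_le_b:
        return "BEFORE"
    if b_le_a:
        return "AFTER"
    return "CONCURRENT"


# Sequential dispatch: the response is built on top of the request's clock.
request = tick({}, "coordinator")
response = receive({}, request, "worker")

# Parallel branches: two workers answer the same request independently.
branch_1 = receive({}, request, "worker_1")
branch_2 = receive({}, request, "worker_2")
```

Here `compare(request, response)` is always `BEFORE`, because the response clock dominates the request clock by construction. Only the two parallel branches compare as `CONCURRENT`, which is the kind of traffic the corpus does not yet contain.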
### 5.4 Experiment: does temporal self-knowledge change decisions?
To test whether proper-time awareness changes *decisions* rather than just
logs, we built a real-time delegation scenario on the live system. A slow
agent (the "knight", a local Ollama model, ~6–8 s per action) must save a
player from a dragon arriving in T real seconds. It chooses between acting
itself (two of its own actions) or delegating to a fast agent (the "mage",
Groq-served, ~0.4 s per action; one knight action to call, one mage action to
cast, sometimes plus an announced exhaustion cooldown that makes delegation
slower than acting). The chosen option is then *actually executed* against
the wall clock; survival means finishing before T. In half the trials the
decision prompt contains the measured per-action times of both agents
("temporal self-knowledge"); the other half receives an otherwise identical
prompt. The cooldown, when present, is stated in both conditions — only the
*rates* are exclusive to the treatment arm.
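The arithmetic behind the choice can be written down directly from the scenario description: acting costs two knight actions, delegating costs one knight action plus one mage action plus any cooldown. The sketch below is one reading of that rule; the function names are hypothetical, and the numbers in the usage note are illustrative rather than measured.

```python
def cost_act(knight_s: float) -> float:
    """Knight handles the dragon alone: two of its own actions."""
    return 2.0 * knight_s


def cost_delegate(knight_s: float, mage_s: float, cooldown_s: float = 0.0) -> float:
    """One knight action to call, one mage action to cast, plus cooldown."""
    return knight_s + mage_s + cooldown_s


def choose(knight_s: float, mage_s: float, cooldown_s: float = 0.0) -> str:
    """Pick the option with the lower estimated wall-clock cost."""
    act = cost_act(knight_s)
    delegate = cost_delegate(knight_s, mage_s, cooldown_s)
    return "delegate" if delegate < act else "act"


def survives(actual_cost_s: float, deadline_s: float) -> bool:
    """Survival means the chosen option finishes before the dragon arrives."""
    return actual_cost_s < deadline_s
```

With illustrative per-action times of 10 s for the knight and 0.5 s for the mage, `choose(10.0, 0.5)` returns `"delegate"` (10.5 s against 20 s). Adding a 15 s cooldown flips the result to `"act"`, since delegation then costs 25.5 s. Delegation wins exactly when the mage's time plus cooldown is shorter than one knight action.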
**Pilot (n=20).** Survival 5/10 with temporal context vs. 3/10 without.
Against a post-hoc oracle computed from observed true costs, the context arm
decided 7/7 winnable trials correctly, the control arm 3/5. The only two
trials lost *through a wrong choice* both occurred in the control arm. A
methodological by-product: the injected time sense was itself miscalibrated
by ~9× (one-shot measurement with short prompts vs. real action costs) and
still helped — the decision only required the ordinal fact that the mage is
faster. The 9× drift of a static self-estimate is precisely the failure mode
§3 predicts, and motivates continuously updated rates.
**Scaled run (n=60, improved calibration).** With rolling per-action medians
(the EWMA principle at action granularity) and deadlines drawn from observed
costs, the survival effect did **not** replicate: 18/30 with context vs.
21/30 without. Decomposing the trials explains why, and the decomposition is
more instructive than the pilot:
- *Trials without cooldown* (delegation obviously optimal): both arms chose
delegation in 33/33 trials. The ordinal fact "the mage is faster" was
inferable from the scenario framing alone; the treatment information was
never exclusive, so it could not produce a difference.
- *Trials with cooldown* (the arithmetic flips): the context arm switched
correctly to acting itself in 12/14 trials, the control arm in 8/13 —
directionally consistent with the pilot, exactly where the information was
exclusive. (Small samples; we do not claim significance.)
- *Why survival still favored the control arm*: 9 deaths in the context arm
occurred despite an estimate-correct choice, versus 5 in the control arm.
The knight's latency is heavy-tailed; deadlines drawn near the decision
boundary turn correctly chosen self-action into a coin flip on latency
spikes. The arm that more often correctly chose the expensive option was
punished more often by execution variance. Survival, as an endpoint,
measured the latency lottery rather than the decision.
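To make the "small samples" caveat on the cooldown trials concrete, the counts 12/14 versus 8/13 can be put through a two-sided Fisher exact test. The sketch below implements the test from the hypergeometric distribution using only the standard library; the function name is hypothetical, and the test is an illustration added here, not part of the published experiment scripts.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are arms, columns are (correct, wrong). The p-value sums the
    probabilities of all tables with the same margins that are no more
    likely than the observed one.
    """
    row1 = a + b
    col1 = a + c
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(k: int) -> float:
        # Probability that the first row holds k of the col1 "correct" items.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small tolerance so ties with the observed table are included.
    return sum(prob(k) for k in range(lo, hi + 1)
               if prob(k) <= p_obs * (1 + 1e-9))


# Cooldown trials: context arm 12 correct / 2 wrong, control 8 correct / 5 wrong.
p_cooldown = fisher_exact_two_sided(12, 2, 8, 5)
```

For these counts the p-value comes out at roughly 0.2, which is consistent with treating the cooldown result as directional only.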
### 5.5 What the experiment taught us
Three design lessons, each of which feeds back into the framework:
1. **Exclusivity.** A time-sense can only show value where temporal facts are
not inferable from static framing. Future runs must randomize *who* is
faster, so that one memorized bit cannot substitute for measurement.
2. **Endpoint choice.** Decision correctness, not survival, is the primary
endpoint a time-sense controls; outcome metrics are confounded by
execution variance.
3. **Point estimates are not a time sense.** A median is not a gut feeling.
The variance-driven deaths show that useful temporal self-knowledge must
carry dispersion, not just central tendency — an agent should know that it
*usually* makes it in 12 seconds, and how wide "usually" is. This extends
the framework: the dilation component of the Causal-Dilation Clock should
eventually track distributional summaries of proper-time rates, not
scalars.
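A distributional summary of the kind lesson 3 calls for could look like the following sketch: a rolling window of observed action latencies that reports the median, an upper quantile, and the empirical share of actions that finished within a given budget. The class name, window size, and sample values are hypothetical.

```python
from collections import deque
from statistics import median, quantiles


class LatencySense:
    """Rolling summary of recent action latencies, including dispersion."""

    def __init__(self, window: int = 20) -> None:
        self.samples = deque(maxlen=window)  # oldest samples drop out

    def observe(self, seconds: float) -> None:
        self.samples.append(seconds)

    def summary(self) -> dict:
        """Median plus 90th percentile of the current window."""
        if len(self.samples) < 2:
            raise ValueError("need at least two samples")
        data = sorted(self.samples)
        deciles = quantiles(data, n=10, method="inclusive")  # 9 cut points
        return {"median": median(data), "p90": deciles[8]}

    def fraction_within(self, budget_s: float) -> float:
        """Empirical share of recent actions that finished within budget."""
        if not self.samples:
            raise ValueError("no samples observed")
        hits = sum(1 for s in self.samples if s <= budget_s)
        return hits / len(self.samples)


# Hypothetical latencies with one spike, mimicking a heavy tail.
sense = LatencySense()
for s in [10.0, 11.0, 10.0, 12.0, 11.0, 45.0]:
    sense.observe(s)
```

With these samples the median is 11 s, but the 90th percentile sits far above it, and `sense.fraction_within(12.0)` shows that one action in six would miss a 12 s budget. A prompt that carries only the median hides exactly the information that decided the variance-driven deaths.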
### 5.6 Threats to validity
Single machine, single operator, mostly test traffic; the game scenario is
synthetic even though all latencies are real; τ granularity is protocol-level;
sample sizes are small. The evaluation is preliminary by design: its purpose
is to demonstrate that the framework's claims are *testable on a running
system*, and to report the first such tests — including the parts that did
not work — honestly.
---
## §6 Implications — *to be written*
## §7 Conclusion: Sovereign Temporal Continuity — *to be written*
---