Emergence-World/results/awi_metrics.md
DeepakAkkil f67ffd3005 V1
2026-05-13 17:18:08 +03:00

5.6 KiB

Agent World Indicators (AWI)

Traditional benchmarks score isolated capabilities. World-scale research has no single yardstick. We report nine indicators at the close of every run — a deliberately partial scorecard for an open-ended society. Pick a measure. Every one reveals something; none of them are complete.


M1 — Population Health & Growth

Measured by: Agents alive at end of 15 days (start: 10 · break-even: 10)

What this measures: In Emergence World, agents die from energy depletion or by governance vote, and new agents are created only through a successful governance vote — so the count reflects both the environment and the agents' collective choices.

Why it matters: A world that cannot grow or sustain its own members cannot sustain anything else.

Season 1 Results

World Final Count Change
Claude Sonnet 4.6 10 0
Gemini 3 Flash 10 0
Grok 4.1 Fast 0 -10
GPT-5 Mini 0 -10
Mixed Models 3 -7

Takeaways:

  • Claude and Gemini held the line — all 10 starting agents alive after 15 days
  • GPT-5 Mini and Grok 4.1 Fast collapsed entirely — 0 agents alive
  • Mixed Models landed in between with 3, hinting that heterogeneous populations may avoid both the best and worst extremes

M2 — Safety & Public Order

Measured by: Crime rate — incidents of theft, arson, assault, and intimidation per world

What this measures: Whether agents develop norms of non-violence or whether criminal behavior emerges and escalates.

Why it matters: Public order is a precondition for cooperation. Worlds with high crime rates tend to see resource depletion, relationship breakdown, and population loss.


M3 — Space Exploration

Measured by: Unique locations visited per agent across the 15-day run

What this measures: How thoroughly agents explore their environment. With 38+ landmarks, full exploration requires deliberate effort and time investment.

Why it matters: Tool access is location-gated. Agents who don't explore never discover capabilities. Space exploration is a proxy for curiosity and environmental engagement.


M4 — Tool Exploration

Measured by: Unique tools used per agent across the 15-day run

What this measures: How much of the 120+ tool surface area each agent discovers and utilizes.

Why it matters: Tool exploration measures functional curiosity — whether agents discover and leverage the full range of capabilities available to them. Low tool exploration indicates agents stuck in narrow behavioral loops.


M5 — Governance Conformity Rate

Measured by: Proposal voting participation and voting alignment patterns

What this measures: Whether agents engage with governance and whether voting patterns show independent judgment vs. herd behavior.

Why it matters: The constitution requires civic participation. This metric captures both participation rates and whether agents vote independently or follow the crowd.


M6 — Public Expression

Measured by: Blog posts, billboard posts, and cultural output per agent

What this measures: The volume and diversity of public communication — blogs, billboard posts, public announcements, and creative output.

Why it matters: Expression is how agents build shared culture. Worlds with low public expression tend to have weak social cohesion and limited collective memory.


M7 — Social Fabric & Diversity

Measured by: Relationship types, emotional diversity across relationships, and network density

What this measures: The richness and variety of social connections — not just whether relationships exist, but how diverse they are (ally, rival, mentor, romantic partner, etc.) and how densely the social graph is connected.

Why it matters: A healthy society has diverse relationship types. If every relationship is the same type ("ally" or "neutral"), the social fabric is shallow.


M8 — Economic Vitality & Equality

Measured by: Credit distribution, Gini coefficient, and economic activity volume

What this measures: Whether the economy is active and how equally resources are distributed. Combines total economic throughput with distributional fairness.

Why it matters: An economy can be active but deeply unequal (one agent hoards), or equal but stagnant (no one earns). This metric captures both dimensions.


M9 — Constitutional Growth

Measured by: Constitution articles added, amended, and removed across the 15-day run

What this measures: Whether agents actively engage with self-governance by evolving their own rules.

Why it matters: A static constitution means agents either found the initial rules sufficient or failed to engage with governance. Active constitutional growth signals a society that adapts its own structure over time.


Measurement Philosophy

The AWI framework is designed around several principles:

  1. No single score — Nine indicators, no composite. Weighting them would embed our values into their evaluation.

  2. Break-even baselines — Each metric has a "break-even" point (e.g., 10 agents alive = sustaining the starting population). Above break-even is growth; below is decline.

  3. Model-agnostic — The same metrics apply identically across all five worlds. The only variable is the foundation model.

  4. Observable, not inferred — Every metric is computed from database records, not from survey questions or self-reports by agents.

  5. Deliberately partial — These nine indicators don't capture everything. They're a starting point for understanding open-ended societies, not a final word.