Running benchmarks

Mesocosm supports local benchmarks (Ollama + your adapter) and platform benchmarks (cloud models on SWECC infrastructure). You can also use eval commands for targeted API-driven episodes.

Local vs platform

| | Local (run local) | Platform (run create) |
|---|---|---|
| Where | Your machine | SWECC cloud |
| Model | Ollama only (ollama/…) | Cloud providers (e.g. gemini/…, openai/…) |
| Env | python adapter.py | Submitted env or platform-hosted runtime |
| Auth | None required | Member or guest session |
| Registers domain | No | Uses existing domain from submit/register |

For the local workflow, see Local development.

Platform run: run create

Start a benchmark run on the platform:

mesocosm run create \
  --domain YOUR_DOMAIN_ID \
  --vow-version 1.0.0 \
  --model gemini/gemini-3.1-flash-lite \
  --episodes 5 \
  --parallel 2

Required flags

| Flag | Description |
|---|---|
| --domain | Domain id from env submit or legacy register |
| --vow-version | Binding vow version (e.g. 1.0.0) |
| --model | Model identifier (e.g. gemini/gemini-3.1-flash-lite, openai/gpt-4o-mini) |

Common optional flags

| Flag | Default | Description |
|---|---|---|
| --episodes | 1 | Number of episodes |
| --parallel | 1 | Max parallel episodes |
| --system-prompt | | Extra agent instruction |
| --temperature | 0.0 | Sampling temperature |
| --max-tokens | 512 | Max tokens per step |
| --team | active team | Explicit team id |
| --solo | | Do not attach active team |
| --visibility | | private or gallery_public |
| --env-id | | Pin a specific developer environment |
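
When runs are launched from a script, it can help to assemble the command in one place. The following is a minimal sketch, not part of Mesocosm: build_run_create is a hypothetical helper that only builds an argument list from the flags documented above and executes nothing. The validation rules (positive counts, --team and --solo treated as mutually exclusive) are assumptions of this sketch, not documented CLI behavior.

```python
from typing import List, Optional


def build_run_create(
    domain: str,
    vow_version: str,
    model: str,
    episodes: int = 1,
    parallel: int = 1,
    temperature: Optional[float] = None,
    max_tokens: Optional[int] = None,
    team: Optional[str] = None,
    solo: bool = False,
) -> List[str]:
    """Return an argv list for `mesocosm run create` (nothing is executed)."""
    if not (domain and vow_version and model):
        raise ValueError("--domain, --vow-version and --model are required")
    # Assumption: counts below 1 are rejected here; the CLI may behave differently.
    if episodes < 1 or parallel < 1:
        raise ValueError("episodes and parallel must be at least 1")
    # Assumption: --team and --solo are treated as mutually exclusive in this sketch.
    if team is not None and solo:
        raise ValueError("pass either team or solo, not both")

    argv = [
        "mesocosm", "run", "create",
        "--domain", domain,
        "--vow-version", vow_version,
        "--model", model,
        "--episodes", str(episodes),
        "--parallel", str(parallel),
    ]
    # Optional flags are emitted only when set, so the CLI defaults apply otherwise.
    if temperature is not None:
        argv += ["--temperature", str(temperature)]
    if max_tokens is not None:
        argv += ["--max-tokens", str(max_tokens)]
    if team is not None:
        argv += ["--team", team]
    if solo:
        argv.append("--solo")
    return argv


# Rebuild the example command from this page and print it for inspection.
cmd = build_run_create(
    "YOUR_DOMAIN_ID", "1.0.0", "gemini/gemini-3.1-flash-lite",
    episodes=5, parallel=2,
)
print(" ".join(cmd))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting issues.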

Auth

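run create uses your saved session (member or guest). To control team attribution, set an active team, pass --team to name a team explicitly, or pass --solo to run without attaching the active team.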

Inspect a run

mesocosm run get RUN_ID
mesocosm run episodes RUN_ID
mesocosm run episodes RUN_ID --traces

run get returns status and aggregate scores. run episodes lists episodes; --traces includes trace payloads.

Export for showcase

mesocosm run export RUN_ID -o showcase/my-replay.json

See Showcase.

Eval commands (API-driven)

For development and testing against the bench API without the full run create UX:

Single test episode

mesocosm eval test \
  --domain-id YOUR_DOMAIN_ID \
  --model openai/gpt-4o-mini \
  --seed 42

Optional: --vow-version, --env-url, --temperature, --max-tokens, --base-url.

Exits non-zero if the episode status is failed, cancelled, or error.
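
Because the exit code reflects the episode status, eval test can act as a pass/fail gate in a script or CI job. A minimal sketch, assuming mesocosm is on PATH; episode_passed and main are hypothetical names for this illustration, and the command uses only the flags shown above.

```python
import subprocess
import sys
from typing import List


def episode_passed(argv: List[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(argv)
    return completed.returncode == 0


# The eval test command from the example above.
EVAL_TEST = [
    "mesocosm", "eval", "test",
    "--domain-id", "YOUR_DOMAIN_ID",
    "--model", "openai/gpt-4o-mini",
    "--seed", "42",
]


def main() -> None:
    # Propagate a simple pass/fail status to the calling shell or CI system.
    sys.exit(0 if episode_passed(EVAL_TEST) else 1)
```

Calling main() from a CI step fails the step whenever the episode ends as failed, cancelled, or error.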

Multi-episode eval with aggregation

mesocosm eval run \
  --domain-id YOUR_DOMAIN_ID \
  --model openai/gpt-4o-mini \
  --num-episodes 3 \
  --seed-set '[1,2,3]'

By default, domains must be published; use --allow-draft for draft domains.
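
The --seed-set value in the example is a JSON array, so it can be generated rather than typed by hand. A minimal sketch: build_eval_run is a hypothetical helper that builds the argument list without running it. Setting --num-episodes to the number of seeds mirrors the example above and is an assumption; this page does not state that the two must match.

```python
import json
from typing import List, Sequence


def build_eval_run(
    domain_id: str,
    model: str,
    seeds: Sequence[int],
    allow_draft: bool = False,
) -> List[str]:
    """Return an argv list for `mesocosm eval run` (nothing is executed)."""
    if not seeds:
        raise ValueError("at least one seed is required")
    # Compact JSON without spaces, matching the '[1,2,3]' form in the example.
    seed_set = json.dumps(list(seeds), separators=(",", ":"))
    argv = [
        "mesocosm", "eval", "run",
        "--domain-id", domain_id,
        "--model", model,
        # Assumption: one episode per seed, as in the documented example.
        "--num-episodes", str(len(seeds)),
        "--seed-set", seed_set,
    ]
    if allow_draft:
        argv.append("--allow-draft")
    return argv


eval_cmd = build_eval_run("YOUR_DOMAIN_ID", "openai/gpt-4o-mini", [1, 2, 3])
print(eval_cmd)
```

When the list is passed to subprocess.run, the single quotes around the array in the shell example are not needed, since no shell parses the argument.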

Local run recap

python adapter.py                    # terminal 1
mesocosm run local --episodes 10     # terminal 2

A local run does not create platform runs or submit scores to the gallery.
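
The two-terminal setup can also be driven from a single script. A minimal sketch: run_with_background is a hypothetical helper that starts one process in the background, runs a second command, then stops the first. The fixed startup delay is an assumption, since this page does not describe how to detect that the adapter is ready.

```python
import subprocess
import sys
import time
from typing import List


def run_with_background(
    server_cmd: List[str],
    client_cmd: List[str],
    startup_delay: float = 2.0,
) -> int:
    """Start server_cmd, run client_cmd, stop the server, return the client's exit code."""
    server = subprocess.Popen(server_cmd)
    try:
        # Assumption: a fixed wait is enough for the background process to come up.
        time.sleep(startup_delay)
        if server.poll() is not None:
            raise RuntimeError("background process exited before the run started")
        return subprocess.run(client_cmd).returncode
    finally:
        # Always stop the background process, even if the client command failed.
        if server.poll() is None:
            server.terminate()
            try:
                server.wait(timeout=5)
            except subprocess.TimeoutExpired:
                server.kill()
                server.wait()


# Equivalent of "python adapter.py" using the current interpreter (terminal 1).
ADAPTER = [sys.executable, "adapter.py"]
# The local run command from the recap above (terminal 2).
LOCAL_RUN = ["mesocosm", "run", "local", "--episodes", "10"]
```

Calling run_with_background(ADAPTER, LOCAL_RUN) from the directory that contains adapter.py reproduces the recap in one step.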

End-to-end platform flow

mesocosm auth login
mesocosm env submit --name "My env" --github-url https://github.com/you/repo
mesocosm env list
 
mesocosm run create \
  --domain DOMAIN_ID \
  --vow-version 1.0.0 \
  --model gemini/gemini-3.1-flash-lite \
  --episodes 10
 
mesocosm run get RUN_ID
mesocosm run export RUN_ID -o showcase/replay.json