swe<<

mesocosm

A simple platform for you to submit and easily bench your environments

May 29th, 2026

Felix Wu


Definitions

With a Wordle example

Environment: the “thing” the model is being tested inside, e.g. the Wordle game

State: the environment’s current situation, e.g. secret word, guesses used & remaining, past guesses

State space: every situation the environment could ever be in

Action space: what the model is allowed to do, e.g. one valid 5-letter word guess

Observation: the part of the state the model can see, e.g. past guesses & colors, never the secret word

Episode: one single run of the environment, e.g. one Wordle game

[Figure: a Wordle board with an "Enter guess" field, annotated with the terms above. Hidden state shown: secret_word = "CRANE", guesses_used = 1, guesses_remaining = 5, past_guesses = ["SLATE"]]

Figure labels:

Episode: One full run of the environment.
Environment: The task/game the model is tested inside.
Observation: What the model is allowed to see.
Action space: One valid 5-letter word guess.
State space: Everything the environment knows internally.
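
The split between state and observation can be sketched in a few lines. This is a minimal illustration, not Mesocosm code: the `WordleState` class and the `observe` helper are hypothetical names, and the per-letter colors are left out to keep it short.

```python
from dataclasses import dataclass, field


@dataclass
class WordleState:
    """Full hidden state: everything the environment knows."""
    secret_word: str
    max_guesses: int = 6
    past_guesses: list[str] = field(default_factory=list)

    @property
    def guesses_used(self) -> int:
        return len(self.past_guesses)

    @property
    def guesses_remaining(self) -> int:
        return self.max_guesses - self.guesses_used


def observe(state: WordleState) -> dict:
    """Project the state down to what the model may see (no secret word)."""
    return {
        "past_guesses": list(state.past_guesses),
        "guesses_remaining": state.guesses_remaining,
    }


# Same values as the figure: one guess used, five remaining.
state = WordleState(secret_word="CRANE", past_guesses=["SLATE"])
obs = observe(state)
print(obs)
```

The observation is built from the state, never the other way around, so the secret word cannot leak by accident.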


The Environment Workflow.

What happens inside the environment after each action

reset(seed) creates a fresh hidden state — secret word, guesses used, past guesses.

It returns an observation with visible information only, not the secret answer.

It receives one action from the model — e.g. a valid 5-letter guess.

step(action) validates the guess, scores each letter, and updates internal state.

It returns a StepResult: next observation, reward, done flag, and info.

If the episode is done, close() cleans up. Otherwise, the loop repeats.

1. reset(seed)

2. observation

3. action

4. step(action)

5. reward + next observation

6. done? if not, repeat

7. close()
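
The seven steps above can be sketched as one driver function. This is an illustration under stated assumptions: `StepResult` here is a local stand-in with the four fields named on this slide (the real SDK type may differ), and `TinyWordle` and `run_episode` are hypothetical names.

```python
import random
from typing import NamedTuple


class StepResult(NamedTuple):
    # Local stand-in; field names follow the slide, not the real SDK.
    observation: dict
    reward: float
    done: bool
    info: dict


class TinyWordle:
    WORDS = ["crane", "slate", "pilot"]  # toy word list

    def reset(self, seed=None):
        # Same seed, same secret word.
        self._secret = random.Random(seed).choice(self.WORDS)
        self._used = 0
        self.closed = False
        return {"guesses_remaining": 6}

    def step(self, action):
        self._used += 1
        won = action == self._secret
        done = won or self._used >= 6
        obs = {"last_guess": action, "guesses_remaining": 6 - self._used}
        return StepResult(obs, 1.0 if won else 0.0, done, {"won": won})

    def close(self):
        self.closed = True


def run_episode(env, actions, seed):
    """reset -> (action -> step)* -> close, stopping when done."""
    obs = env.reset(seed)
    total = 0.0
    try:
        for action in actions:
            result = env.step(action)
            total += result.reward
            obs = result.observation
            if result.done:
                break
    finally:
        env.close()  # always release the episode, even on errors
    return total, obs


env = TinyWordle()
total, last_obs = run_episode(env, TinyWordle.WORDS, seed=42)
```

Because the scripted actions cover the whole toy word list, the episode always ends in a win, whichever word the seed picks.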


The Agent Loop.

How does Mesocosm use the model to play the environment?

Mesocosm treats the model like the Wordle player.

Bench turns the current observation into a model prompt.

The model reasons from feedback: green = fixed position, yellow = move letter, gray = avoid.

The model outputs one action: a single valid 5-letter guess.

Bench sends it into env.step(action) and records every observation, action, reward, and result as a trace.

ReAct: Observe · Think · Act · Repeat.

1. Observe

The model receives the current observation (the visible part of the environment state).

2. Think

The model reasons about what to do next.

3. Act

The model outputs one action from the action space.

4. Environment responds

It executes the action and returns reward + next observation.

Repeat

Continue until done, max steps, or timeout.
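
A rough sketch of this loop, with the trace recording described above. Everything here is illustrative: `build_prompt`, `agent_loop`, `make_toy_env`, and `scripted_model` are hypothetical names, and the "model" is a scripted stub rather than a real LLM call.

```python
from typing import Callable


def build_prompt(observation: dict) -> str:
    """Turn an observation into the text a model would be shown."""
    return f"Wordle. Feedback so far: {observation}. Reply with one 5-letter word."


def agent_loop(env_reset: Callable, env_step: Callable,
               model: Callable[[str], str], max_steps: int = 6) -> list[dict]:
    trace = []
    observation = env_reset()
    for _ in range(max_steps):
        prompt = build_prompt(observation)         # 1. Observe
        action = model(prompt).strip().lower()     # 2. Think, 3. Act
        next_obs, reward, done = env_step(action)  # 4. Environment responds
        trace.append({"observation": observation, "action": action,
                      "reward": reward, "done": done})
        observation = next_obs
        if done:
            break
    return trace


def make_toy_env(secret: str):
    """Toy environment: the observation is just the guess history."""
    history = []

    def reset():
        history.clear()
        return {"history": []}

    def step(action):
        history.append(action)
        won = action == secret
        return {"history": list(history)}, (1.0 if won else 0.0), won

    return reset, step


def scripted_model(guesses):
    """Stand-in for a model: replays a fixed list of guesses."""
    it = iter(guesses)
    return lambda prompt: next(it)


reset, step = make_toy_env("crane")
trace = agent_loop(reset, step, scripted_model(["slate", "crane"]))
```

Each trace entry pairs the observation the model saw with the action it chose and the reward that followed, which is enough to replay an episode later.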


THE PROCESS

From your idea to a
public showcase.

I. Initialize

Write your env

Create your public GitHub repo. Include: env.py, adapter.py, benchanything.json. (More on these later.)


II. Submit

Publish your env

Using the form or the API, submit your environment with your GitHub repo and username.


III. Onboard

Check the health

Status goes pending → cloning → ready. Confirm everything is healthy and ready to bench.


IV. Success

See the stats

Pick “Test Bench” or “Full Bench” to see how AI agents fare on your environments.


THE CONTRACT

If your env can answer these
four endpoints, we can bench it.

GET /health

Is the server alive?

The sandbox waits up to 30s for a 200 OK before marking your env ready.

POST /reset

Start a new episode.

Takes a seed — pick the secret word, deal the cards, return the first observation.

POST /step

Apply one action.

Return the next observation, a reward, and whether the episode terminated or truncated.

POST /close

Tear it down.

The episode is finished. Release any state you were holding for this episode_id.

Optional: POST /render — shows what the current environment looks like
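
To make the contract concrete, here is a sketch of the four handlers as plain functions behind a route table, with no HTTP server attached. The response shapes, the 404 behavior, and the names `dispatch`, `ROUTES`, and `EPISODES` are assumptions for illustration; only the endpoint paths, the `episode_id`/`seed`/`action` fields, and the `{"status": "ok"}` health reply come from the slides.

```python
import random

WORDS = ["crane", "slate", "pilot"]  # hypothetical word list
EPISODES: dict[str, dict] = {}       # per-episode state, keyed by episode_id


def health(_body):
    return 200, {"status": "ok"}


def reset(body):
    # The seed decides the secret word, so resets are reproducible.
    secret = random.Random(body.get("seed")).choice(WORDS)
    EPISODES[body["episode_id"]] = {"secret": secret, "used": 0}
    return 200, {"observation": {"guesses_remaining": 6}}


def step(body):
    ep = EPISODES.get(body["episode_id"])
    if ep is None:
        return 404, {"error": "unknown episode_id"}
    ep["used"] += 1
    won = body["action"] == ep["secret"]
    return 200, {
        "observation": {"guesses_remaining": 6 - ep["used"]},
        "reward": 1.0 if won else 0.0,
        "terminated": won or ep["used"] >= 6,
        "truncated": False,
    }


def close(body):
    # Release whatever was held for this episode_id.
    EPISODES.pop(body["episode_id"], None)
    return 200, {"closed": True}


ROUTES = {
    ("GET", "/health"): health,
    ("POST", "/reset"): reset,
    ("POST", "/step"): step,
    ("POST", "/close"): close,
}


def dispatch(method, path, body=None):
    handler = ROUTES.get((method, path))
    if handler is None:
        return 404, {"error": "not found"}
    return handler(body or {})
```

Keying state by `episode_id` is what lets one server hold several episodes at once and drop each one cleanly on /close.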


LOCAL CHECKLIST

Run the four endpoints by hand
before pushing.

terminal

# 1. start the adapter on a port you choose
$ python adapter.py --port 8765

# 2. is it alive?
$ curl -s http://localhost:8765/health
{"status": "ok"}

# 3. reset — same seed, same secret word
$ curl -sX POST http://localhost:8765/reset -d '{"episode_id":"x","seed":42}'

# 4. take a step
$ curl -sX POST http://localhost:8765/step -d '{"episode_id":"x","action":"crane"}'

# 5. and close it
$ curl -sX POST http://localhost:8765/close -d '{"episode_id":"x"}'


FILE ONE

env.py — the game/task logic.

Implements:

i. Hidden State — information the env knows but not the model
ii. Reward — a score returned after an action
iii. Termination — a condition that ends the episode
iv. reset(seed) — starts a new episode
v. step(action) — takes one action from the model and applies it

env.py

# env.py — Wordle, abridged (WORDLIST and score_guess are defined elsewhere)
import random

from src.env_sdk import BaseEnv, StepResult

class WordleEnv(BaseEnv):
    def reset(self, seed=None, **params):
        rng = random.Random(seed)
        self._secret = rng.choice(WORDLIST)
        self._guesses_used = 0
        return {"feedback": None, "guesses_remaining": 6}

    def step(self, action):
        guess = action.strip().lower()
        feedback = score_guess(guess, self._secret)
        self._guesses_used += 1
        won = guess == self._secret
        done = won or self._guesses_used >= 6
        return StepResult(
            observation={"last_guess": guess,
                         "feedback": feedback,
                         "guesses_remaining": 6 - self._guesses_used},
            reward=1.0 if won else 0.0,
            terminated=done, truncated=False,
            info={"won": 1.0 if won else 0.0},  # numeric, so terminal_field metrics can aggregate it
        )
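
The abridged env.py calls `score_guess` without showing it. One possible version, assuming standard Wordle rules and the green/yellow/gray labels used earlier in the deck (the return format is an assumption, not the author's actual helper):

```python
from collections import Counter


def score_guess(guess: str, secret: str) -> list[str]:
    """Return one of "green", "yellow", "gray" per letter of the guess."""
    feedback = ["gray"] * len(guess)

    # Secret letters that were not matched exactly stay available for yellows.
    unmatched = Counter(s for g, s in zip(guess, secret) if g != s)

    # First pass: exact position matches.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "green"

    # Second pass: right letter, wrong position, limited by remaining copies.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g != s and unmatched[g] > 0:
            feedback[i] = "yellow"
            unmatched[g] -= 1

    return feedback


print(score_guess("slate", "crane"))
# ['gray', 'gray', 'green', 'gray', 'green']
```

The two passes matter for repeated letters: a guess with three copies of a letter earns at most as many colored tiles as the secret has copies of it.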


FILE TWO

adapter.py — the server.

Functions:

i. imports your env
ii. accepts --port
iii. starts the HTTP server
iv. exposes /health, /reset, /step, /close (optional /render)

adapter.py

# adapter.py — wires your env to HTTP
import argparse

# Local import: use "from env", not "from template.env".
from env import WordleEnv
from src.env_sdk import serve

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--host", default="0.0.0.0")
    p.add_argument("--port", type=int, default=8765)
    args = p.parse_args()
    serve(WordleEnv, host=args.host, port=args.port)


FILE THREE

benchanything.json — the rulebook.

Tells us:

i. what file to run
ii. what the env is called
iii. what the agent sees
iv. what actions are valid
v. how reward works
vi. how scoring works

benchanything.json

{
  "adapter": "adapter.py",
  "name": "Wordle",
  "description": "6 tries to guess a 5-letter word.",
  "binding_vow": {
    "version": "1.0.0",
    "observation_space": { "type": "json" },
    "action_space": { "type": "text",
      "description": "A 5-letter lowercase word." },
    "reward": { "type": "binary",
      "range": { "low": 0.0, "high": 1.0 } },
    "episode": { "max_steps": 6,
      "deterministic_reset": true,
      "supports_seed": true }
  },
  "scoring": {
    "primary_metric": "win_rate",
    "higher_is_better": true,
    "metrics": [
      { "name": "win_rate", "type": "terminal_field",
        "field": "won", "aggregation": "pass_rate" }
    ]
  }
}


THE PACKAGE

The files to build everything.

You only need 3 files to create an environment and submit it. An optional fourth is recommended.

wordle-env/
    env.py               # task logic — the game
    adapter.py           # server wrapper · Mesocosm→env
    benchanything.json   # manifest, rulebook
    requirements.txt     # optional deps

integration_test.sh

# integration_test.sh — pseudocode

SET BASE_URL = MESOCOSM_BASE_URL or "https://api.swecc.org/bench"
SET REPO_URL = REPO_URL or "https://github.com/FWT-bs/environments"

PRINT "Mesocosm integration test"

# 1. Check required local files
REQUIRED_FILES = [benchanything.json, env.py, adapter.py, requirements.txt]
IF any required file is missing:  PRINT warning · STOP

# 2. Validate benchanything.json
IF benchanything.json is not valid JSON:  PRINT error · STOP
ELSE:  PRINT success

# 3. Check hosted API health + OpenAPI
GET {BASE_URL}/health        →  REPORT
GET {BASE_URL}/openapi.json   →  REPORT

# 4. Submit environment repo
POST {BASE_URL}/v1/developer/environments with:
    owner_id    = "FWT-bs"
    name        = "Tic Tac Toe Smoke Test"
    github_url  = REPO_URL
    description = "Compact tic-tac-toe benchmark"
REPORT success or failure

PRINT "Done."
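
Steps 1 and 2 of the pseudocode are purely local and easy to run before touching the hosted API. A minimal sketch, assuming the three core files are the required ones and leaving requirements.txt out of the check (that split is an assumption); `preflight` is a hypothetical helper name.

```python
import json
from pathlib import Path

# Assumption: only the three core files are treated as required here.
REQUIRED_FILES = ["benchanything.json", "env.py", "adapter.py"]


def preflight(repo_dir: str) -> list[str]:
    """Return a list of problems; an empty list means the local checks passed."""
    root = Path(repo_dir)

    # 1. Check required local files.
    problems = [f"missing required file: {name}"
                for name in REQUIRED_FILES if not (root / name).is_file()]

    # 2. Validate benchanything.json (syntax only, not the schema).
    manifest = root / "benchanything.json"
    if manifest.is_file():
        try:
            json.loads(manifest.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"benchanything.json is not valid JSON: {exc.msg}")

    return problems
```

Calling `preflight(".")` from the repo root returns an empty list when the files are in place and the manifest parses.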


You submit your env,
we handle the rest.

After you submit, we run your environment against agents, checking both that it is configured properly and how capable the different agents are.

Your ID / Handle

acme-corp

Environment Name

my coding bench

GitHub Repository URL

https://github.com/your-org/your-env

Repository must contain a benchanything.json at its root.

CLI

To install:
pip install swecc-mesocosm
mesocosm --version

To skip manual setup:
bench init


AUTOMATED GRADING

Your env emits signal.
Scoring decides the verdict.

The environment never grades itself. Each step() returns a reward and an info dict.

Your scoring block in benchanything.json turns those signals into metrics — pass/fail or a continuous score — ranked by a primary_metric.

One env, re-scored any way, without touching env code.

step() → reward + info
↓
episode → total_reward + terminal_info
↓
scoring.metrics — aggregate (mean / pass_rate …)
↓
primary_metric → leaderboard
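
The first hop of this pipeline, from per-step signals to one episode record, might look like the following. It assumes `total_reward` is the sum of step rewards and `terminal_info` is the last step's info dict; the slides name these fields but do not define them, and `summarize_episode` is a hypothetical helper.

```python
def summarize_episode(steps: list[dict]) -> dict:
    """Collapse per-step reward/info pairs into one episode record."""
    if not steps:
        raise ValueError("an episode needs at least one step")
    return {
        "total_reward": sum(s["reward"] for s in steps),  # assumed: sum of rewards
        "terminal_info": steps[-1]["info"],               # assumed: last step's info
        "num_steps": len(steps),
    }


# A three-guess Wordle game that ends in a win.
steps = [
    {"reward": 0.0, "info": {"won": 0.0}},
    {"reward": 0.0, "info": {"won": 0.0}},
    {"reward": 1.0, "info": {"won": 1.0}},
]
episode = summarize_episode(steps)
```

The env only supplies the per-step entries; everything after that is computed outside it.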


TWO FLAVORS

Pass/fail, or scored.

Same env, two scoring blocks. The aggregation is the dial.

pass_rate counts wins (reward 1.0/0.0). mean averages a graded reward in [0,1].

Raise pass_threshold to require a higher reward before an episode counts as a pass.

pass / fail

"scoring": {
  "primary_metric": "accuracy",
  "higher_is_better": true,
  "metrics": [
    { "name": "accuracy",
      "type": "episode_reward",
      "aggregation": "pass_rate" }
  ]
}

scored

"scoring": {
  "primary_metric": "mean_score",
  "higher_is_better": true,
  "metrics": [
    { "name": "mean_score",
      "type": "episode_reward",
      "aggregation": "mean" }
  ]
}
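
As a rough illustration of the two aggregations over the same episode rewards: the `>=` comparison against `pass_threshold` is an assumption about how the threshold is applied, and both function names are hypothetical.

```python
def pass_rate(values: list[float], pass_threshold: float) -> float:
    """Fraction of episodes whose value reaches the threshold (assumed >=)."""
    return sum(v >= pass_threshold for v in values) / len(values)


def mean(values: list[float]) -> float:
    """Plain average of the per-episode values."""
    return sum(values) / len(values)


# Four episodes with graded rewards in [0, 1].
rewards = [1.0, 0.5, 0.0, 0.75]

strict = pass_rate(rewards, pass_threshold=1.0)   # only full marks pass
lenient = pass_rate(rewards, pass_threshold=0.5)  # partial credit passes
average = mean(rewards)
print(strict, lenient, average)
# 0.25 0.75 0.5625
```

The same four episodes give three different headline numbers, which is the point of keeping scoring outside the env.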


RUBRICS DONE RIGHT

Combine criteria.
Guard the score.

i. Emit signal, not verdicts — let scoring decide
ii. Name a primary_metric that exists; set higher_is_better
iii. Be deterministic & seed-stable; set a real max_steps
iv. Emit clean numeric info for terminal_field metrics
v. Add secondary diagnostics to catch reward-hacking

multi-criteria rubric

"scoring": {
  "primary_metric": "success_rate",
  "higher_is_better": true,
  "metrics": [
    { "name": "success_rate",
      "type": "terminal_field", "field": "success",
      "aggregation": "pass_rate", "pass_threshold": 0.5 },
    { "name": "avg_reward",
      "type": "episode_reward", "aggregation": "mean" },
    { "name": "avg_steps",
      "type": "terminal_field", "field": "steps",
      "aggregation": "mean" }
  ]
}
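
Rules ii and iv of the checklist are mechanical enough to check in code before submitting. A sketch, assuming the scoring block has the shape shown in the rubric example; `validate_scoring` and `check_terminal_info` are hypothetical helpers, not part of any Mesocosm tooling.

```python
def validate_scoring(scoring: dict) -> list[str]:
    """Rule ii: primary_metric must name a real metric; higher_is_better must be set."""
    problems = []
    metrics = scoring.get("metrics", [])
    names = [m.get("name") for m in metrics]
    if scoring.get("primary_metric") not in names:
        problems.append("primary_metric does not match any metric name")
    if not isinstance(scoring.get("higher_is_better"), bool):
        problems.append("higher_is_better must be true or false")
    for m in metrics:
        if m.get("type") == "terminal_field" and not m.get("field"):
            problems.append(f"metric {m.get('name')!r} needs a 'field'")
    return problems


def check_terminal_info(scoring: dict, terminal_info: dict) -> list[str]:
    """Rule iv: every terminal_field metric needs a numeric value in info."""
    problems = []
    for m in scoring.get("metrics", []):
        if m.get("type") != "terminal_field":
            continue
        value = terminal_info.get(m.get("field"))
        # Strings such as "1.0" are rejected; bools pass because bool is an int.
        if not isinstance(value, (int, float)):
            problems.append(f"info[{m.get('field')!r}] is not numeric: {value!r}")
    return problems


scoring = {
    "primary_metric": "success_rate",
    "higher_is_better": True,
    "metrics": [
        {"name": "success_rate", "type": "terminal_field", "field": "success",
         "aggregation": "pass_rate", "pass_threshold": 0.5},
        {"name": "avg_reward", "type": "episode_reward", "aggregation": "mean"},
        {"name": "avg_steps", "type": "terminal_field", "field": "steps",
         "aggregation": "mean"},
    ],
}
```

Running both checks against one recorded episode catches the two most common slips: a renamed metric and a stringified number in info.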


DEMO · 01

Intertwine.

Don't lose your contexts.

Capture, sync & resume your AI coding sessions
Pick up any past conversation, from any directory
Share sessions with your teammates
CLI + Cloudflare Worker + S3 — built for teams

intertwine.dev

intertwine-site.pages.dev

Simon will demo it live.


DEMO · 02

KleoKlaw.

A job board you run from your texts.

Text a number to post or find a job — no app, no login
An SMS gateway routes your message to the backend
It tracks your thread and reads/writes listings
Replies come right back over text

kleoklaw.com

Humphrey will demo it live.


Q & A

Open floor — questions & brainstorm.
