swe<<

mesocosm

A simple platform for you to submit and easily bench your environments

May 29th, 2026

Felix Wu


Definitions

With a Wordle example

Environment: the “thing” the model is being tested inside, e.g. the Wordle game
State: the environment’s current situation, e.g. secret word, guesses used & remaining, past guesses
State space: every situation the environment could ever be in
Action space: what the model is allowed to do, e.g. one valid 5-letter word guess
Observation: the part of the state the model can see, e.g. past guesses & colors, but never the secret word
Episode: one single run of the environment, e.g. one Wordle game
[Figure: a Wordle board showing the first guess, SLATE, with an "Enter guess" field and an ENTER button. Callouts: Episode (one full run of the environment), Environment (the task/game the model is tested inside), Observation (what the model is allowed to see), Action space (one valid 5-letter word guess), State space (everything the environment knows internally). Hidden state shown: secret_word = "CRANE", guesses_used = 1, guesses_remaining = 5, past_guesses = ["SLATE"].]

The Environment Workflow.

What happens inside the environment after each action

reset(seed) creates a fresh hidden state: secret word, guesses used, past guesses.

The environment returns an observation containing visible information only, never the secret answer.

It then receives one action from the model, e.g. a valid 5-letter guess.

step(action) validates the guess, scores each letter, and updates internal state.

It returns a StepResult: next observation, reward, done flags (terminated, truncated), and info.

If the episode is done, close() cleans up. Otherwise, the loop repeats.

1. reset(seed)
2. observation
3. action
4. step(action)
5. reward + next observation
6. done? if not, repeat
7. close()
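The seven steps above can be sketched as a single loop. This is a minimal, self-contained illustration rather than Mesocosm's actual SDK: `StepResult` is a stand-in dataclass with an assumed shape, `ToyWordleEnv`, `WORDS`, and `run_episode` are hypothetical names, and letter scoring is left out to keep the lifecycle visible.

```python
import random
from dataclasses import dataclass, field

WORDS = ["crane", "slate", "pride"]  # tiny illustrative word list (assumption)


@dataclass
class StepResult:
    """Stand-in for the SDK's StepResult (assumed shape)."""
    observation: dict
    reward: float
    terminated: bool
    truncated: bool = False
    info: dict = field(default_factory=dict)


class ToyWordleEnv:
    MAX_GUESSES = 6

    def reset(self, seed=None):
        # Steps 1-2: build fresh hidden state, return only the visible part.
        self._secret = random.Random(seed).choice(WORDS)
        self._past = []
        self.closed = False
        return self._observe()

    def _observe(self):
        # The secret word never appears in the observation.
        return {"past_guesses": list(self._past),
                "guesses_remaining": self.MAX_GUESSES - len(self._past)}

    def step(self, action):
        # Steps 4-5: validate, update hidden state, return reward + observation.
        guess = action.strip().lower()
        if len(guess) != 5 or not guess.isalpha():
            raise ValueError(f"invalid guess: {action!r}")
        self._past.append(guess)
        won = guess == self._secret
        done = won or len(self._past) >= self.MAX_GUESSES
        return StepResult(self._observe(), 1.0 if won else 0.0,
                          terminated=done, info={"won": float(won)})

    def close(self):
        # Step 7: release per-episode state.
        self.closed = True


def run_episode(env, guesses, seed=0):
    """Drive one episode; the fixed guess list stands in for the model (step 3)."""
    env.reset(seed)
    total_reward = 0.0
    for guess in guesses:
        result = env.step(guess)
        total_reward += result.reward
        if result.terminated or result.truncated:  # step 6: done?
            break
    env.close()
    return total_reward
```

Because the secret is drawn from `random.Random(seed)`, calling `reset` twice with the same seed yields the same hidden state, which is what makes runs reproducible.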

The Agent Loop.

How does Mesocosm use the model to play the environment?

Mesocosm treats the model like the Wordle player.

Bench turns the current observation into a model prompt.

The model reasons from feedback: green = fixed position, yellow = move letter, gray = avoid.

The model outputs one action: a single valid 5-letter guess.

Bench sends it into env.step(action) and records every observation, action, reward, and result as a trace.

ReAct: Observe · Think · Act · Repeat.

1. Observe

The model receives the current observation, which is the visible part of the environment state.

2. Think

The model reasons about what to do next.

3. Act

The model outputs one action from the action space.

4. Environment responds

It executes the action and returns reward + next observation.

Repeat

Continue until done, max steps, or timeout.
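The loop above, written out as code. This is a minimal sketch in which a scripted `policy` function stands in for the model call. `FixedWordEnv`, `render_prompt`, and `run_agent` are hypothetical names, the tuple returned by `step` is a simplification, and timeouts are omitted.

```python
class FixedWordEnv:
    """Tiny stand-in environment with a fixed secret (illustration only)."""

    def __init__(self, secret="crane", max_guesses=6):
        self._secret = secret
        self._max = max_guesses

    def reset(self, seed=None):
        self._used = 0
        return {"last_guess": None, "guesses_remaining": self._max}

    def step(self, action):
        self._used += 1
        won = action == self._secret
        obs = {"last_guess": action,
               "guesses_remaining": self._max - self._used}
        return obs, (1.0 if won else 0.0), won or self._used >= self._max

    def close(self):
        pass


def render_prompt(observation):
    # 1. Observe: turn the current observation into text for the model.
    return (f"Last guess: {observation['last_guess']}. "
            f"Guesses remaining: {observation['guesses_remaining']}. "
            "Reply with one 5-letter word.")


def run_agent(env, policy, max_steps=6, seed=0):
    """Run observe -> think -> act until done or max_steps; return the trace."""
    trace = []
    observation = env.reset(seed)
    for _ in range(max_steps):
        prompt = render_prompt(observation)
        action = policy(prompt).strip().lower()            # 2-3. Think, then act.
        next_observation, reward, done = env.step(action)  # 4. Env responds.
        trace.append({"observation": observation, "action": action,
                      "reward": reward, "done": done})
        observation = next_observation
        if done:
            break
    env.close()
    return trace


# Usage: a scripted policy in place of a real model.
scripted = iter(["slate", "crane"])
trace = run_agent(FixedWordEnv(), lambda prompt: next(scripted))
print([(t["action"], t["reward"]) for t in trace])
# -> [('slate', 0.0), ('crane', 1.0)]
```

Keeping the policy as a plain callable means the same loop can be exercised with a scripted stub in tests and with a real model client in a bench run.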


THE PROCESS

From your idea to a public showcase.

I. Initialize

Write your env

Create your public GitHub repo. Include: env.py, adapter.py, benchanything.json. (More on these later.)

II. Submit

Publish your env

Using the form or the API, submit your environment with your GitHub repo and username.

III. Onboard

Check the health

Status goes pending → cloning → ready. Confirm everything is healthy and ready to bench.

IV. Success

See the stats

Pick “Test Bench” or “Full Bench” to see how AI agents fare on your environments.


THE CONTRACT

If your env can answer these four endpoints, we can bench it.

GET /health

Is the server alive?

The sandbox waits up to 30s for a 200 OK before marking your env ready.

POST /reset

Start a new episode.

Takes a seed — pick the secret word, deal the cards, return the first observation.

POST /step

Apply one action.

Return the next observation, a reward, and whether the episode terminated or truncated.

POST /close

Tear it down.

The episode is finished. Release any state you were holding for this episode_id.

Optional: POST /render — shows what the current environment looks like
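To make the contract concrete, here is a minimal sketch of the four required endpoints as one dispatch function, with no HTTP server attached. The request fields (`episode_id`, `seed`, `action`) and the `{"status": "ok"}` health body follow the slides; the response field names, the status codes, the `EPISODES` store, and the `handle` function itself are assumptions for illustration. A real adapter would normally get this from the SDK instead.

```python
import random

WORDS = ["crane", "slate", "pride"]  # illustrative word list (assumption)
EPISODES = {}  # episode_id -> per-episode hidden state


def handle(method, path, body=None):
    """Return (status_code, json_body) for one request."""
    body = body or {}
    if (method, path) == ("GET", "/health"):
        return 200, {"status": "ok"}
    if method != "POST":
        return 405, {"error": "method not allowed"}

    episode_id = body.get("episode_id")
    if path == "/reset":
        # Same seed -> same secret word.
        secret = random.Random(body.get("seed")).choice(WORDS)
        EPISODES[episode_id] = {"secret": secret, "used": 0}
        return 200, {"observation": {"guesses_remaining": 6}}

    if path in ("/step", "/close") and episode_id not in EPISODES:
        return 404, {"error": "unknown episode_id"}

    if path == "/step":
        state = EPISODES[episode_id]
        state["used"] += 1
        won = body.get("action", "").strip().lower() == state["secret"]
        return 200, {
            "observation": {"guesses_remaining": 6 - state["used"]},
            "reward": 1.0 if won else 0.0,
            "terminated": won or state["used"] >= 6,
            "truncated": False,
            "info": {"won": 1.0 if won else 0.0},
        }

    if path == "/close":
        # Release whatever was held for this episode_id.
        del EPISODES[episode_id]
        return 200, {"status": "closed"}

    return 404, {"error": "not found"}


print(handle("GET", "/health"))  # -> (200, {'status': 'ok'})
```

Keying all state by `episode_id` is what lets several episodes run against one server at the same time.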

LOCAL CHECKLIST

Run the four endpoints by hand before pushing.

terminal
# 1. start the adapter on a port you choose
$ python adapter.py --port 8765
# 2. is it alive?
$ curl -s http://localhost:8765/health
{"status": "ok"}
# 3. reset: same seed, same secret word
$ curl -sX POST http://localhost:8765/reset -d '{"episode_id":"x","seed":42}'
# 4. take a step
$ curl -sX POST http://localhost:8765/step -d '{"episode_id":"x","action":"crane"}'
# 5. and close it
$ curl -sX POST http://localhost:8765/close -d '{"episode_id":"x"}'

FILE ONE

env.py — the game/task logic.

Implements:

  • i. Hidden State: information the env knows but the model does not
  • ii. Reward: a score returned after an action
  • iii. Termination: a condition that ends the episode
  • iv. reset(seed): starts a new episode
  • v. step(action): takes one action from the model
env.py
# env.py: Wordle, abridged
# WORDLIST and score_guess are omitted from this abridged listing.
import random

from src.env_sdk import BaseEnv, StepResult

class WordleEnv(BaseEnv):
    def reset(self, seed=None, **params):
        # Seeded RNG: the same seed always picks the same secret word.
        rng = random.Random(seed)
        self._secret = rng.choice(WORDLIST)
        self._guesses_used = 0
        return {"feedback": None, "guesses_remaining": 6}

    def step(self, action):
        guess = action.strip().lower()
        feedback = score_guess(guess, self._secret)
        self._guesses_used += 1
        won = guess == self._secret
        done = won or self._guesses_used >= 6
        return StepResult(
            observation={"last_guess": guess,
                         "feedback": feedback,
                         "guesses_remaining": 6 - self._guesses_used},
            reward=1.0 if won else 0.0,
            terminated=done, truncated=False,
            # Numeric, not a string, so terminal_field metrics can aggregate it.
            info={"won": 1.0 if won else 0.0},
        )
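The listing above calls `score_guess` without showing it. One common way to produce Wordle-style feedback is a two-pass scan that handles repeated letters correctly. The sketch below assumes feedback is a list of `"green"` / `"yellow"` / `"gray"` strings, a format the slides do not specify.

```python
from collections import Counter


def score_guess(guess, secret):
    """Return 'green', 'yellow', or 'gray' for each letter of the guess."""
    if len(guess) != len(secret):
        raise ValueError("guess and secret must have the same length")

    feedback = ["gray"] * len(guess)
    # Secret letters that were not matched exactly are still available
    # to be reported as yellow.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)

    # Pass 1: exact position matches.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "green"

    # Pass 2: right letter, wrong position, limited by what is left.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g != s and remaining[g] > 0:
            feedback[i] = "yellow"
            remaining[g] -= 1

    return feedback


print(score_guess("slate", "crane"))
# -> ['gray', 'gray', 'green', 'gray', 'green']
print(score_guess("eerie", "crane"))
# -> ['gray', 'gray', 'yellow', 'gray', 'green']
```

The second call shows why the counter matters: the secret contains only one "e", already matched in the last position, so the two leading "e"s stay gray.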
adapter.py
# adapter.py — wires your env to HTTP
import argparse
from env import WordleEnv         # local import, not "from template.env"
from src.env_sdk import serve

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--host", default="0.0.0.0")
    p.add_argument("--port", type=int, default=8765)
    args = p.parse_args()
    serve(WordleEnv, host=args.host, port=args.port)

FILE TWO

adapter.py — the server.

Functions:

  • i. imports your env
  • ii. accepts --port
  • iii. starts the HTTP server
  • iv. exposes /health, /reset, /step, /close (optional /render)

FILE THREE

benchanything.json — the rulebook.

Tells us:

  • i. what file to run
  • ii. what the env is called
  • iii. what the agent sees
  • iv. what actions are valid
  • v. how reward works
  • vi. how scoring works
benchanything.json
{
  "adapter": "adapter.py",
  "name": "Wordle",
  "description": "6 tries to guess a 5-letter word.",
  "binding_vow": {
    "version": "1.0.0",
    "observation_space": { "type": "json" },
    "action_space": { "type": "text",
      "description": "A 5-letter lowercase word." },
    "reward": { "type": "binary",
      "range": { "low": 0.0, "high": 1.0 } },
    "episode": { "max_steps": 6,
      "deterministic_reset": true,
      "supports_seed": true }
  },
  "scoring": {
    "primary_metric": "win_rate",
    "higher_is_better": true,
    "metrics": [
      { "name": "win_rate", "type": "terminal_field",
        "field": "won", "aggregation": "pass_rate" }
    ]
  }
}
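Before pushing, it can help to sanity-check the manifest locally. The sketch below is a hypothetical `validate_manifest` helper, not part of Mesocosm. The set of required keys is inferred from the example above, and the checks Mesocosm actually runs at submission time are not stated in this deck.

```python
import json

REQUIRED_KEYS = ("adapter", "name", "binding_vow", "scoring")  # assumed from the example


def validate_manifest(manifest):
    """Return a list of human-readable problems; empty means none found."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in manifest]

    scoring = manifest.get("scoring", {})
    metrics = scoring.get("metrics", [])
    names = [m.get("name") for m in metrics]
    primary = scoring.get("primary_metric")
    if primary is None or primary not in names:
        problems.append("primary_metric does not name a defined metric")
    for metric in metrics:
        if metric.get("type") == "terminal_field" and "field" not in metric:
            problems.append(f"metric {metric.get('name')!r} needs a 'field'")

    episode = manifest.get("binding_vow", {}).get("episode", {})
    max_steps = episode.get("max_steps")
    if not isinstance(max_steps, int) or max_steps <= 0:
        problems.append("episode.max_steps must be a positive integer")
    return problems


manifest = json.loads("""
{
  "adapter": "adapter.py",
  "name": "Wordle",
  "binding_vow": {"episode": {"max_steps": 6}},
  "scoring": {
    "primary_metric": "win_rate",
    "metrics": [{"name": "win_rate", "type": "terminal_field",
                 "field": "won", "aggregation": "pass_rate"}]
  }
}
""")
print(validate_manifest(manifest))  # -> []
```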

THE PACKAGE

The files to build everything.

You only need 3 files to create an environment and submit it. An optional fourth is recommended.

wordle-env/
  env.py              # task logic: the game
  adapter.py          # server wrapper · Mesocosm→env
  benchanything.json  # manifest, rulebook
  requirements.txt    # optional deps
integration_test.sh
# integration_test.sh (pseudocode)
SET BASE_URL = MESOCOSM_BASE_URL or "https://api.swecc.org/bench"
SET REPO_URL = REPO_URL or "https://github.com/FWT-bs/environments"
PRINT "Mesocosm integration test"

# 1. Check required local files
REQUIRED_FILES = [benchanything.json, env.py, adapter.py, requirements.txt]
IF any required file is missing: PRINT warning · STOP

# 2. Validate benchanything.json
IF benchanything.json is not valid JSON: PRINT error · STOP
ELSE: PRINT success

# 3. Check hosted API health + OpenAPI
GET {BASE_URL}/health → REPORT
GET {BASE_URL}/openapi.json → REPORT

# 4. Submit environment repo
POST {BASE_URL}/v1/developer/environments with:
    owner_id = "FWT-bs"
    name = "Tic Tac Toe Smoke Test"
    github_url = REPO_URL
    description = "Compact tic-tac-toe benchmark"
REPORT success or failure
PRINT "Done."
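Steps 1 and 2 of the pseudocode are purely local, so they are easy to make runnable. The sketch below is one possible Python rendering. `preflight` is a hypothetical name, and the network steps (3 and 4) are deliberately left out because their request and response formats are not documented here.

```python
import json
from pathlib import Path

# Same list as the pseudocode above.
REQUIRED_FILES = ["benchanything.json", "env.py", "adapter.py", "requirements.txt"]


def preflight(repo_dir):
    """Return (ok, messages) for the local checks in steps 1 and 2."""
    repo = Path(repo_dir)

    # Step 1: every required file must exist.
    missing = [name for name in REQUIRED_FILES if not (repo / name).is_file()]
    if missing:
        return False, [f"missing file: {name}" for name in missing]

    # Step 2: the manifest must parse as JSON.
    try:
        json.loads((repo / "benchanything.json").read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return False, [f"benchanything.json is not valid JSON: {exc}"]

    return True, ["benchanything.json is valid JSON"]
```

Running `preflight(".")` from the repo root before submitting catches the two failure modes that need no server at all.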

You submit your env, we handle the rest.

After you submit, we run your environments against agents — testing both that your environment is configured properly and the capability of the different agents.

UI

Your ID / Handle: acme-corp
Environment Name: my coding bench
GitHub Repository URL: https://github.com/your-org/your-env
Repository must contain a benchanything.json at its root.

CLI

To install:
pip install swecc-mesocosm
mesocosm --version
To eliminate setup:
bench init

AUTOMATED GRADING

Your env emits signal; scoring decides the verdict.

The environment never grades itself. Each step() returns a reward and an info dict.

Your scoring block in benchanything.json turns those signals into metrics — pass/fail or a continuous score — ranked by a primary_metric.

One env, re-scored any way, without touching env code.

step() → reward + info
episode → total_reward + terminal_info
scoring.metrics → aggregate (mean / pass_rate …)
primary_metric → leaderboard

TWO FLAVORS

Pass/fail, or scored.

Same env, two scoring blocks. The aggregation is the dial.

pass_rate counts wins (reward 1.0/0.0). mean averages a graded reward in [0,1].

Raise pass_threshold to require a higher score before an episode counts as a pass.

pass / fail
"scoring": {
  "primary_metric": "accuracy",
  "higher_is_better": true,
  "metrics": [
    { "name": "accuracy",
      "type": "episode_reward",
      "aggregation": "pass_rate" }
  ]
}
scored
"scoring": {
  "primary_metric": "mean_score",
  "higher_is_better": true,
  "metrics": [
    { "name": "mean_score",
      "type": "episode_reward",
      "aggregation": "mean" }
  ]
}
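The difference between the two aggregations is small enough to show directly. This is a minimal sketch under stated assumptions: `aggregate` is a hypothetical helper, an episode is assumed to pass when its value is at least `pass_threshold`, and the default of 0.5 is borrowed from the rubric example in this deck rather than from any documented default.

```python
def aggregate(values, aggregation, pass_threshold=0.5):
    """Collapse per-episode values into a single metric.

    pass_threshold=0.5 is an assumed default, not a documented one.
    """
    if not values:
        raise ValueError("need at least one episode")
    if aggregation == "mean":
        return sum(values) / len(values)
    if aggregation == "pass_rate":
        passed = sum(1 for v in values if v >= pass_threshold)
        return passed / len(values)
    raise ValueError(f"unknown aggregation: {aggregation!r}")


rewards = [1.0, 0.25, 0.75, 0.0]
print(aggregate(rewards, "mean"))                           # -> 0.5
print(aggregate(rewards, "pass_rate"))                      # -> 0.5
print(aggregate(rewards, "pass_rate", pass_threshold=0.2))  # -> 0.75
print(aggregate(rewards, "pass_rate", pass_threshold=0.9))  # -> 0.25
```

With binary rewards the threshold hardly matters; with graded rewards it decides how much partial credit is enough.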
multi-criteria rubric
"scoring": {
  "primary_metric": "success_rate",
  "higher_is_better": true,
  "metrics": [
    { "name": "success_rate",
      "type": "terminal_field", "field": "success",
      "aggregation": "pass_rate", "pass_threshold": 0.5 },
    { "name": "avg_reward",
      "type": "episode_reward", "aggregation": "mean" },
    { "name": "avg_steps",
      "type": "terminal_field", "field": "steps",
      "aggregation": "mean" }
  ]
}

RUBRICS DONE RIGHT

Combine criteria.
Guard the score.

  • i. Emit signal, not verdicts — let scoring decide
  • ii. Name a primary_metric that exists; set higher_is_better
  • iii. Be deterministic & seed-stable; set a real max_steps
  • iv. Emit clean numeric info for terminal_field metrics
  • v. Add secondary diagnostics to catch reward-hacking
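To see how a multi-criteria scoring block might be applied, here is a minimal sketch that scores a batch of finished episodes. The episode record shape (`total_reward`, `terminal_info`) follows the names used earlier in this deck. The `score` and `metric_value` helpers, the `>=` threshold rule, and the fallback threshold of 0.5 are assumptions, not Mesocosm's implementation.

```python
def metric_value(metric, episode):
    """Pull one number out of a finished episode for the given metric."""
    if metric["type"] == "episode_reward":
        return float(episode["total_reward"])
    if metric["type"] == "terminal_field":
        return float(episode["terminal_info"][metric["field"]])
    raise ValueError(f"unknown metric type: {metric['type']!r}")


def score(scoring, episodes):
    """Return (all metric results, value of the primary metric)."""
    results = {}
    for metric in scoring["metrics"]:
        values = [metric_value(metric, ep) for ep in episodes]
        if metric["aggregation"] == "pass_rate":
            threshold = metric.get("pass_threshold", 0.5)  # assumed fallback
            values = [1.0 if v >= threshold else 0.0 for v in values]
        elif metric["aggregation"] != "mean":
            raise ValueError(f"unknown aggregation: {metric['aggregation']!r}")
        results[metric["name"]] = sum(values) / len(values)
    return results, results[scoring["primary_metric"]]


scoring = {
    "primary_metric": "success_rate",
    "metrics": [
        {"name": "success_rate", "type": "terminal_field", "field": "success",
         "aggregation": "pass_rate", "pass_threshold": 0.5},
        {"name": "avg_reward", "type": "episode_reward", "aggregation": "mean"},
        {"name": "avg_steps", "type": "terminal_field", "field": "steps",
         "aggregation": "mean"},
    ],
}
episodes = [
    {"total_reward": 1.0, "terminal_info": {"success": 1.0, "steps": 3}},
    {"total_reward": 0.0, "terminal_info": {"success": 0.0, "steps": 6}},
]
results, primary = score(scoring, episodes)
print(results)  # -> {'success_rate': 0.5, 'avg_reward': 0.5, 'avg_steps': 4.5}
print(primary)  # -> 0.5
```

Secondary metrics such as `avg_steps` do not affect the ranking here, but a sudden drop in steps alongside a jump in success rate is the kind of pattern worth inspecting for reward-hacking.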

DEMO · 01

Intertwine.

Don't lose your contexts.

  • Capture, sync & resume your AI coding sessions
  • Pick up any past conversation, from any directory
  • Share sessions with your teammates
  • CLI + Cloudflare Worker + S3 — built for teams
intertwine.dev
intertwine-site.pages.dev

Simon will demo it live.


DEMO · 02

KleoKlaw.

A job board you run from your texts.

  • Text a number to post or find a job — no app, no login
  • An SMS gateway routes your message to the backend
  • It tracks your thread and reads/writes listings
  • Replies come right back over text
kleoklaw.com

Humphrey will demo it live.


Q & A

Open floor — questions & brainstorm.