A simple platform for you to submit and easily bench your environments
May 29th, 2026
Felix Wu
With a Wordle example
What happens inside the environment after each action
reset(seed) creates a fresh hidden state — secret word, guesses used, past guesses.
It returns an observation with visible information only, not the secret answer.
It receives one action from the model — e.g. a valid 5-letter guess.
step(action) validates the guess, scores each letter, and updates internal state.
It returns a StepResult: next observation, reward, done flag, and info.
If the episode is done, close() cleans up. Otherwise, the loop repeats.
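The "scores each letter" part of step() can be sketched as below. This is only an illustration, assuming standard Wordle rules for repeated letters. The name score_guess matches the helper called in env.py, but its real implementation and return format are not shown in these slides, so the list of color strings is an assumption.

```python
from collections import Counter


def score_guess(guess: str, secret: str) -> list:
    """Return 'green', 'yellow', or 'gray' for each letter (assumed format)."""
    if len(guess) != len(secret):
        raise ValueError("guess and secret must have the same length")

    feedback = ["gray"] * len(guess)

    # Secret letters that were NOT matched in place; only these can earn a yellow.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)

    # First pass: exact position matches.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "green"

    # Second pass: right letter, wrong position, limited by how many are left.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g != s and remaining[g] > 0:
            feedback[i] = "yellow"
            remaining[g] -= 1

    return feedback
```

Doing greens first and then spending the leftover letter counts on yellows keeps a repeated letter in the guess from being marked yellow more times than it appears in the secret.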
How does Mesocosm use the model to play the environment?
Mesocosm treats the model like the Wordle player.
Bench turns the current observation into a model prompt.
The model reasons from feedback: green = fixed position, yellow = move letter, gray = avoid.
The model outputs one action: a single valid 5-letter guess.
Bench sends it into env.step(action) and records every observation, action, reward, and result as a trace.
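A rough sketch of the two translation steps described above: observation to prompt, and model reply to action. The prompt wording and the function names render_prompt and parse_action are hypothetical; Bench's actual prompt template and parsing rules are not shown in these slides.

```python
import re
from typing import Optional


def render_prompt(observation: dict) -> str:
    """Turn an observation dict into a text prompt (hypothetical wording)."""
    lines = ["You are playing Wordle. Reply with one 5-letter lowercase word."]
    if observation.get("last_guess"):
        lines.append(f"Last guess: {observation['last_guess']}")
        lines.append(f"Feedback: {observation['feedback']}")
    lines.append(f"Guesses remaining: {observation['guesses_remaining']}")
    return "\n".join(lines)


def parse_action(model_output: str) -> Optional[str]:
    """Take the last standalone 5-letter word in the reply as the guess."""
    words = re.findall(r"\b[a-zA-Z]{5}\b", model_output)
    return words[-1].lower() if words else None
```

Taking the last 5-letter word is a crude heuristic that assumes the model ends its reply with the guess; a stricter output format would be more robust.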
ReAct: Observe · Think · Act · Repeat.
The model receives the current environment state (observation).
The model reasons about what to do next.
The model outputs one action from the action space.
The environment executes the action and returns reward + next observation.
Continue until done, max steps, or timeout.
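The loop above, with its three stop conditions, might look roughly like this. Everything here is a stand-in: the StepResult dataclass copies only the field names shown in these slides, ToyEnv is a hypothetical fixed-secret environment, and run_episode is not Bench's actual runner.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StepResult:
    """Stand-in for the SDK's StepResult (field names taken from the slides)."""
    observation: dict
    reward: float
    terminated: bool
    truncated: bool
    info: dict = field(default_factory=dict)


class ToyEnv:
    """Hypothetical env with a fixed secret, just to exercise the loop."""

    def reset(self, seed=None):
        self._secret, self._used, self.closed = "crane", 0, False
        return {"feedback": None, "guesses_remaining": 6}

    def step(self, action):
        self._used += 1
        won = action == self._secret
        obs = {"last_guess": action, "guesses_remaining": 6 - self._used}
        return StepResult(obs, 1.0 if won else 0.0, won or self._used >= 6, False)

    def close(self):
        self.closed = True


def run_episode(env, policy, seed=0, max_steps=6, timeout_s=60.0):
    """Observe, act, repeat; stop on done, max steps, or timeout."""
    trace = []
    start = time.monotonic()
    observation = env.reset(seed=seed)
    for _ in range(max_steps):
        if time.monotonic() - start > timeout_s:
            break
        action = policy(observation)          # the model "thinks" and acts
        result = env.step(action)             # the environment executes
        done = result.terminated or result.truncated
        trace.append({"observation": observation, "action": action,
                      "reward": result.reward, "done": done})
        observation = result.observation
        if done:
            break
    env.close()
    return trace
```

Here policy is any callable that maps an observation to an action, so a scripted list of guesses can stand in for the model during local testing.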
THE PROCESS
Create your public GitHub repo. Include: env.py, adapter.py, benchanything.json. (More on these later.)
Using the form or the API, submit your environment with your GitHub repo and username.
Status goes pending → cloning → ready. Confirm everything is healthy and ready to bench.
Pick “Test Bench” or “Full Bench” to see how AI agents fare on your environments.
THE CONTRACT
Health check: the sandbox waits up to 30s for a 200 OK before marking your env ready.
reset: takes a seed, sets up the episode (pick the secret word, deal the cards), and returns the first observation.
step: returns the next observation, a reward, and whether the episode terminated or was truncated.
close: the episode is finished. Release any state you were holding for this episode_id.
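The 30-second readiness wait can be reproduced locally with a small polling helper. This is a sketch under assumptions: the slides do not name the health-check path, so the URL in the usage note is hypothetical, and the polling interval is arbitrary.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> int:
    """Return the HTTP status code for a GET on url (raises OSError on failure)."""
    with urllib.request.urlopen(url, timeout=2) as resp:
        return resp.status


def wait_until_ready(probe: Callable[[], int],
                     timeout_s: float = 30.0,
                     interval_s: float = 0.5) -> bool:
    """Poll probe() until it returns 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # server not up yet, or it answered with an error status
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, after starting adapter.py on its default port, wait_until_ready(lambda: http_probe("http://127.0.0.1:8765/health")) would report whether the env came up in time, assuming the adapter exposes a health route at that (hypothetical) path.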
LOCAL CHECKLIST
FILE ONE
Implements:
# env.py — Wordle, abridged
import random

from src.env_sdk import BaseEnv, StepResult

# WORDLIST and score_guess are elided in this abridged version.


class WordleEnv(BaseEnv):
    def reset(self, seed=None, **params):
        # A seeded RNG makes the secret word reproducible for a given seed.
        rng = random.Random(seed)
        self._secret = rng.choice(WORDLIST)
        self._guesses_used = 0
        # The observation exposes only visible information, never the secret.
        return {"feedback": None, "guesses_remaining": 6}

    def step(self, action):
        guess = action.strip().lower()
        feedback = score_guess(guess, self._secret)
        self._guesses_used += 1
        won = guess == self._secret
        # The episode ends on a win or after the sixth guess.
        done = won or self._guesses_used >= 6
        return StepResult(
            observation={"last_guess": guess,
                         "feedback": feedback,
                         "guesses_remaining": 6 - self._guesses_used},
            reward=1.0 if won else 0.0,
            terminated=done, truncated=False,
            info={"won": "1.0" if won else "0.0"},
        )
# adapter.py — wires your env to HTTP
import argparse

from env import WordleEnv  # local import, not "from template.env"
from src.env_sdk import serve

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--host", default="0.0.0.0")
    p.add_argument("--port", type=int, default=8765)
    args = p.parse_args()
    serve(WordleEnv, host=args.host, port=args.port)
FILE TWO
Functions:
FILE THREE
Tells us:
{
  "adapter": "adapter.py",
  "name": "Wordle",
  "description": "6 tries to guess a 5-letter word.",
  "binding_vow": {
    "version": "1.0.0",
    "observation_space": { "type": "json" },
    "action_space": { "type": "text",
                      "description": "A 5-letter lowercase word." },
    "reward": { "type": "binary",
                "range": { "low": 0.0, "high": 1.0 } },
    "episode": { "max_steps": 6,
                 "deterministic_reset": true,
                 "supports_seed": true }
  },
  "scoring": {
    "primary_metric": "win_rate",
    "higher_is_better": true,
    "metrics": [
      { "name": "win_rate", "type": "terminal_field",
        "field": "won", "aggregation": "pass_rate" }
    ]
  }
}
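Before submitting, a quick local sanity check on benchanything.json can catch obvious mistakes. The sketch below is an assumption-laden illustration: the slides do not state which keys the platform requires, so the required-key list simply mirrors the Wordle example, and check_manifest is a hypothetical helper, not part of the SDK.

```python
import json

# Assumed from the Wordle example; the platform's real schema may differ.
REQUIRED_TOP_LEVEL = ("adapter", "name", "binding_vow", "scoring")


def check_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in manifest:
            problems.append(f"missing top-level key: {key}")

    scoring = manifest.get("scoring", {})
    metrics = scoring.get("metrics", [])

    # primary_metric should name one of the declared metrics.
    primary = scoring.get("primary_metric")
    if primary not in [m.get("name") for m in metrics]:
        problems.append(f"primary_metric {primary!r} is not defined in metrics")

    # A terminal_field metric needs to say which info field it reads.
    for m in metrics:
        if m.get("type") == "terminal_field" and "field" not in m:
            problems.append(f"metric {m.get('name')!r} has no 'field'")

    return problems


def check_manifest_file(path: str = "benchanything.json") -> list:
    """Load the manifest from disk and run the same checks."""
    with open(path, encoding="utf-8") as f:
        return check_manifest(json.load(f))
```

Returning a list of problems rather than raising on the first one lets you fix everything in a single pass.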
THE PACKAGE
You only need 3 files to create an environment and submit it. An optional fourth is recommended.
After you submit, we run your environments against agents — testing both that your environment is configured properly and the capability of the different agents.
UI
CLI
AUTOMATED GRADING
The environment never grades itself. Each step() returns a reward and an info dict.
Your scoring block in benchanything.json turns those signals into metrics — pass/fail or a continuous score — ranked by a primary_metric.
One env, re-scored any way, without touching env code.
TWO FLAVORS
Same env, two scoring blocks. The aggregation is the dial.
pass_rate counts wins (reward 1.0/0.0). mean averages a graded reward in [0,1].
Raise pass_threshold to demand more partial credit before an episode counts as a pass.
"scoring": {
  "primary_metric": "accuracy",
  "higher_is_better": true,
  "metrics": [
    { "name": "accuracy",
      "type": "episode_reward",
      "aggregation": "pass_rate" }
  ]
}
"scoring": {
  "primary_metric": "mean_score",
  "higher_is_better": true,
  "metrics": [
    { "name": "mean_score",
      "type": "episode_reward",
      "aggregation": "mean" }
  ]
}
"scoring": {
  "primary_metric": "success_rate",
  "higher_is_better": true,
  "metrics": [
    { "name": "success_rate",
      "type": "terminal_field", "field": "success",
      "aggregation": "pass_rate", "pass_threshold": 0.5 },
    { "name": "avg_reward",
      "type": "episode_reward", "aggregation": "mean" },
    { "name": "avg_steps",
      "type": "terminal_field", "field": "steps",
      "aggregation": "mean" }
  ]
}
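To make the difference between the two aggregations concrete, here is a small sketch of how per-episode values might be rolled up. It assumes an episode passes when its value is greater than or equal to pass_threshold; the slides do not say whether the comparison is inclusive or what the default threshold is, so both function names and that rule are assumptions.

```python
from statistics import mean


def pass_rate(values, pass_threshold):
    """Fraction of episodes whose value meets the threshold (assumed >=)."""
    if not values:
        raise ValueError("need at least one episode")
    return sum(1 for v in values if v >= pass_threshold) / len(values)


def mean_score(values):
    """Plain average of the per-episode values."""
    if not values:
        raise ValueError("need at least one episode")
    return mean(values)


# Four episodes with graded rewards in [0, 1].
rewards = [1.0, 0.0, 0.6, 0.4]
half_credit = pass_rate(rewards, pass_threshold=0.5)   # 1.0 and 0.6 pass
full_credit = pass_rate(rewards, pass_threshold=1.0)   # only 1.0 passes
average = mean_score(rewards)
```

The same four episodes give different headline numbers depending on the aggregation and threshold, which is why the scoring block can change without touching env code.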
RUBRICS DONE RIGHT
DEMO · 01
Don't lose your contexts.


Simon will demo it live.
DEMO · 02
A job board you run from your texts.


Humphrey will demo it live.
Open floor — questions & brainstorm.