teaching a gpt to judge itself, part zero

baby datasets, baselines, and traces


lately, i’ve been obsessed with how companies evaluate their large language models. openai has evals, which is “a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.” anthropic has written about “constitutional ai,” where they have models critique and revise their own responses.

why not humans? well, according to them, having humans judge which of several model answers is best has some “shortcomings.” one obvious reason is that it doesn’t scale very well. llms need A LOT of data to work with, and humans simply can’t label fast enough to keep up with what a model needs to be well-trained. there are some ethical reasons, too, like requiring humans to read through a large amount of disturbing content. that’s why anthropic uses self-critique and fine-tuning techniques so the models do much of that work themselves.

all of this got me thinking - what if i tried to replicate this myself? and how exactly would monitoring these evaluations work, too? obviously i don’t have access to openai’s resources. i’m not training foundation models or fine-tuning on billions of tokens. but the core ideas of structured evaluation datasets, self-critique, and observability can definitely be explored at a smaller scale.

so i did some research, and found some helpful resources along the way.

i’m not training billion-parameter models or shipping production infra. but the patterns - structured datasets (openai), self-critique (anthropic), embeddings (hugging face), and traces (datadog/honeycomb) - felt like something i could shrink down into a side project.

and that’s where reasonsaver comes in. it’s my sandbox for shrinking these big-company ideas into something lightweight and personal. i’d use toy datasets instead of benchmarks, a tiny pytorch embedder instead of sentence-bert, and a baby trace.py instead of datadog dashboards. but that doesn’t mean the evaluation can’t be powerful.

first, i’ll lay the foundation: toy datasets to prove the flow, a baseline ranking strategy, and trace spans to make the process observable.
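to make that foundation a bit more concrete, here’s a rough sketch of how the three pieces could fit together. everything in it is an assumption for illustration, not reasonsaver’s actual code: the `TOY_CASES` data, the `Tracer` class, and the word-overlap baseline in `baseline_rank` are all made up here, and the baseline is deliberately naive.

```python
import time
from contextlib import contextmanager

# hypothetical toy dataset: a prompt, candidate answers, and the expected one
TOY_CASES = [
    {
        "prompt": "is a whale a fish or a mammal?",
        "candidates": ["a whale is a mammal", "bananas are yellow"],
        "expected": "a whale is a mammal",
    },
    {
        "prompt": "what color is the sky on a clear day?",
        "candidates": ["the sky is green on a clear day", "blue"],
        "expected": "blue",
    },
]


class Tracer:
    """baby tracer: collects one dict per finished span."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, name, **attrs):
        start = time.perf_counter()
        try:
            # the caller can add more attributes while the span is open
            yield attrs
        finally:
            duration = time.perf_counter() - start
            self.spans.append({"name": name, "duration_s": duration, **attrs})


def tokens(text):
    """lowercase word set, with question marks stripped."""
    return set(text.lower().replace("?", "").split())


def baseline_rank(prompt, candidates):
    """naive baseline: rank candidates by how many words they share with the prompt."""
    prompt_tokens = tokens(prompt)
    return sorted(
        candidates,
        key=lambda c: len(prompt_tokens & tokens(c)),
        reverse=True,
    )


def run_eval(cases, tracer):
    """rank every case inside a trace span and return top-1 accuracy."""
    correct = 0
    for i, case in enumerate(cases):
        with tracer.span("rank", case=i) as attrs:
            ranked = baseline_rank(case["prompt"], case["candidates"])
            attrs["top"] = ranked[0]
        correct += ranked[0] == case["expected"]
    return correct / len(cases)


if __name__ == "__main__":
    tracer = Tracer()
    print("accuracy:", run_eval(TOY_CASES, tracer))
    for s in tracer.spans:
        print(s)
```

the second toy case is there on purpose: word overlap rewards an answer that parrots the prompt, so the baseline picks the wrong candidate. that’s the kind of failure a smarter ranker should fix later, and the spans show which case it happened on.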

next, i’ll scale it up: integrating a lightweight pytorch embedder (hugging face–style), running across 10k+ structured test cases, and layering in self-critique scoring chains inspired by anthropic. the goal is to measure improvements in reasoning accuracy and make evaluation results reproducible - the same way openai does with evals, just at a smaller but still meaningful scale.
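a self-critique scoring chain could look something like the sketch below. this is only loosely inspired by the idea of a model critiquing its own answer, not anthropic’s actual method, and the prompts, the 1-5 scale, `self_critique`, `parse_score`, and `stub_model` are all hypothetical names made up for this example. the stub stands in for a real llm call so the chain gives the same result on every run.

```python
from dataclasses import dataclass
from typing import Callable

# a "model" here is any function from prompt text to completion text
Model = Callable[[str], str]


@dataclass
class Critique:
    answer: str
    critique: str
    score: int  # 1 (bad) to 5 (good), an assumed scale


def parse_score(text: str, default: int = 1) -> int:
    """return the first character in 1-5 found in the reply, else the default."""
    for ch in text:
        if ch in "12345":
            return int(ch)
    return default


def self_critique(model: Model, question: str) -> Critique:
    """three-step chain: answer, critique the answer, then score it."""
    answer = model(f"question: {question}\nanswer:")
    critique = model(
        f"question: {question}\nanswer: {answer}\n"
        "critique this answer in one sentence:"
    )
    rating = model(
        f"critique: {critique}\n"
        "rate the answer from 1 to 5. reply with one digit:"
    )
    return Critique(answer=answer, critique=critique, score=parse_score(rating))


def stub_model(prompt: str) -> str:
    """deterministic stand-in for a real llm, keyed on the prompt wording above."""
    if "reply with one digit" in prompt:
        return "4"
    if "critique this answer" in prompt:
        return "correct, but it skips the reasoning."
    return "4"


if __name__ == "__main__":
    print(self_critique(stub_model, "what is 2 + 2?"))
```

passing the model in as a plain function keeps the chain testable: the stub makes runs reproducible, and swapping in a real client later shouldn’t require touching `self_critique` itself.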

reasonsaver isn’t a throwaway experiment, though. it’s a hands-on framework for exploring how modern ai companies test their systems, and a way to sharpen the evaluation and infra instincts i’ll need as an engineer.

stay tuned for more! ✨

reason-saver: built by devs with a dream to make GPT think better.

Part 1 of 2

i’m building a reasoning lab to judge gpt’s logic harder than I judge my own life. reason-saver logs, scores, and critiques completions. this is my journey through llm tooling, prompt debugging, and other ml chaos.
