teaching a gpt to judge itself, part zero
baby datasets, baselines, and traces
lately, i’ve been obsessed with how companies are evaluating their large language models. openai has evals, which is “a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.” anthropic has written about “constitutional ai,” where they have models critique their own reasoning.
why not humans? well, according to them, having humans judge a model’s correctness across several answers has some “shortcomings.” one obvious reason is that it doesn’t scale very well. llms need A LOT of data to work with, and humans simply can’t keep up with the pace an ai needs to be well-trained. there are some ethical reasons, too, like requiring humans to interact with a lot of disturbing content. that’s why anthropic uses self-critique and fine-tuning techniques so that ais can do the work themselves.
all of this got me thinking - what if i tried to replicate this myself? and how exactly would monitoring these evaluations work, too? obviously i don’t have access to openai’s resources. i’m not training foundational models or fine-tuning with billions of tokens. but the core ideas of structured evaluation datasets, self-critique, and observability can definitely be explored at a smaller scale.
so after doing some research, i found some helpful resources:
hugging face tutorials show how to turn text into embeddings, aka representing language as vectors so you can measure similarity or cluster outputs.
even infra companies like datadog and honeycomb write about observability: using traces and spans to log what’s happening inside complex systems.
i also found the opentelemetry tracing guide + best practices super helpful; it explains spans and traces in a way you can actually apply on a toy project.
i’m not training billion-parameter models or shipping production infra. but the patterns - structured datasets (openai), self-critique (anthropic), embeddings (hugging face), and traces (datadog/honeycomb) - felt like something i could shrink down into a side project.
and that’s where reasonsaver comes in. it’s my sandbox for shrinking these big-company ideas into something lightweight and personal. i’d use toy datasets instead of benchmarks, a tiny pytorch embedder instead of sentence-bert, and a baby trace.py instead of datadog dashboards. but that doesn’t mean evaluation can’t be powerful.
first, i’ll lay the foundation: toy datasets to prove the flow, a baseline ranking strategy, and trace spans to make the process observable.
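here’s a rough sketch of what a toy dataset plus a baseline ranker might look like. this is an assumption about the shape, not the final reasonsaver code: the two test cases are made up, and the baseline is plain bag-of-words cosine similarity standing in for a real embedder. the point is just to have a number to beat later.

```python
import math
from collections import Counter

# made-up toy cases for illustration; "answer" is the index of the right candidate
TOY_CASES = [
    {
        "question": "what color is the sky on a clear day",
        "candidates": ["the sky is blue on a clear day", "grass is green", "the ocean is deep"],
        "answer": 0,
    },
    {
        "question": "how many legs does a spider have",
        "candidates": ["a cat has four legs", "a spider has eight legs", "birds have wings"],
        "answer": 1,
    },
]


def bag_of_words(text):
    """turn text into a sparse word-count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """cosine similarity between two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(question, candidates):
    """return candidate indices ordered from most to least similar to the question."""
    q = bag_of_words(question)
    scores = [cosine(q, bag_of_words(c)) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


def accuracy(cases):
    """fraction of cases where the top-ranked candidate is the labeled answer."""
    hits = sum(rank(c["question"], c["candidates"])[0] == c["answer"] for c in cases)
    return hits / len(cases)
```

word overlap is a weak signal (it rewards answers that repeat the question), which is exactly why it makes a good baseline: swapping bag_of_words for a learned embedder should show up as a measurable change in accuracy.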
next, i’ll scale it up: integrating a lightweight pytorch embedder (hugging face–style), running across 10k+ structured test cases, and layering in self-critique scoring chains inspired by anthropic. the goal is to measure improvements in reasoning accuracy and make evaluation results reproducible - the same way openai does with evals, just at a smaller but still meaningful scale.
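to give a feel for what a self-critique scoring chain could mean at this scale, here’s a hedged sketch. it is loosely inspired by the idea of a model critiquing an answer, not anthropic’s actual method, and the prompts, the 1 to 5 scale, and the function names are all my own assumptions. the model is passed in as a plain callable so the chain can run against a fake model with no api involved.

```python
from typing import Callable, Optional


def parse_score(text: str) -> Optional[int]:
    """accept only a bare integer from 1 to 5; anything else counts as unparseable."""
    try:
        value = int(text.strip())
    except ValueError:
        return None
    return value if 1 <= value <= 5 else None


def self_critique_score(question: str, answer: str, model: Callable[[str], str]) -> dict:
    """two-step chain: ask for a critique, then ask for a score based on that critique."""
    critique = model(
        f"question: {question}\nanswer: {answer}\nlist any errors in this answer."
    )
    verdict = model(
        f"critique: {critique}\nreply with a single score from 1 to 5."
    )
    return {"critique": critique, "score": parse_score(verdict)}


def fake_model(prompt: str) -> str:
    """stand-in for a real llm call, so the chain is testable offline."""
    if prompt.startswith("critique:"):
        return "4"
    return "the answer looks mostly right."
```

running self_critique_score("what is 2 + 2?", "4", fake_model) returns the canned critique and a score of 4. keeping the score parser strict matters for reproducibility: a reply that isn’t a clean number gets recorded as None instead of being guessed at.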
reasonsaver isn’t a throwaway experiment, though. it’s a hands-on framework for exploring how modern ai companies test their systems, and a way to sharpen the evaluation and infra instincts i’ll need as an engineer.
stay tuned for more! ✨
