teaching a gpt to judge itself, part one

baby pipelines, observability, and scale

last time, i laid out the theory: toy datasets, self-critique inspiration, and the idea of tracing evaluations. this week, i actually built it.

the mvp version of reasonsaver is a pipeline that takes prompts, runs completions, scores them, and saves evaluations — all in one flow. i wanted it to feel like more than a notebook experiment, so i wrapped it in a cli tool and proved it could scale beyond just a few samples.

building the flow

i started small: json files in, json files out. prompts.json → completions.json → evaluations.json. on top of that i added a lightweight scoring function using hugging face embeddings. the idea was to measure cosine similarity against a “reasoning anchor” baseline.
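
a minimal sketch of that scoring idea, not the actual reasonsaver code. the names here (cosine_similarity, make_toy_embedder, score_against_anchor) are hypothetical, and the toy word-count embedder is only a stand-in so the example runs without downloading a model. in the real pipeline the embed function would wrap a hugging face embedding model.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def make_toy_embedder(vocab: Sequence[str]) -> Callable[[str], List[float]]:
    """Stand-in for a real embedding model: counts vocabulary words in the text."""
    def embed(text: str) -> List[float]:
        counts = Counter(text.lower().split())
        return [float(counts[word]) for word in vocab]
    return embed


def score_against_anchor(
    completion: str,
    anchor: str,
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Score a completion by its similarity to the reasoning anchor text."""
    return cosine_similarity(embed(completion), embed(anchor))


# Hypothetical anchor vocabulary, purely for illustration.
embed = make_toy_embedder(["because", "therefore", "step"])
score = score_against_anchor("step one because", "because step", embed)
```

passing the embedder in as a function keeps the scoring logic independent of whichever embedding model sits behind it.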

once that worked, i wrapped it with argparse so i could call it straight from the terminal. this small shift — typing python cli.py evaluate instead of rerunning cells — made it feel like a real tool instead of a sandbox.
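
roughly what an argparse wrapper with an evaluate subcommand could look like. only python cli.py evaluate comes from the post. the --completions and --out flags and their defaults are assumptions made up for this sketch, and the handler just prints instead of running a real evaluation.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="cli.py",
        description="run the evaluation pipeline from the terminal",
    )
    # One subcommand per pipeline action; only "evaluate" is shown here.
    subparsers = parser.add_subparsers(dest="command", required=True)

    evaluate = subparsers.add_parser(
        "evaluate", help="score completions and save evaluations"
    )
    # Hypothetical flags; the file names mirror the json flow described above.
    evaluate.add_argument("--completions", default="completions.json")
    evaluate.add_argument("--out", default="evaluations.json")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    if args.command == "evaluate":
        # Placeholder for the real scoring step.
        print(f"evaluating {args.completions} -> {args.out}")
    return 0
```

in a script, main() would be called under the usual if __name__ == "__main__" guard, so that python cli.py evaluate --out run1.json works from the shell.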

observability matters

from the start, i wanted to know what was happening under the hood. i wired in opentelemetry spans, creating a parent span for each evaluation run and child spans for scoring and logging. i also added structured json logs with loguru so every step left a reproducible trail.

watching trace ids pop up in the terminal, and saving logs that linked prompts to scores, was more satisfying than i expected. it proved that evaluation could be transparent, not just opaque “model magic.”

scaling up

the toy runs went fine, but i needed a stress test. the eli5 dataset i planned to use had been deprecated (boo), so i pivoted to squad. with that, i pushed the pipeline through ten thousand test cases. the result: it held up. logs and spans captured the entire run, and the outputs were reproducible.
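
one way the squad rows could be turned into pipeline prompts, sketched under assumptions. it expects records shaped like the hugging face squad rows (with id, context, and question fields). the prompt template and the squad_to_prompts name are made up for illustration, and how the post's pipeline actually loaded and formatted the data isn't stated.

```python
from itertools import islice
from typing import Dict, Iterable, List


def squad_to_prompts(
    records: Iterable[Dict[str, object]],
    limit: int = 10_000,
) -> List[Dict[str, str]]:
    """Convert squad-style records into prompt entries, stopping at `limit`."""
    prompts = []
    for record in islice(records, limit):
        prompts.append(
            {
                "id": str(record["id"]),
                # Hypothetical template: context first, then the question.
                "prompt": f"context: {record['context']}\nquestion: {record['question']}",
            }
        )
    return prompts


# Tiny hand-written records standing in for the real dataset.
sample = [
    {"id": "a1", "context": "the sky scatters blue light.", "question": "why is the sky blue?"},
    {"id": "a2", "context": "water boils at 100 c at sea level.", "question": "when does water boil?"},
]
prompts = squad_to_prompts(sample, limit=1)
```

capping with islice means the same function works on a full dataset or a small sample without loading more rows than the run needs.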

along the way i hit a few roadblocks: numpy 2.0 broke pytorch compatibility, so i downgraded to 1.x. i also had to add batch flags to avoid burning tokens on massive runs. running 10k test cases didn’t cost as much as i expected (maybe $12-15), but going through all 10k in one pass would still have taken a while, and i wanted to save myself the waiting.
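
a sketch of the kind of batching those flags could drive. the iter_batches helper and its batch_size and max_batches parameters are hypothetical; the post doesn't say what the real flags are called. the idea is simply to cap how much of a large run gets processed in one go.

```python
from typing import Iterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(
    items: Sequence[T],
    batch_size: int,
    max_batches: Optional[int] = None,
) -> Iterator[List[T]]:
    """Yield consecutive batches, optionally stopping after `max_batches`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for index, start in enumerate(range(0, len(items), batch_size)):
        if max_batches is not None and index >= max_batches:
            return
        yield list(items[start:start + batch_size])


# Process only the first two batches of a larger run.
for batch in iter_batches(list(range(10)), batch_size=4, max_batches=2):
    print(batch)
```

going by the rough $12-15 figure for 10k cases, that works out to a bit over a tenth of a cent per test case, so a cap like this is more about time than money.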

what this means

reasonsaver is no longer just a sketch of an idea. it’s a working pipeline with observability built in, benchmarks at scale, and room to grow.

from a systems lens, it taught me how to combine llm evaluation with the same practices you’d expect in production infra: logging, tracing, reproducibility. from a career lens, it sharpened into a clear line on my resume: engineered a fault-tolerant evaluation pipeline for multi-candidate llm completions, integrated pytorch embeddings for reranking, and added structured logging with opentelemetry spans for observability across 10k+ test cases.

next steps

the foundation is here. what’s next: multi-candidate reranking, a fastapi endpoint for programmatic access, self-critique scoring chains, and eventually piping spans into jaeger or grafana dashboards.

openai and anthropic might run billion-token evaluations, but the same principles can exist at a smaller, personal scale. reasonsaver is my way of shrinking those patterns into something lightweight, but real.

part zero was about inspiration. this part is about proof.

stay tuned for part two. ✨

reason-saver: built by devs with a dream to make GPT think better.

Part 2 of 2

i’m building a reasoning lab to judge gpt’s logic harder than I judge my own life. reason-saver logs, scores, and critiques completions. this is my journey through llm tooling, prompt debugging, and other ml chaos.
