teaching a gpt to judge itself, part one
baby pipelines, observability, and scale
last time, i laid out the theory: toy datasets, self-critique inspiration, and the idea of tracing evaluations. this week, i actually built it.
the mvp version of reasonsaver is a pipeline that takes prompts, runs completions, scores them, and saves evaluations — all in one flow. i wanted it to feel like more than a notebook experiment, so i wrapped it in a cli tool and proved it could scale beyond just a few samples.
building the flow
i started small: json files in, json files out. prompts.json → completions.json → evaluations.json. on top of that i added a lightweight scoring function using hugging face embeddings. the idea was to embed each completion and measure its cosine similarity against the embedding of a “reasoning anchor” baseline.
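here is a minimal sketch of that scoring step, not the actual reasonsaver code. the record fields (prompt, completion, score) are my assumption about the json shape, and toy_embed is a hypothetical stand-in (a hashed bag of words) so the example runs without downloading a model. in the real pipeline a hugging face embedding function would be passed in as embed instead.

```python
import math
import zlib


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model: hashed bag of words."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # crc32 is stable across runs, unlike the built-in hash() on strings
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def score_completions(completions, anchor_text, embed=toy_embed):
    """Attach a similarity-to-anchor score to each completion record."""
    anchor_vec = embed(anchor_text)
    return [
        {
            "prompt": item["prompt"],
            "completion": item["completion"],
            "score": cosine_similarity(embed(item["completion"]), anchor_vec),
        }
        for item in completions
    ]
```

the list that score_completions returns is plain dicts, so it can go straight through json.dump into an evaluations file.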
once that worked, i wrapped it with argparse so i could call it straight from the terminal. this small shift — typing python cli.py evaluate instead of rerunning cells — made it feel like a real tool instead of a sandbox.
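a rough sketch of what that argparse wrapper could look like. only the evaluate subcommand comes from the post; the --prompts and --output flags and their defaults are hypothetical, and the body of the command is a placeholder where the pipeline call would go.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="cli.py", description="run the evaluation pipeline from the terminal"
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    evaluate = subparsers.add_parser("evaluate", help="score completions and save evaluations")
    # hypothetical flags, not confirmed by the post
    evaluate.add_argument("--prompts", default="prompts.json", help="input prompts file")
    evaluate.add_argument("--output", default="evaluations.json", help="where to write scores")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    if args.command == "evaluate":
        # placeholder: the real tool would run completions and scoring here
        print(f"evaluating {args.prompts} -> {args.output}")
        return 0
    return 1

# in cli.py, the file would end with: raise SystemExit(main())
```

with that in place, python cli.py evaluate runs with the defaults, and python cli.py evaluate --prompts other.json points it at a different file.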
observability matters
from the start, i wanted to know what was happening under the hood. i wired in opentelemetry spans, creating a parent span for each evaluation run and child spans for scoring and logging. i also added structured json logs with loguru so every step left a reproducible trail.
watching trace ids pop up in the terminal, and saving logs that linked prompts to scores, was more satisfying than i expected. it proved that evaluation could be transparent, not just opaque “model magic.”
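to make the shape of that trail concrete, here is a sketch that deliberately swaps in the standard library for both opentelemetry and loguru. it does not use either library's api. it only mimics the structure described above: one trace id per evaluation run, a parent span with scoring and logging children, and json log lines that tie a prompt to its score. every name in it (span, log_event, evaluate_one, LOG_LINES) is hypothetical.

```python
import json
import time
import uuid
from contextlib import contextmanager

LOG_LINES = []  # in-memory sink; a real setup would write to a file or exporter


def log_event(event, **fields):
    """Record one structured log line as a json string."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    LOG_LINES.append(line)
    return line


@contextmanager
def span(name, trace_id, parent_id=None):
    """Toy span: yields a span id and logs its duration when the block exits."""
    span_id = uuid.uuid4().hex[:16]
    start = time.perf_counter()
    try:
        yield span_id
    finally:
        log_event(
            "span_end",
            name=name,
            trace_id=trace_id,
            span_id=span_id,
            parent_id=parent_id,
            duration_ms=round((time.perf_counter() - start) * 1000, 3),
        )


def evaluate_one(prompt, completion, score_fn):
    """Run one evaluation under a parent span with scoring and logging children."""
    trace_id = uuid.uuid4().hex
    with span("evaluation_run", trace_id) as run_id:
        with span("scoring", trace_id, parent_id=run_id):
            score = score_fn(completion)
        with span("logging", trace_id, parent_id=run_id):
            log_event("evaluation", trace_id=trace_id, prompt=prompt, score=score)
    return trace_id, score
```

because every line carries the same trace_id, filtering the log by that id reconstructs the whole run: which prompt went in, what score came out, and how long each step took.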
scaling up
the toy runs went fine, but i needed to stress-test the pipeline. the eli5 dataset i planned to use had been deprecated (boo), so i pivoted to squad. with that, i pushed the pipeline through ten thousand test cases. the result: it held up. logs and spans captured the entire run, and the outputs were reproducible.
along the way i hit a few roadblocks. numpy 2.0 broke pytorch compatibility, so i downgraded to 1.x. i also had to add batch flags to avoid burning tokens on massive runs. running 10k test cases didn't cost as much as i thought, roughly $12-15, but it still would have taken a while to go through all 10k runs, and i wanted to save myself the waiting.
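one way batch flags like that could work underneath, as a sketch: a small generator that caps how many test cases are read and hands them out in fixed-size chunks. the batch_size and limit names are hypothetical and may not match the actual flags.

```python
from itertools import islice


def batched(items, batch_size, limit=None):
    """Yield lists of up to batch_size items, reading at most limit items in total."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # cap the input first so a small limit never touches the rest of the dataset
    iterator = iter(items if limit is None else islice(items, limit))
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

a quick smoke run can then use something like batched(cases, batch_size=50, limit=200) to spend tokens on 200 cases instead of all ten thousand.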
what this means
reasonsaver is no longer just a sketch of an idea. it’s a working pipeline with observability built in, benchmarks at scale, and room to grow.
from a systems lens, it taught me how to combine llm evaluation with the same practices you’d expect in production infra: logging, tracing, reproducibility. from a career lens, it sharpened into a clear line on my resume: engineered a fault-tolerant evaluation pipeline for multi-candidate llm completions, integrated pytorch embeddings for reranking, and added structured logging with opentelemetry spans for observability across 10k+ test cases.
next steps
the foundation is here. what’s next: multi-candidate reranking, a fastapi endpoint for programmatic access, self-critique scoring chains, and eventually piping spans into jaeger or grafana dashboards.
openai and anthropic might run billion-token evaluations, but the same principles can exist at a smaller, personal scale. reasonsaver is my way of shrinking those patterns into something lightweight, but real.
part zero was about inspiration. this part is about proof.
stay tuned for part two. ✨
