<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[nullpointerette]]></title><description><![CDATA[nullpointerette]]></description><link>https://www.nullpointerette.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1753294161657/cb5a680d-169a-42a6-a46b-3c636977c559.png</url><title>nullpointerette</title><link>https://www.nullpointerette.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 17 Apr 2026 01:16:36 GMT</lastBuildDate><atom:link href="https://www.nullpointerette.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[teaching a gpt to judge itself, part one]]></title><description><![CDATA[last time, i laid out the theory: toy datasets, self-critique inspiration, and the idea of tracing evaluations. this week, i actually built it.
the mvp version of reasonsaver is a pipeline that takes prompts, runs completions, scores them, and saves ...]]></description><link>https://www.nullpointerette.dev/teaching-a-gpt-to-judge-itself-part-one</link><guid isPermaLink="true">https://www.nullpointerette.dev/teaching-a-gpt-to-judge-itself-part-one</guid><dc:creator><![CDATA[nullpointerette]]></dc:creator><pubDate>Tue, 30 Sep 2025 21:19:04 GMT</pubDate><content:encoded><![CDATA[<p>last time, i laid out the theory: toy datasets, self-critique inspiration, and the idea of tracing evaluations. this week, i actually built it.</p>
<p>the mvp version of reasonsaver is a pipeline that takes prompts, runs completions, scores them, and saves evaluations — all in one flow. i wanted it to feel like more than a notebook experiment, so i wrapped it in a cli tool and proved it could scale beyond just a few samples.</p>
<h2 id="heading-building-the-flow">building the flow</h2>
<p>i started small: json files in, json files out. <code>prompts.json → completions.json → evaluations.json</code>. on top of that i added a lightweight scoring function using hugging face embeddings. the idea was to measure cosine similarity against a “reasoning anchor” baseline.</p>
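<p>to make the scoring idea concrete, here’s a minimal sketch. the toy <code>embed()</code> below is a stand-in for the real hugging face embedding model, and the anchor text is made up; only the cosine-against-an-anchor shape matches the actual pipeline:</p>
<pre><code class="lang-python"># toy embedding: character-frequency vector (a stand-in for a real
# hugging face embedding model)
import math

def embed(text):
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# score a completion against a "reasoning anchor" baseline
anchor = embed("first, break the problem into steps, then check each step")
score = cosine(embed("step one: identify the question. step two: verify it."), anchor)
</code></pre>
<p>swap in real embeddings and the rest of the flow stays identical: embed, compare to the anchor, save the score.</p>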
<p>once that worked, i wrapped it with argparse so i could call it straight from the terminal. this small shift — typing <code>python cli.py evaluate</code> instead of rerunning cells — made it feel like a real tool instead of a sandbox.</p>
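<p>the cli wrapper itself is just argparse with a subcommand. the flag names below are my guesses, not the actual reasonsaver flags:</p>
<pre><code class="lang-python">import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="cli.py")
    sub = parser.add_subparsers(dest="command", required=True)
    # "evaluate" runs the whole prompts -) completions -) evaluations flow
    ev = sub.add_parser("evaluate", help="run prompts through the full pipeline")
    ev.add_argument("--prompts", default="prompts.json")
    ev.add_argument("--out", default="evaluations.json")
    return parser

args = build_parser().parse_args(["evaluate", "--prompts", "prompts.json"])
</code></pre>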
<h2 id="heading-observability-matters">observability matters</h2>
<p>from the start, i wanted to know what was happening under the hood. i wired in opentelemetry spans, creating a parent span for each evaluation run and child spans for scoring and logging. i also added structured json logs with loguru so every step left a reproducible trail.</p>
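<p>the pattern looks roughly like this. to keep the sketch dependency-free i’m faking the opentelemetry and loguru pieces with the stdlib (the real code would use <code>tracer.start_as_current_span()</code> and a json log sink), but the shape is the same: one parent span per evaluation run, child spans per step, one structured log line each:</p>
<pre><code class="lang-python"># stdlib stand-in for opentelemetry spans + structured json logs
import contextlib, json, time, uuid

LOGS = []

@contextlib.contextmanager
def span(name, trace_id=None):
    # child spans inherit the parent's trace id
    trace_id = trace_id or uuid.uuid4().hex
    start = time.time()
    try:
        yield trace_id
    finally:
        LOGS.append(json.dumps({
            "span": name,
            "trace_id": trace_id,
            "duration_s": round(time.time() - start, 4),
        }))

with span("evaluation_run") as tid:
    with span("scoring", trace_id=tid):
        pass  # embed + cosine scoring happens here
    with span("logging", trace_id=tid):
        pass  # write evaluations.json here
</code></pre>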
<p>watching trace ids pop up in the terminal, and saving logs that linked prompts to scores, was more satisfying than i expected. it proved that evaluation could be transparent, not just opaque “model magic.”</p>
<h2 id="heading-scaling-up">scaling up</h2>
<p>the toy runs went fine, but i needed to stress test. the eli5 dataset i planned to use had been deprecated (boo), so i pivoted to squad. with that, i pushed the pipeline through ten thousand test cases. the results: it held up. logs and spans captured the entire run, and the outputs were reproducible.</p>
<p>along the way i hit a few roadblocks: numpy 2.0 broke pytorch compatibility, so i downgraded to 1.x. i also added batch flags to avoid burning tokens on massive runs. running 10k test cases didn’t cost as much as i feared (maybe $12-15), but going over all 10k runs would have taken a while, and i wanted to save myself the waiting.</p>
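<p>the batch flag boils down to chunking the test cases so a run can stop after a few batches instead of blasting through all 10k at once. the names here are illustrative, not the actual code:</p>
<pre><code class="lang-python">import itertools

def batches(cases, size):
    # yield fixed-size chunks until the cases run out
    it = iter(cases)
    while True:
        chunk = list(itertools.islice(it, size))
        if not chunk:
            return
        yield chunk

# e.g. only run the first two batches of a 10k-case dataset
first_two = list(itertools.islice(batches(range(10_000), 100), 2))
</code></pre>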
<h2 id="heading-what-this-means">what this means</h2>
<p>reasonsaver is no longer just a sketch of an idea. it’s a working pipeline with observability built in, benchmarks at scale, and room to grow.</p>
<p>from a systems lens, it taught me how to combine llm evaluation with the same practices you’d expect in production infra: logging, tracing, reproducibility. from a career lens, it sharpened into a clear line on my resume: engineered a fault-tolerant evaluation pipeline for multi-candidate llm completions, integrated pytorch embeddings for reranking, and added structured logging with opentelemetry spans for observability across 10k+ test cases.</p>
<h2 id="heading-next-steps">next steps</h2>
<p>the foundation is here. what’s next: multi-candidate reranking, a fastapi endpoint for programmatic access, self-critique scoring chains, and eventually piping spans into jaeger or grafana dashboards.</p>
<p>openai and anthropic might run billion-token evaluations, but the same principles can exist at a smaller, personal scale. reasonsaver is my way of shrinking those patterns into something lightweight, but real.</p>
<p>part zero was about inspiration. this part is about proof.</p>
<p>stay tuned for part two. ✨</p>
]]></content:encoded></item><item><title><![CDATA[my first blocknote contribution: closing the popover]]></title><description><![CDATA[i’ve always loved tools that make you forget there’s code underneath. everything feels clean, fluid, inevitable… the kind of products where it seems like ui just melts away.
but of course, every smooth interaction is hiding about a hundred tiny edge ...]]></description><link>https://www.nullpointerette.dev/blocknote-popover</link><guid isPermaLink="true">https://www.nullpointerette.dev/blocknote-popover</guid><dc:creator><![CDATA[nullpointerette]]></dc:creator><pubDate>Wed, 10 Sep 2025 19:44:16 GMT</pubDate><content:encoded><![CDATA[<p>i’ve always loved tools that make you forget there’s code underneath. everything feels clean, fluid, inevitable… the kind of products where it seems like ui just <em>melts away.</em></p>
<p>but of course, every smooth interaction is hiding about a hundred tiny edge cases. wanting to learn more about what makes a smooth product tick, i finally peeked behind the curtain and decided to contribute to <a target="_blank" href="https://www.blocknotejs.org">blocknote</a>, a notion-style text editor. i’ve been a notion fan for years, and blocknote felt like the perfect way to learn how these kinds of editors actually work, while giving something back to the tools i enjoy using.</p>
<h3 id="heading-finding-an-issue">finding an issue</h3>
<p>i didn’t start with a grand feature idea. i scrolled through open issues and saw <a target="_blank" href="https://github.com/TypeCellOS/BlockNote/issues/1696">#1696</a>: <em>“link popover does not close when using static formatting toolbar.”</em></p>
<p>low priority, and labeled “good first issue.” couldn’t have been more perfect.</p>
<h3 id="heading-setting-up">setting up</h3>
<p>blocknote is a pnpm/nx monorepo with packages for <code>core</code>, <code>react</code>, and multiple ui layers (<code>mantine</code>, <code>shadcn</code>, <code>ariakit</code>). the static toolbar lives in <code>mantine</code>.</p>
<p>it took me a hot minute (and A LOT of broken builds) to get the docs site running locally. i even gutted half the demo pages just to isolate the one example i needed: static formatting toolbar. it was super janky, but it worked.</p>
<h3 id="heading-the-fix">the fix</h3>
<p>the difference between the floating toolbar (which was the working one) and the static one (which didn’t) came down to state management:</p>
<ul>
<li><p>the floating toolbar’s plugin explicitly closes the popover after submit</p>
</li>
<li><p>the static toolbar never calls <code>setOpened(false)</code>, so the popover stays frozen in time and space.</p>
</li>
</ul>
<p>so the fix was literally like five lines in <a target="_blank" href="https://github.com/TypeCellOS/BlockNote/pull/xxxx/files"><code>CreateLinkButton.tsx</code></a>:</p>
<pre><code class="lang-tsx">editLink={(url) =&gt; {
  update(url);
  setOpened(false);
  editor.focus();
}}
</code></pre>
<p>it’s a tiny change, but now the popover closes like it should.</p>
<h3 id="heading-demo">demo</h3>
<p>before: link applied, popover stayed open.<br />after: link applied, popover closes.</p>
<p><img src="https://github.com/user-attachments/assets/3812adfb-4635-495b-842b-3bba8734076f" alt="screen recording" /></p>
<h3 id="heading-what-i-learned">what i learned</h3>
<ul>
<li><p>monorepos feel like a maze at first, but you get good at pattern-spotting fast.</p>
</li>
<li><p>many bugs are really just “someone forgot to update state.”</p>
</li>
<li><p>you don’t need a huge feature to start contributing to repos, even small ui fixes matter.</p>
</li>
</ul>
<p>it’s a tiny change, but it feels huge. i went from “i don’t even know where to start” to opening a pr that closes a bug.</p>
<p>makes me appreciate how much thought goes into these little interactions. that’s the kind of invisible polish i’ve always loved about products like figma, duolingo, slack, notion.</p>
<p>this isn’t the end, though! i’m so excited to dive in even deeper into blocknote. ✨</p>
]]></content:encoded></item><item><title><![CDATA[teaching a gpt to judge itself, part zero]]></title><description><![CDATA[lately, i’ve been obsessed with how companies are evaluating their large language models. openai has evals, which is “a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.” anthropic has written about "constituti...]]></description><link>https://www.nullpointerette.dev/teaching-a-gpt-to-judge-itself-part-zero</link><guid isPermaLink="true">https://www.nullpointerette.dev/teaching-a-gpt-to-judge-itself-part-zero</guid><dc:creator><![CDATA[nullpointerette]]></dc:creator><pubDate>Mon, 25 Aug 2025 00:26:11 GMT</pubDate><content:encoded><![CDATA[<p>lately, i’ve been obsessed with how companies are evaluating their large language models. openai has evals, which is “a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.” anthropic has written about "<a target="_blank" href="https://www.anthropic.com/news/claudes-constitution">constitutional ai</a>,” where they have models critique their own reasoning.</p>
<p>why not humans? well, according to them, having humans judge the correctness of a model from several answers has some “shortcomings.” one obvious reason is that it doesn’t scale very well. llms need A LOT of data to work with, and humans simply can’t keep up with the pace an ai needs to be well-trained. there are some ethical reasons, too, like requiring humans to sift through a lot of disturbing content. that’s why anthropic uses <a target="_blank" href="https://arxiv.org/abs/2212.08073">self-critique and fine-tuning techniques</a> so the models can do the work themselves.</p>
<p>all of this got me thinking - what if i tried to replicate this myself? and how exactly would monitoring these evaluations work, too? obviously i don’t have access to openai’s resources. i’m not training foundational models or fine-tuning with billions of tokens. but the core ideas of structured evaluation datasets, self-critique, and observability can definitely be explored at a smaller scale.</p>
<p>so after doing some research, i found some helpful resources:</p>
<ul>
<li><p><a target="_blank" href="https://huggingface.co/blog/getting-started-with-embeddings">hugging face tutorials</a> show how to turn text into <a target="_blank" href="https://huggingface.co/blog/static-embeddings">embeddings</a>, aka representing language as vectors so you can measure similarity or cluster outputs.</p>
</li>
<li><p>even infra companies like <a target="_blank" href="https://www.datadoghq.com/architecture/mastering-distributed-tracing-data-volume-challenges-and-datadogs-approach-to-efficient-sampling/">datadog</a> and <a target="_blank" href="https://www.honeycomb.io/blog/time-to-version-observability-signs-point-to-yes">honeycomb</a> write about observability: using traces and spans to log what’s happening inside complex systems.</p>
</li>
<li><p>i also found <a target="_blank" href="https://vfunction.com/blog/opentelemetry-tracing-guide/">opentelemetry tracing guide + best practices</a> super helpful; it explains spans and traces in a way you can actually apply on a toy project.</p>
</li>
</ul>
<p>i’m not training billion-parameter models or shipping production infra. but the <em>patterns</em> - structured datasets (openai), self-critique (anthropic), embeddings (hugging face), and traces (datadog/honeycomb) - felt like something i could shrink down into a side project.</p>
<p>and that’s where <em>reasonsaver</em> comes in. it’s my sandbox for shrinking these big-company ideas into something lightweight and personal. i’d use toy datasets instead of benchmarks, a tiny pytorch embedder instead of sentence-bert, and a baby <code>trace.py</code> instead of datadog dashboards. but that doesn’t mean evaluation can’t be powerful.</p>
<p>first, i’d lay the foundation: toy datasets to prove the flow, a baseline ranking strategy, and trace spans to make the process observable.</p>
<p>next, i’ll scale it up: integrating a lightweight pytorch embedder (hugging face–style), running across 10k+ structured test cases, and layering in self-critique scoring chains inspired by anthropic. the goal is to measure improvements in reasoning accuracy and make evaluation results reproducible - the same way openai does with evals, just at a smaller but still meaningful scale.</p>
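<p>the self-critique piece can be sketched as a chain too. everything below is a placeholder — in practice <code>critique()</code> would be a second model call judging the first answer, not word overlap:</p>
<pre><code class="lang-python">def critique(prompt, answer):
    # placeholder critic: reward answers that engage with the prompt's words
    prompt_words = set(prompt.split())
    overlap = prompt_words.intersection(answer.split())
    return len(overlap) / max(len(prompt_words), 1)

def self_critique_chain(prompt, candidates):
    # score every candidate with the critic, keep its favorite
    return max(candidates, key=lambda c: critique(prompt, c))

best = self_critique_chain(
    "why is the sky blue",
    ["the sky is blue because of rayleigh scattering", "idk, vibes"],
)
</code></pre>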
<p>reasonsaver isn’t a throwaway experiment, though. it’s a hands-on framework for exploring how modern ai companies test their systems, and a way to sharpen the evaluation and infra instincts i’ll need as an engineer.</p>
<p>stay tuned for more! ✨</p>
]]></content:encoded></item><item><title><![CDATA[why notion's "offline mode" is a big deal.]]></title><description><![CDATA[disclaimer: while i am a huge notion fan, i am not an employee (well, for now, anyway 😏), so this is just my best guess at what’s going on under the hood. still, as someone who geeks out over infra, i couldn’t resist talking about why this particula...]]></description><link>https://www.nullpointerette.dev/notion-offline-mode</link><guid isPermaLink="true">https://www.nullpointerette.dev/notion-offline-mode</guid><dc:creator><![CDATA[nullpointerette]]></dc:creator><pubDate>Sat, 23 Aug 2025 01:44:51 GMT</pubDate><content:encoded><![CDATA[<p><em>disclaimer: while i am a huge notion fan, i am not an employee (well, for now, anyway 😏), so this is just my best guess at what’s going on under the hood. still, as someone who geeks out over infra, i couldn’t resist talking about why this particular release matters.</em></p>
<h3 id="heading-first-the-old-world">first, the old world.</h3>
<p>for most of its life, notion was online-first, designed to only <em>really</em> work if you had the internet. every keystroke you made streamed to the backend, which acted as the single source of truth. this meant that when you typed a letter, the edit was immediately sent over the network to notion’s servers. the app passed your data along, and the official, “real” version of your notes lived in the cloud.</p>
<p>so basically, if you were offline:</p>
<ul>
<li><p>your device didn’t have a full copy of your workspace</p>
</li>
<li><p>edits couldn’t sync</p>
</li>
<li><p>you’d see things like blank pages or “loading” because the app needed to wait for the server to tell it what the document was supposed to look like</p>
</li>
</ul>
<p>now i can only make assumptions, but i’m sure a few people behind the scenes at notion said something like: “hmm..let’s do something about that.” and the project to make notion work offline was born.</p>
<h3 id="heading-next-the-new-shift">next, the new shift.</h3>
<p>to make offline mode a reality, the team probably needed to reconcile a core problem: <strong>syncing states</strong>.</p>
<p>as we discussed, when you’re online, every notion-motion travels to the backend, updates the database, and syncs back to other clients. offline breaks that assumption. suddenly:</p>
<ul>
<li><p>your edits need to be stored locally, not just in memory</p>
</li>
<li><p>changes need to be merged back into the cloud later</p>
</li>
<li><p>conflicts need to be handled gracefully (ex. two people editing the same page in different states)</p>
</li>
</ul>
<p>this isn’t just “cache some data.” it’s a replicated state machine problem — making sure local copies and the source of truth eventually agree, without losing the user’s work. to do that, you’d need to rethink how the editor worked at a fundamental level.</p>
<h3 id="heading-thats-where-crdts-come-in">that’s where crdts come in.</h3>
<p>as notion’s ceo <a target="_blank" href="https://twitter.com/ivanhzhao">ivan zhao</a> explained, the team solved this with one of the largest production crdt systems ever built.</p>
<p>crdts, or <strong>conflict-free replicated data types</strong>, are special data structures that make collaboration possible even when edits happen out of order.</p>
<p>here’s the trick: no matter what order changes arrive in, crdts guarantee the document will eventually converge to the same result.</p>
<p>🧩 think of it like lego instructions:</p>
<ul>
<li><p>one person adds a tower on the left</p>
</li>
<li><p>another person adds a tower on the right</p>
</li>
<li><p>even if they build them in different orders, when the instructions are merged, the final castle still has both towers.</p>
</li>
</ul>
<p>that’s what makes crdts so powerful for text editing. everyone can edit offline, and when they reconnect, all the edits merge without overwriting each other.</p>
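<p>to make the convergence property concrete, here’s the simplest crdt i know of: a grow-only set, where merge is just union. peritext is far more sophisticated (rich text needs ordering, formatting spans, deletions), but the guarantee is the same:</p>
<pre><code class="lang-python"># a toy grow-only-set crdt: merging in any order converges
class GSet:
    def __init__(self):
        self.items = set()

    def add(self, item):
        self.items.add(item)

    def merge(self, other):
        # union is commutative, so merge order can't matter
        merged = GSet()
        merged.items = self.items.union(other.items)
        return merged

left = GSet()
left.add("tower on the left")
right = GSet()
right.add("tower on the right")

merged_lr = left.merge(right).items
merged_rl = right.merge(left).items  # same castle either way
</code></pre>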
<p>notion’s team adopted <a target="_blank" href="https://www.inkandswitch.com/peritext/">peritext</a> — a crdt framework from the <a target="_blank" href="https://www.inkandswitch.com/">research lab</a> ink &amp; switch, designed specifically for rich text. and according to notion engineers, this rollout is one of the largest crdt deployments in history.</p>
<p>this is way more than “offline mode.” it’s a deep re-architecture of how the editor works at its core.</p>
<h3 id="heading-the-building-blocks"><strong>the building blocks</strong></h3>
<p>to pull this off, the team needed a whole new stack of infra:</p>
<ul>
<li><p><strong>local storage layer</strong>: large workspaces mean storing potentially gigabytes of content on-device. that requires efficient serialization, compression, and eviction policies.</p>
</li>
<li><p><strong>sync engine</strong>: a system to track deltas (what changed) instead of re-uploading entire pages. this keeps sync fast and bandwidth-light.</p>
</li>
<li><p><strong>conflict resolution</strong>: merges need to be automatic most of the time — but with a clear ux when human decisions are required. think crdts (conflict-free replicated data types) or operational transforms.</p>
</li>
<li><p><strong>consistency guarantees</strong>: notion’s promise is collaboration without surprises. that means prioritizing eventual consistency while making it feel real-time.</p>
</li>
<li><p><strong>resilience at scale</strong>: millions of clients coming online and syncing at once = load spikes. infra has to scale, throttle, and prioritize safely.</p>
</li>
</ul>
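<p>the sync-engine bullet is the easiest one to sketch: diff two snapshots of a page (mapping block id to content) and ship only the blocks that changed. this is just my guess at the shape, not notion’s actual protocol:</p>
<pre><code class="lang-python">def delta(last_synced, current):
    # blocks that are new or edited since the last sync
    changed = {}
    for block_id, content in current.items():
        if last_synced.get(block_id) != content:
            changed[block_id] = content
    # blocks that were deleted locally
    deleted = [b for b in last_synced if b not in current]
    return changed, deleted

changed, deleted = delta(
    {"b1": "hello", "b2": "old text"},
    {"b1": "hello", "b2": "new text", "b3": "brand new block"},
)
</code></pre>
<p>only <code>b2</code> and <code>b3</code> go over the wire; the unchanged <code>b1</code> stays home. that’s what keeps sync fast and bandwidth-light.</p>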
<h3 id="heading-why-it-matters"><strong>why it matters</strong></h3>
<ul>
<li><p><strong>reliability &amp; trust:</strong> notion is now “always there,” like virtual pen &amp; paper.</p>
</li>
<li><p><strong>accessibility</strong>: unlocks usage in low-connectivity regions.</p>
</li>
<li><p><strong>infra flex:</strong> solving crdts + sync at scale shows notion isn’t just a design tool, it’s a distributed systems company.</p>
</li>
<li><p><strong>future-proofing</strong>: the same infra powers faster load times, mobile-first workflows, and even edge-aware collaboration down the line.</p>
</li>
</ul>
<h3 id="heading-the-bigger-picture"><strong>the bigger picture</strong></h3>
<p>offline mode shows notion is willing to do the deep, boring, hard engineering work that nobody sees, but everyone feels. this is what separates a pretty app from a real platform.</p>
<p>common W for notion! ✨</p>
]]></content:encoded></item></channel></rss>