Young-Jin Park

PhD Candidate @ MIT • Graduating 2026


March 20, 2026

Can You Fully Trust Code Written by an LLM?

You need to trust me completely

Claude Code handles most tasks quite well—but honestly, it still makes subtle mistakes here and there.

For example, I once asked it to generate diverse outputs using a Qwen model. Hours passed and the job hadn’t finished. When I checked, it was looping with batch size 1.

The frustrating part? When I pointed it out, it fixed the issue immediately. In other words, it knew how to do it—it just didn’t think to do it without guidance.
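
The failure mode above is easy to sketch in isolation. The snippet below is a toy illustration, not the actual code from that session: `generate` is a stub standing in for an expensive model call (such as `model.generate` in a typical inference library), where the dominant cost is per call, not per prompt.

```python
def generate(batch_of_prompts):
    # Stub for an expensive model call; in real inference, one call on a
    # batch is far cheaper than many calls on single prompts.
    return [p.upper() for p in batch_of_prompts]

def generate_one_by_one(prompts):
    # The pattern the agent produced: one model call per prompt
    # (effectively batch size 1), which is what made the job crawl.
    return [generate([p])[0] for p in prompts]

def generate_batched(prompts, batch_size=8):
    # The fix it already "knew": chunk prompts and call once per chunk.
    outputs = []
    for i in range(0, len(prompts), batch_size):
        outputs.extend(generate(prompts[i:i + batch_size]))
    return outputs
```

Both functions return identical outputs, which is exactly why the bug is invisible unless you look at wall-clock time or the loop itself.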

Strictly speaking, my prompt was ambiguous, so that’s on me. But even accepting that, the question kept nagging at me: do I have to spell out every detail in the prompt? Or review every line of generated code myself? If so, why am I using Claude in the first place?

Managing at Scale

This reminded me of a nearly identical dilemma I faced while mentoring interns at NAVER and MIT.

Early in my career, I did everything myself—no tools like Claude existed—so I understood my code and my work down to the lowest level. But once I started working with interns, I had to accept that I couldn’t review every line they wrote. Technically I could—but it wasn’t scalable. And blindly trusting their output wasn’t a sustainable strategy either.

What I naturally converged to was asking counterfactual questions about the work they brought back. How did things look under scenario A? What were the results for B? How did C differ? I often already had a rough sense of the answers. Honestly, it was less about curiosity and more about verification—checking whether the outcomes that should appear, given the approach I had in mind, actually did.

I realized the same applies to Claude: if I at least have a rough sense of what the output should look like, the reliability of using Claude’s output goes up significantly. The details don’t need to match exactly. But when something clearly doesn’t make sense, that’s a strong signal something went wrong.1


Stochastic Blackbox

At least for now, I think we should treat LLM agents as Stochastic Blackbox Tools.2

The whole point of using Claude is productivity. Reading through every line it writes defeats that purpose. (And honestly, I’m not even confident I’d catch all the bugs anyway.) So instead, why not embrace the blackbox view—and focus on how to evaluate and guarantee the reliability of its outputs?

This idea isn’t entirely new. When I worked on the Instagram Ad Ranking team, this was exactly how things operated. Engineers built models; the company designed and provided a statistically reliable evaluation tool. You ran your model through the pipeline, got a score, and that score represented the model’s reliability. The evaluation tool didn’t read your code.
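
A minimal sketch of that kind of blackbox gate, with hypothetical names, might look like this: the harness scores a candidate model purely through its predictions on a held-out evaluation set, and never inspects how the model was built.

```python
def evaluate(predict, eval_set, threshold=0.8):
    """Score a candidate model's predict function on held-out pairs.

    The harness only sees inputs and outputs; whether `predict` is
    hand-written, LLM-generated, or a learned model is irrelevant.
    Returns (score, passed).
    """
    correct = sum(1 for x, y in eval_set if predict(x) == y)
    score = correct / len(eval_set)
    return score, score >= threshold
```

Every candidate, no matter who or what wrote it, goes through the same gate, and the score is the interface.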


Building a Playground for the Agent

Designing a good evaluation protocol takes time. But once it’s in place, it massively expands what you can safely delegate to an agent. What I’ve ultimately come to feel is:

Writing good prompts matters—but even more important is investing time upfront in clearly defining what problem you’re trying to solve, and what the output should look like when it’s solved.

Another key insight: looking only at the final output isn’t enough—you need to inspect intermediate results too. Just like dense rewards are more effective than sparse rewards in RL, LLM agents benefit from more frequent, structured evaluation signals.
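
One way to make that "dense signal" idea concrete is to attach a validity check to each stage of a pipeline, so a bad intermediate result fails loudly at its own stage instead of silently corrupting the final output. The sketch below is illustrative; the stage and check names are invented for the example.

```python
def run_pipeline(data, stages):
    """Run `data` through stages, checking each intermediate result.

    `stages` is a list of (name, transform, check) triples. A failed
    check raises immediately, pinpointing which step went wrong, which
    is the dense-reward analogue of inspecting intermediate results.
    """
    for name, transform, check in stages:
        data = transform(data)
        if not check(data):
            raise ValueError(f"stage {name!r} produced an invalid result")
    return data
```

With only a final-output check, a failure tells you something broke somewhere; with per-stage checks, it tells you where.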

In that sense, using Claude Code feels a lot like doing RL on your own agent. Your job is to design a constrained environment where the agent operates, and continuously refine the reward model within it.


A Side Note

If you’ve worked with a manager or PI, you’ve probably had this thought: “Why are they so skeptical?” No matter what result you bring, the first reaction is often doubt. If performance looks surprisingly good, instead of “Great!”, you hear “Wait, is something wrong here?”

Interestingly, high-performers anticipate this. They come prepared:

They don’t just present the final result—they’ve already validated the entire process. More than that, they structure their code and experiments so they can quickly respond to follow-up questions on the spot. This isn’t just preparedness—it’s a mindset of taking ownership over reliability.

There’s a Korean proverb: “Even if you explain it badly, a smart person understands it perfectly.”3 Similarly, as I noted in a previous post, people no longer want an i-don’t-know bot that just repeats “I’m not sure!” We want models that go beyond expressing uncertainty well—that understand our intent and make reliable decisions even in unfamiliar situations.

Claude Code hasn’t quite reached the “high-performer” stage yet, so some degree of skepticism is still warranted. That said, auditing every line of generated code isn’t the right collaboration model either.


1 I actually use this approach when evaluating my own code as well. For example, if the offline CTR for a newly built recommendation model comes out suspiciously high, my first reaction isn’t celebration—it’s suspicion. Most likely, some forbidden feature or user history has leaked into training.

2 This phrase was shared with me by Ahmad at NeurIPS in December 2025.

3 Interestingly, English has the contrasting expression: “Ask a silly question, get a silly answer.”