Why Prompt Baking is the Only Known Method for Sample-Efficient On-the-Job Learning

Edited: October 20, 2025
In the rapidly evolving world of AI, one glaring limitation stands out: today's models are static artifacts, frozen at their training date. They excel at mimicking patterns from vast datasets but falter when it comes to learning on the job, like a new employee adapting to real-world nuances through experience. This blog post explores what true learning means, how humans do it effortlessly, why current AI approaches are doomed to fail, and why prompt baking emerges as the singular breakthrough for sample-efficient learning & distributional control over LLMs. Lastly, we'll outline a practical roadmap to actualize this learning in real-world use cases.
What It Means to Learn: Humans vs. AI, and the Bitter Lesson
At its core, learning is about understanding and adapting to the world through experience. As Rich Sutton, a pioneer of reinforcement learning (RL), emphasized in his recent interview with Dwarkesh Patel, intelligence isn't about imitating others but about interacting with the environment, pursuing goals, and updating based on outcomes (YouTube). Sutton contrasts his view with the current LLM paradigm: mimicking human outputs by massively scaling pre-training & post-training without true goals or ground truth. "There's no right thing to say" in LLMs, he argues, because they lack a substantive objective like rewards in RL. Humans, on the other hand, learn continually: we try actions, observe results, and refine our internal "world model" to predict and achieve better outcomes.
Humans learn on the job through a natural, iterative process. A new claims adjuster doesn't memorize a massive dataset; they observe a senior colleague (demonstration), ask clarifying questions to explore alternate scenarios, and practice with feedback. This process builds tacit knowledge (nuances like client preferences or edge cases) without forgetting core skills. Humans are sample-efficient: one correction often suffices because it integrates into our existing mental framework.
Existing AI approaches are doomed to fail at recreating this learning process. Scaling maximalists, inspired by Sutton's 2019 "Bitter Lesson" essay (Sutton), bet everything on massive compute and data to leverage Moore's Law, pouring billions into ever-larger models in the hope of brute-forcing intelligence without needing genuine breakthroughs. Sutton's original bitter lesson said that methods that scale with computation (like search and learning) outperform human-crafted knowledge over time. Yet here's the bitter irony Sutton himself highlights in his recent interview: massively scaled LLMs embody a perversion of this idea. Today, the maximalists scale LLMs with human knowledge (from internet text & expert-labelled data costing billions from Scale, Surge, etc.). Ilya Sutskever famously declared at NeurIPS '24 that "data is the fossil fuel of AI," implying internet-scale data is an unsustainable resource the maximalists are rapidly exhausting (Twitter). The result is a capability plateau, with LLMs as glorified imitators lacking true experiential learning and adaptive intelligence.
The scaling maximalists, clinging to Sutton's old words while ignoring his new critiques, assume endless investment & doubling down will eventually create a singleton model that can accomplish any job, at which point everyone will finally agree and say "this is Artificial General Intelligence (AGI)." They forget that their model depreciates every time the world changes (which is often!), or anything relevant to its job changes. Paradoxically, those betting the most on the widest, most sweeping changes to society put forth systems that are fundamentally static and unable to adapt, and they see no issue with it. The idol they call AGI is the protagonist of Memento, suffering from anterograde amnesia (the inability to form new long-term memories), woefully relying on tattoos and notes (prompts & context-based "memory") to keep track of his life.

Attempts at post-training via supervised fine-tuning (SFT) or RL reveal further dead ends: both require large-scale data and introduce catastrophic forgetting, where new tasks overwrite old knowledge by shifting the model's distribution arbitrarily far from its base (arXiv). RL is sample-inefficient, demanding many thousands of trials and carefully designed environments, and struggling with sparse rewards or long horizons. SFT distorts distributions, leading to forgetting: a model tuned on medical data, for instance, might lose general reasoning. Sutton critiques imitation-heavy paradigms (like LLMs) as unnatural; animals don't just "imitate" to learn. Instead, they experiment and adapt via trial-and-error, without schools, massive pre-training, or countless benchmarks & test/train splits. The real bitter lesson? Betting on scale past the midpoint of a sigmoid without further breakthroughs turns into an endless and disappointing money pit. True continual learning requires rethinking fundamentals, not just burning cash and crossing fingers that everything keeps improving.
Andrej Karpathy and Dwarkesh Patel both agree something's missing. Karpathy, reflecting on Sutton's views, notes the irony of LLMs being hailed as "bitter lesson pilled" while relying on finite human data (Twitter). Karpathy proposes paradigms like "system prompt learning" for more human-like improvement, where models reflect on failures & digest lessons through prompts which are eventually baked into weights (Twitter). He also says he is "bearish on reinforcement learning specifically" due to its inefficiency and its difference from how humans learn (Twitter). Patel echoes this sentiment in his recent essay, arguing LLMs lack organic adaptation: "The fundamental problem is that LLMs don’t get better over time the way a human would... There’s no way to give a model high level feedback. You’re stuck with the abilities you get out of the box." He predicts a "broadly deployed intelligence explosion" once continual learning is solved, but extends timelines to 2032 for human-like on-the-job learning, emphasizing that without it, even advanced models remain a 5/10 at practical tasks (Dwarkesh).
The evidence is laid bare in MIT's "GenAI Divide: State of AI in Business 2025" report, which reveals that 95% of generative AI pilots in enterprises are failing, not due to model quality but to a "learning gap" in which tools like LLMs fail to adapt to real workflows and stall revenue impact (Wired). The naivete of those betting on RL as a cure-all for real-world deployment challenges shines through here: RL assumes endless trials in simulated environments and is confined to verifiable domains, but real-world tasks across enterprise and startup ecosystems demand systems that learn from sparse, expert feedback in messy, dynamic settings. That's not to say RL won't boast near-term wins in challenging but verifiable domains like math and programming. After all, IMO Gold is no small feat (DeepMind). But in practice, RL's high sample demands make it a pipe dream for pilots where quick adaptation is key or where knowledge is tribal, hard to verify, or qualitative.
These failures highlight exactly what current methods lack: they're either practically infeasible on the job, horrendously sample-inefficient relative to the data available, or simply unable to learn & improve. Enter prompt baking.
Prompt Baking: A Fundamentally Better Approach
The idea of baking a prompt into an LLM's weights was first proposed by Askell et al. in 2021 under the name 'context distillation' (and later called 'prompt injection') (arXiv), studied further by Snell et al. (arXiv), and independently discovered by Bhargava et al. in Prompt Baking (arXiv). The technique bridges prompting and weight updates: mathematically, it minimizes the KL divergence between the prompted and unprompted model distributions, effectively embedding the prompt's behavior permanently. This self-distills prompted capabilities into durable weight updates in just 5 minutes. At the time Bread began, baking was a proof of concept for converting prompts into weight updates. Since then, we have invested in building a platform for distributed baking (enabling distillation to & from many models), infrastructure for serving inference (solving hot-swap issues with models learning at inference time), developer tooling, and a recipe book for structuring multiple sequential or parallel bakes. Altogether, these advances offer a concrete pathway to learning scalably from feedback & corrections at human speed in the real world.
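In schematic form, the objective looks like the following (the notation here is ours, introduced for illustration; see the papers above for the precise formulations):

```latex
% Schematic baking objective (notation introduced here for illustration):
%   \theta_0 : original weights        u : the prompt being baked
%   x : queries drawn from a distribution D        y : responses
\theta_{\text{baked}}
  \;=\; \arg\min_{\theta}\;
  \mathbb{E}_{x \sim \mathcal{D}}
  \Big[\, D_{\mathrm{KL}}\!\big(\, P_{\theta_0}(y \mid u, x) \;\big\|\; P_{\theta}(y \mid x) \,\big) \,\Big]
```

In words: find new weights whose unprompted behavior on queries matches the original weights' prompted behavior.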
Why is baking fundamentally better than RL for continual learning? First, sample efficiency: baking requires only one prompt per feedback or correction, mirroring human learning, where a single expert note ("Flag pre-existing conditions") sticks. RL demands thousands of examples; baking amplifies a single demonstration into weight-level knowledge.
Second, scalability: corrections stack composably. In internal tests, the Bread team has baked 1,000+ technical documents into models without losing base abilities, showing that baking overcomes context window limits and enables diverse knowledge integration, even outperforming RAG on particular evaluations. Bread has baked chain-of-thought prompts, brand personas, and sequentially added news headlines, eliciting generalization, synthesis, and robust recall even on indirect queries.
Third, baking naturally mitigates catastrophic forgetting. By matching distributions the model is already capable of producing, it preserves prior knowledge rather than straying "off the beaten path." SFT suffers comparatively, diverging wildly and erasing capabilities. A recent paper out of MIT shows that on-policy RL forgets less than SFT because it implicitly minimizes the KL shift away from the base distribution (arXiv); baking takes this even further, since its explicit objective is to minimize the KL divergence to the base model's (prompted) distribution. And unlike RL, baking requires no rewards or answer validation, and operates on any prompt, embedding expertise without disrupting the model's core.
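To make the contrast concrete, here is a minimal PyTorch sketch of the two loss functions, assuming a HuggingFace-style causal LM. The model name, the example prompt, and the single-query setup are illustrative only; in practice the KL term is averaged over many queries and continuations, and this is not Bread's implementation.

```python
# Minimal, illustrative sketch (not Bread's implementation): the baking loss
# matches the student's UNPROMPTED distribution to the teacher's PROMPTED one,
# while SFT trains on hard labels with no tie to the base distribution.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(model_name).eval()  # frozen base weights
student = AutoModelForCausalLM.from_pretrained(model_name)         # weights being baked

prompt = "Always flag pre-existing conditions before approving a claim.\n"
query = "Customer reports prior knee surgery. Next step?"

with torch.no_grad():
    # Teacher sees prompt + query; keep only the logits at the query positions.
    # (Assumes a clean tokenization boundary at the prompt/query split.)
    t_ids = tok(prompt + query, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    teacher_logits = teacher(t_ids).logits[:, n_prompt:, :]

s_ids = tok(query, return_tensors="pt").input_ids
student_logits = student(s_ids).logits  # same positions, but with no prompt in context

# Baking objective: KL(prompted teacher || unprompted student) at each position.
baking_loss = F.kl_div(
    F.log_softmax(student_logits, dim=-1),
    F.log_softmax(teacher_logits, dim=-1),
    log_target=True,
    reduction="batchmean",
)

# SFT objective, for contrast: cross-entropy against hard next-token labels,
# with nothing anchoring the model to its original distribution.
sft_loss = F.cross_entropy(
    student_logits[:, :-1, :].reshape(-1, student_logits.size(-1)),
    s_ids[:, 1:].reshape(-1),
)

baking_loss.backward()  # an optimizer.step() on the student would follow
print(f"baking KL: {baking_loss.item():.4f}   sft CE: {sft_loss.item():.4f}")
```

Because the target is the base model's own (prompted) distribution, the gradient pushes the student only as far as needed to reproduce behavior it could already express in context, which is the formal sense in which baking stays "on the beaten path."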
Actualizing On-the-Job Learning from Experts
Bread's vision transforms AI into adaptive agents that learn like human trainees, following a 3-step process that mirrors expert-led onboarding. This roadmap leverages prompt baking to embed real-time expertise, enabling sample-efficient continual learning in practical scenarios like claims adjustment or coding.
1. Demonstration: An expert performs a task (e.g., processing a claim). The transcript(s) or episode(s) are translated into multi-turn tool calls and responses, which the AI observes via prompts capturing the workflow. Bake these prompts into weights, converting the demonstration into permanent behavior. No massive datasets are needed; natural expert walkthroughs suffice to bake LLMs that can learn like employees.
2. Questions: Probe nuances ("How do I handle angry customers?"). Expert responses color in counterfactuals & alternate branches of the action tree. Questions build judgment & robustness across a variety of circumstances and edge cases, refining behavior without forgetting step 1.
3. Practice with Feedback: The AI attempts the task and experts correct errors. Each correction is a targeted prompt, baked to update only the relevant pathways. This surgical precision avoids RL's inefficiency and SFT's forgetting, yielding permanent improvements from a single piece of feedback at human-level sample efficiency. (The full loop is sketched in code below.)
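Putting the three steps together, here is a hypothetical sketch of the onboarding loop. The `bake()` helper, the `expert` interface, and the prompt templates below are illustrative stand-ins rather than Bread's actual API; `bake()` stands for the KL-matching update sketched earlier.

```python
# Hypothetical onboarding loop (illustrative stand-ins, not Bread's API).

def bake(model, prompt: str):
    """Placeholder: distill `prompt` into the model's weights by minimizing
    the KL divergence between the prompted and unprompted distributions."""
    # ... gradient steps on the baking objective would go here ...
    return model

def onboard(model, expert):
    # 1. Demonstration: bake the expert's walkthroughs as observed behavior.
    for transcript in expert.demonstrations():  # multi-turn tool calls + responses
        model = bake(model, f"Follow this workflow:\n{transcript}")

    # 2. Questions: bake answers to counterfactual and edge-case probes.
    for question in ["How do I handle angry customers?",
                     "What if the claim is missing documentation?"]:
        answer = expert.ask(question)
        model = bake(model, f"Q: {question}\nA: {answer}")

    # 3. Practice with feedback: attempt tasks, bake each targeted correction.
    for task in expert.practice_tasks():
        attempt = model.generate(task)
        correction = expert.review(task, attempt)  # e.g. "Flag pre-existing conditions"
        if correction:
            model = bake(model, f"When handling:\n{task}\nApply this correction: {correction}")

    return model
```

Each call to `bake()` is a small, targeted update, so corrections accumulate composably across all three stages without retraining from scratch.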
The result? An AI that embodies organizational expertise, adapting to changes via quick bakes. This converts in-context learning to weights, overcoming LLMs' static nature.
Prompt baking is the only known method for achieving this iterative 3-step learning process because it uniquely enables (short-term, semantic) in-context learning to trigger (long-term, synaptic) weight updates in a surgical, composable fashion from single examples or explanations. As Sutton's bitter lesson bites again, baking positions AI for true experiential learning, unlocking adaptive, on-the-job intelligence. Schedule a demo to see it in action.