AI Engineering

Introduction

Nova: Picture this. It's 2025. A software engineer spends a weekend building an AI chatbot demo using GPT-4. It works beautifully. They're thrilled. Then they spend the next four months trying to get it from 80 percent good to 95 percent good. Sound familiar? That gap — between a cool demo and a production-ready AI application — is exactly what Chip Huyen's book "AI Engineering" is all about. And today, we're diving deep into why this book has become the most-read title on the entire O'Reilly platform since its release.

Nova: That's the central question Huyen tackles across 500-plus pages. She draws on her experience at NVIDIA, where she was a core developer on their GenAI framework NeMo, plus her time at Snorkel AI, Netflix, and teaching at Stanford. And her core argument is this: we've entered a new era where the question isn't "how do we build models" anymore — it's "how do we build products that use models." That shift changes everything.

Nova: Not at all. Reviewers consistently describe it as a blueprint, a handbook, a definitive guide. One reviewer called it "codifying the craft." Huyen herself says the more overwhelming a space is, the more important it is to have a framework to navigate it. And that's what she delivers: a ten-chapter framework covering everything from understanding foundation models to prompt engineering, RAG, fine-tuning, dataset engineering, inference optimization, and production architecture.

Nova: Let's start with the most fundamental shift she describes — the birth of AI engineering as its own discipline, distinct from traditional machine learning engineering.

Why AI Engineering Is Not Just ML Engineering 2.0

The New Discipline

Nova: Huyen opens the book by tracing a fascinating evolution. We went from simple language models to large language models thanks to a training approach called self-supervision — where models learn from vast amounts of unlabeled internet text without expensive human annotation. Then these models incorporated images, audio, and other data types to become what we now call foundation models. And then something crucial happened: companies like OpenAI and Google started offering these models as a service through APIs.

Nova: Exactly. And that democratization created an entirely new role. Huyen draws a sharp distinction: traditional ML engineering was model-centric. Engineers spent most of their time on feature engineering, model training, and building models from scratch. AI engineering, by contrast, is application-centric. It's about model adaptation — using techniques like prompt engineering, RAG, and fine-tuning to adapt powerful existing models to specific problems.

Nova: That's exactly what Huyen argues. She says the workflow is faster and more iterative. AI engineers often build a product first using a model API, get user feedback, and only then invest in custom data or fine-tuning. This makes the field much more accessible to developers from a web or full-stack background. One reviewer noted that Huyen even argues the pivot from frontend to AI is easier than most people think — no heavy math required.

Nova: That's one of the book's sharpest insights. Huyen says there are three possible moats: technology, distribution, and data. Technology is commoditized because everyone has access to the same foundation models. Distribution is usually owned by big companies. So for startups, the real competitive advantage is the data flywheel — getting to market first, gathering proprietary user data, and using that to continuously improve the product.

Nova: Great question. Huyen introduces Microsoft's Crawl-Walk-Run framework. Crawl means human involvement is mandatory — AI provides suggestions but a human makes the final call. Walk means AI can interact directly with internal employees. Run means AI interacts directly with external customers. The key is being intentional about which stage you're at and not jumping to full automation before you're ready.

Nova: And that pragmatism is what sets this book apart. Huyen is neither a hype merchant nor a doomer. She's an engineer who wants to help you ship reliable products. Which brings us to what she calls the single most important — and most difficult — part of AI engineering: evaluation.

Why Testing AI Is the Hardest Problem in the Field

The Evaluation Obsession

Nova: Huyen dedicates two full chapters to evaluation, and she says it's one of the hardest but most important topics she's ever written about. Her argument is blunt: not having a reliable evaluation pipeline is one of the biggest blockers to AI adoption.

Nova: Several reasons. First, foundation models produce open-ended outputs. There's no single correct answer for tasks like summarizing a document or writing an email. Second, most proprietary models are black boxes — you don't know their training data or architecture. Third, the benchmarks are getting saturated. Models are improving so fast they're achieving near-perfect scores on existing tests, so you constantly need new, harder ones.

Nova: Huyen lays out a whole taxonomy. There's functional correctness — like unit tests for code generation, where you can actually execute the generated code and check if it works. There are similarity measurements against reference data, using metrics like BLEU and ROUGE for lexical similarity, or embedding-based semantic similarity. And then there's the controversial but increasingly common approach: AI as a judge.
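
To make the first two categories concrete, here is a minimal sketch, not taken from the book. The helper names `passes_tests` and `unigram_f1` are hypothetical, and the word-overlap score is a simplified stand-in for metrics like ROUGE rather than an implementation of them.

```python
from collections import Counter

def passes_tests(generated_code, func_name, cases):
    """Functional correctness: run the generated code and check each test case."""
    namespace = {}
    try:
        # exec is only acceptable here for trusted or sandboxed code.
        exec(generated_code, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False

def unigram_f1(candidate, reference):
    """Lexical similarity: F1 over shared words (simplified, ROUGE-1-style)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # shared words, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

generated = "def add(a, b):\n    return a + b\n"
print(passes_tests(generated, "add", [((1, 2), 3), ((0, 0), 0)]))  # True
print(round(unigram_f1("the cat sat on the mat", "the cat is on the mat"), 2))  # 0.83
```

The first check gives an exact yes-or-no answer, while the second only measures word overlap with a reference, which is why similarity scores need a good reference set to be meaningful.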

Nova: It has real limitations, and Huyen is upfront about them. AI judges are inconsistent because they're probabilistic, so the same input can receive different scores on different runs. They exhibit self-bias: in one study she cites, GPT-4 favored its own outputs with a 10 percent higher win rate. They have position bias, often favoring whichever answer they see first. And they have verbosity bias, preferring longer answers even when they contain errors.

Nova: Exactly. Huyen says they should always be supplemented with exact evaluation, human evaluation, or both. But she also introduces a fascinating alternative: comparative evaluation. Instead of asking "how good is this response on a scale of one to ten," you ask "which of these two responses is better?" It's the same principle behind chess rankings and the popular LMSYS Chatbot Arena. Humans and AI judges both find comparative judgments much easier and more reliable.
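
The chess-ranking principle can be sketched with an Elo-style update. This is an illustration under assumptions: the starting rating of 1000 and the K-factor of 32 are arbitrary choices, and real leaderboards may use a different rating algorithm.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Return both ratings after one pairwise comparison (k is an assumed K-factor)."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    # B's adjustment mirrors A's, so the total rating is conserved.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Two models start level; a judge prefers model A's response once.
print(update_elo(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

Each "which response is better?" vote nudges the two ratings, and an upset against a higher-rated model moves them more than an expected win does.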

Nova: That's exactly the analogy. And Huyen introduces a concept she calls Evaluation-Driven Development — inspired by test-driven development in software engineering. Define how you'll evaluate before you start building. Break your criteria into buckets: domain-specific capability, generation capability, instruction-following, cost, and latency. Create detailed scoring rubrics with concrete examples. Validate those rubrics with humans.

Nova: Precisely. And she emphasizes that the goal isn't 100 percent coverage — that would lead to overfitting. It's about having a systematic way to detect failures and benchmark progress. Without it, you're just doing what practitioners call "vibe checks" — and vibe checks don't scale.

Prompt Engineering, RAG, and the Rise of AI Agents

The Adaptation Toolkit

Nova: Once you have your evaluation framework, the next question is: how do you actually get these models to do what you want? Huyen walks through a progression. You start with prompt engineering, then move to retrieval-augmented generation, then agents, and only then consider fine-tuning.

Nova: She describes it as human-AI communication. Anyone can communicate, but not everyone can communicate well. Prompt engineering is easy to get started with, which misleads many into thinking it's easy to do well. She covers the anatomy of a prompt, why in-context learning works, and best practices like providing clear instructions with examples. Simple tricks like asking the model to "think step by step" can yield surprising improvements.

Nova: That's where RAG, or retrieval-augmented generation, comes in. It's a two-step process: first retrieve relevant information from external memory, then use that information to generate more accurate responses. Huyen emphasizes that RAG was originally developed to overcome context window limitations, but it remains necessary even as context windows grow, because the amount of available data tends to grow faster than context windows do.

Nova: That's a great way to put it. And the quality of your RAG system depends heavily on your retriever. Huyen contrasts two approaches: term-based retrieval, like BM25, which is fast and cheap and provides strong baselines, and embedding-based retrieval, which uses vector search for semantic understanding but is more expensive. She recommends hybrid search — using term-based for initial candidate fetching and embedding-based for re-ranking.
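
A minimal sketch of that two-stage hybrid search follows. Everything in it is an assumption for illustration: the term counter is a crude stand-in for BM25, the two-dimensional vectors are hand-made rather than produced by an embedding model, and `hybrid_search` is a hypothetical name.

```python
import math

def term_score(query, doc):
    """Count query terms present in the document (a crude stand-in for BM25)."""
    doc_terms = set(doc.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hybrid_search(query, query_vec, docs, doc_vecs, n_candidates=3, top_k=2):
    # Stage 1: cheap term-based retrieval selects a candidate set.
    ranked = sorted(range(len(docs)), key=lambda i: term_score(query, docs[i]), reverse=True)
    candidates = ranked[:n_candidates]
    # Stage 2: embedding similarity re-ranks only those candidates.
    reranked = sorted(candidates, key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return [docs[i] for i in reranked[:top_k]]

docs = [
    "my order has not arrived",
    "how to recover a lost password",
    "shipping times for orders",
    "forgot to cancel my subscription",
]
# Toy vectors: first axis is roughly "account access", second is "orders and billing".
doc_vecs = [[0.1, 0.9], [0.95, 0.05], [0.0, 1.0], [0.2, 0.8]]
results = hybrid_search("forgot my password", [1.0, 0.0], docs, doc_vecs)
print(results)
```

In this toy data the subscription document wins on shared words, but the re-ranking step moves the password-recovery document to the top, which is the kind of correction the embedding stage is there to make.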

Nova: Yes. Huyen describes RAG as a special case of an agent where the retriever is the tool. But agents can do much more. An AI agent has an environment, tools it can access, and a planning capability. It can decompose complex tasks into steps, execute them, reflect on the results, and correct errors. She discusses the ReAct pattern, short for reason plus act, in which the model alternates between reasoning, taking an action, and observing the result, and she pairs it with reflection so the agent can catch and correct its own mistakes in multi-step loops.

Nova: That's one of her key warnings. Each step in an agent's plan has a failure risk, and errors compound across multi-step plans. The more tools you give a model, the more capable it becomes, but also the more catastrophic its failures can be. She stresses that rigorous defensive mechanisms and human-in-the-loop oversight are critical. Tool use also exposes agents to all the prompt injection and security risks she covers in the prompt engineering chapter.
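
The compounding effect is easy to see with a little arithmetic. This sketch assumes every step succeeds independently with the same probability, which is a simplification; the 95 percent figure is only an example value.

```python
def plan_success_probability(step_accuracy, num_steps):
    """Chance that every step succeeds, assuming independent steps."""
    return step_accuracy ** num_steps

for steps in (1, 10, 100):
    probability = plan_success_probability(0.95, steps)
    print(f"{steps:>3} steps: {probability:.3f}")
```

At 95 percent accuracy per step, a 10-step plan finishes cleanly only about 60 percent of the time, and a 100-step plan less than 1 percent of the time.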

Nova: Huyen breaks memory into three types. Internal knowledge is what's baked into the model weights. Short-term memory is the context window — the current conversation. Long-term memory is external data accessed through retrieval — databases, files, previous conversations. She discusses strategies like FIFO, redundancy removal through summarization, and reflection-based updates. Memory is what enables personalization and consistency across sessions.
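
A minimal sketch of the FIFO strategy, assuming a hypothetical `ConversationMemory` class and a budget counted in turns rather than tokens. Evicted turns are moved to a plain list standing in for long-term storage; a real system might summarize them or index them for retrieval instead.

```python
from collections import deque

class ConversationMemory:
    """Keeps the newest turns in context and moves the oldest to long-term storage."""

    def __init__(self, max_turns=3):
        self.max_turns = max_turns
        self.short_term = deque()  # what would be sent in the context window
        self.long_term = []        # stand-in for a database or retrieval index

    def add(self, turn):
        self.short_term.append(turn)
        # First in, first out: evict the oldest turns once over budget.
        while len(self.short_term) > self.max_turns:
            self.long_term.append(self.short_term.popleft())

    def context(self):
        return list(self.short_term)

memory = ConversationMemory(max_turns=3)
for i in range(1, 6):
    memory.add(f"turn {i}")
print(memory.context())   # ['turn 3', 'turn 4', 'turn 5']
print(memory.long_term)   # ['turn 1', 'turn 2']
```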

Fine-tuning, Dataset Engineering, and the Art of Teaching Models

When Prompts Aren't Enough

Nova: Huyen says fine-tuning is the chapter that was hardest to write, and I can see why. It touches on everything from transfer learning to low-rank factorization to model merging. But her core message is counterintuitive: fine-tuning should be your last resort, not your first move.

Nova: Most AI applications can achieve great results using prompt engineering and RAG alone. Fine-tuning is for when those approaches hit their limits. Huyen walks through specific reasons to fine-tune — like teaching a model a new skill, improving consistency on a narrow task, or reducing latency and cost by using a smaller fine-tuned model instead of a giant general-purpose one. But she's equally clear about reasons not to fine-tune: it requires high-quality data, it's computationally expensive, and it creates a model you now have to maintain and update.

Nova: Exactly. And when you do decide to fine-tune, Huyen dives deep into LoRA, or Low-Rank Adaptation, which has become the go-to technique. LoRA is parameter-efficient: the original weights stay frozen, and training updates only a small set of added low-rank matrices instead of the entire model. It's also modular, so you can have multiple LoRA adapters for different tasks and swap them in and out. This makes serving much more practical.
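
To get a feel for what parameter-efficient means, here is a back-of-the-envelope sketch. It assumes a single weight matrix of size 4096 by 4096 and a LoRA rank of 8, both example values rather than figures from the book; LoRA adds two small matrices of shape d_out by r and r by d_in alongside the frozen weight.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for full fine-tuning vs. a LoRA adapter on one matrix."""
    full = d_out * d_in            # every entry of the original weight matrix
    lora = rank * (d_out + d_in)   # the two low-rank factors: d_out x r and r x d_in
    return full, lora

full, lora = lora_param_counts(4096, 4096, 8)
print(f"full fine-tuning: {full:,} trainable parameters")  # 16,777,216
print(f"LoRA adapter:     {lora:,} trainable parameters")  # 65,536
print(f"ratio:            {lora / full:.2%}")              # 0.39%
```

Because each adapter is this small, keeping several of them around for different tasks costs little compared with storing several full copies of the model.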

Nova: That's chapter eight — dataset engineering — and Huyen says it's where the real creativity happens. The principles are straightforward: quality, coverage, and quantity. But executing on them is incredibly hard. A small amount of high-quality data can outperform a large amount of noisy data. Increasing diversity is often the key to improving performance. And because acquiring high-quality human-annotated data is so expensive, many teams are turning to synthetic data — using AI to generate training data for AI.

Nova: It's a real concern. Huyen notes that AI models trained on AI-generated content can degrade in performance over time. She emphasizes that synthetic data must be evaluated just as rigorously as real data. And she makes a point that really stuck with me: you can automate data generation, but you can't automate thinking through what data you want. You can't automate annotation guidelines. You can't automate paying attention to details.

Nova: That's the paradox. The most technical part of AI engineering — creating training data — is also the most human. It requires judgment, creativity, and meticulous attention to what behaviors you want your model to learn and what edge cases you need to cover.

Inference Optimization, Architecture, and the Feedback Loop

From Model to Product

Nova: The final two chapters of the book tackle what happens after you've adapted your model: how do you serve it efficiently, and how do you build a complete application around it?

Nova: Huyen acknowledges that most application developers will use model APIs with built-in optimization rather than implementing these techniques themselves. But she argues that understanding what's possible helps you evaluate the efficiency of different API providers. She covers model-level techniques like quantization — reducing the number of bits needed to represent each value — and distillation, as well as inference-service-level techniques like batching, parallelism, and prompt caching.
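
The idea behind quantization can be sketched in a few lines. This is a simplified symmetric 8-bit scheme with one scale for the whole list, chosen for illustration; production libraries use more elaborate schemes, and the function names here are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (non-empty input)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.31, -1.0, 0.07, 0.0, 0.85]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized)
print(f"largest rounding error: {max_error:.5f}")
```

Each value now fits in 8 bits instead of 16 or 32, at the cost of a small rounding error bounded by about half the scale.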

Nova: She highlights quantization as the most broadly impactful, along with tensor parallelism for reducing latency, replica parallelism for handling more requests, and attention mechanism optimization for accelerating transformer models. But the choice depends on your workload. KV caching matters more for long contexts. Prompt caching is crucial for multi-turn conversations. There's always a tradeoff between latency and cost.

Nova: Chapter ten presents a common AI application architecture — the model gateway, the inference service, guardrails, monitoring, and the feedback loop. Huyen emphasizes that each additional component makes your system more capable but also increases complexity and introduces new failure modes. Observability becomes critical: understanding how your system fails, designing metrics around those failures, and ensuring failures are detectable and traceable.

Nova: It's one of her most interesting points. Traditionally, user feedback design has been seen as a product responsibility, not an engineering one. But Huyen argues that because user feedback is a crucial source of data for continuously improving AI models, AI engineers need to be involved in designing how feedback is collected. This reinforces her thesis from chapter one: compared to traditional ML engineering, AI engineering is moving closer to product.

Nova: That's exactly the profile Huyen paints. And she closes with a reflection on the incredible collective energy in this space — the constant stream of new techniques, discoveries, and engineering feats. Her book doesn't try to capture every latest trend. Instead, it provides a framework for navigating the chaos. As she puts it: the more overwhelming a space is, the more important it is to have a framework.

Conclusion

Nova: So let's bring it all together. Chip Huyen's "AI Engineering" makes a compelling case that we're living through a fundamental shift in how software gets built. The availability of foundation models through APIs has created a new discipline — one that's less about training models from scratch and more about adapting, evaluating, and productizing them.

Nova: That's it exactly. The book's most quoted line captures this perfectly: "We are no longer asking how to build models; we're asking how to build products that use models." It's a shift from research-first to product-first thinking. And Huyen provides the blueprint.

Nova: Read it with a notebook. As one reviewer put it, this isn't a book you read once and shelve. The frameworks — for evaluation, for choosing between prompt engineering and fine-tuning, for designing datasets, for architecting production systems — are reference material you'll come back to. And pair it with hands-on practice. Build a small project using a foundation model API and apply the book's recommendations for evaluation, safety, and iteration.

Nova: That's exactly who it's for. Whether you're a backend engineer looking to transition into AI, an ML engineer wanting to get better at productionization, a tech lead guiding an AI team, or even a technical product manager — there's something here for you. The field is moving incredibly fast, but the engineering discipline Huyen codifies will outlast any individual model release.

Nova: Thanks for the great questions, Aster. And to our listeners: if you're building with AI, or thinking about it, this book deserves a spot on your desk — not just your shelf.
