
LLM fundamentals every PM needs

Most PM training skipped language models. Here are the four ideas that come up in every AI product review: tokens, context windows, temperature, and RAG vs fine-tuning.

Most PM training did not cover language models. You learned discovery, roadmaps, metrics, and stakeholder management. Now your CEO wants an AI feature shipped by next quarter. You need a working mental model of how these systems behave in production.

This post covers four ideas that come up in every AI product review: tokens, context windows, temperature, and the RAG vs fine-tuning question. Each one shapes cost, speed, and quality on every AI feature.

Tokens are the unit of cost and speed

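A token is a chunk of text the model reads or produces as output. With most tokenizers, a short common word like "hello" is one token, while a long rare word like "antidisestablishmentarianism" is split into several tokens. As a rough rule for English, 1,000 tokens is about 750 words. The exact count depends on the model's tokenizer and on the language.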
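The rule of thumb above is enough for a first estimate in a spec. Here is a minimal sketch of it in Python. The function name and the constant are illustrative assumptions, and a real count should come from the tokenizer of the model you actually call.

```python
# Rule of thumb from the text: 1,000 tokens is about 750 English words.
# This is an approximation; real tokenizers vary by model and by language.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count of English text from its word count."""
    word_count = len(text.split())
    return round(word_count / WORDS_PER_TOKEN)


# A 750-word document comes out at roughly 1,000 tokens.
sample = "word " * 750
print(estimate_tokens(sample))  # 1000
```

Use an estimate like this for budgeting conversations, then replace it with measured token counts once your engineers have real traffic.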

You care about tokens for a few reasons. Every API call charges by tokens in and tokens out. Longer prompts and longer responses run slower because the model processes every input token and generates output one token at a time. Your engineers will report usage in tokens and speed in tokens per second.

When a stakeholder asks why an AI feature costs more than expected, the answer is almost always input tokens. A long system prompt or a stuffed context will inflate every call. Lenny Rachitsky has written about how AI startups burn margin on inference costs (Rachitsky). Token discipline is a margin lever.
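To see how a long system prompt drives cost, it helps to do the arithmetic once. The sketch below assumes hypothetical prices of $3 per million input tokens and $15 per million output tokens, and a hypothetical call with 6,000 input tokens and 300 output tokens. None of these numbers come from a real price list, so substitute your provider's rates.

```python
def cost_per_call(input_tokens, output_tokens, input_price, output_price):
    """Cost in dollars for one call. Prices are dollars per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical figures for illustration only.
INPUT_PRICE = 3    # dollars per million input tokens (assumed)
OUTPUT_PRICE = 15  # dollars per million output tokens (assumed)
input_tokens = 6_000   # e.g. a long system prompt plus retrieved context
output_tokens = 300
calls_per_month = 100_000

per_call = cost_per_call(input_tokens, output_tokens, INPUT_PRICE, OUTPUT_PRICE)
input_share = (input_tokens * INPUT_PRICE) / (per_call * 1_000_000)

print(f"${per_call:.4f} per call")                        # $0.0225 per call
print(f"${per_call * calls_per_month:,.0f} per month")    # $2,250 per month
print(f"{input_share:.0%} of the cost is input tokens")   # 80% of the cost is input tokens
```

Under these assumptions, input tokens account for most of the bill even though each output token costs five times as much. Trimming the prompt is the first place to look.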

Context windows are working memory

A context window is the maximum number of tokens a model can handle in a single call. For most models that limit covers the prompt and the generated output together. Frontier models in 2026 ship with windows in the hundreds of thousands of tokens. Older models capped out at 4,000 or 8,000.

The trap is assuming a bigger window means better recall. It does not. Models can miss information buried in the middle of a long context. Researchers call this the "lost in the middle" problem.

What this means for PMs: a 200,000 token window is not a free pass to dump entire document libraries into every prompt. You still need to retrieve and rank the most relevant content. The window is a ceiling. It is not a strategy.

When you scope an AI feature, ask your engineer two questions. How much input does each call need? How will we rank that input by relevance?
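Those two questions map to a simple pattern: score each candidate chunk for relevance, then fill a token budget from the top down. The sketch below is one greedy way to do it. The `Chunk` class, the relevance scores, and the budget are all hypothetical, and a real system would get scores from a search index or an embedding model.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance: float  # hypothetical retriever score, higher is better
    tokens: int       # token count of this chunk


def select_chunks(chunks, token_budget):
    """Greedily keep the most relevant chunks that fit within the budget."""
    selected = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        if used + chunk.tokens <= token_budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected


candidates = [
    Chunk("pricing page", relevance=0.9, tokens=400),
    Chunk("full API reference", relevance=0.8, tokens=700),
    Chunk("changelog entry", relevance=0.5, tokens=300),
]

# With an 800-token budget, the large API reference does not fit
# after the pricing page, so the smaller changelog entry is used instead.
picked = select_chunks(candidates, token_budget=800)
print([c.text for c in picked])  # ['pricing page', 'changelog entry']
```

The budget here is a product decision, not a model limit. It can sit far below the context window.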

Temperature controls randomness

Temperature is a number that adjusts how the model picks each next token. The allowed range depends on the provider, and is commonly 0 to 1 or 0 to 2. Low temperature, near 0, makes the model pick the most probable token almost every time. Output is consistent, though not always perfectly identical from run to run. High temperature, around 1 or above, makes the model pick less likely tokens more often. Output is varied.
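A small numeric sketch makes the effect concrete. It uses the standard approach of dividing the model's raw scores (logits) by the temperature before turning them into probabilities with a softmax. The three logits are made-up values, and treating temperature 0 as "always pick the top token" is a common convention assumed here, not a statement about any particular API.

```python
import math


def token_probabilities(logits, temperature):
    """Turn raw scores into next-token probabilities at a given temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: temperature 0 means always take the top token.
        best = logits.index(max(logits))
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up scores for three candidate tokens.
logits = [2.0, 1.0, 0.5]
for t in (0, 0.2, 1.0, 2.0):
    probs = token_probabilities(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, probability moves from the top token toward the alternatives. That is the variety users see as different answers to the same question.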

For most product features, you want low temperature. Customer support replies, data extraction, code generation, and structured outputs all benefit from consistency. Brainstorming tools and creative writing features are the rare cases where higher temperature gives users variety.

Marty Cagan often warns PMs against shipping features that work in demos but fail in production (Cagan). Temperature is one of the quiet causes of that gap. A demo at temperature 0.7 looks like magic. The same setting in production produces a support bot that gives different answers to the same question on Monday and Tuesday.

Set temperature deliberately. Document the choice in your spec.

RAG vs fine-tuning

This is the question every AI PM gets within their first month: should we use RAG or fine-tuning?

Retrieval-augmented generation, or RAG, fetches relevant documents at query time and feeds them into the prompt. The model reads the documents and answers based on what it just saw. You update behavior by updating the document store.
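Stripped to its core, the flow is: score documents against the question, keep the best ones, and paste them into the prompt. The sketch below uses simple word overlap as a stand-in for a real retriever, which would normally use a search index or embeddings. The documents, function names, and prompt wording are all hypothetical, and the call to the model itself is left out.

```python
import re


def words(text):
    """Lowercase a string and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, documents, top_k=2):
    """Rank documents by word overlap with the query (toy retriever)."""
    query_words = words(query)
    ranked = sorted(documents, key=lambda d: len(query_words & words(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, sources):
    """Put the retrieved sources in front of the question."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, start=1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {query}"
    )


# Hypothetical document store.
docs = [
    "The Pro plan costs 49 dollars per month.",
    "Support hours are 9 to 5 on weekdays.",
    "The Free plan includes 3 projects.",
]

question = "How much does the Pro plan cost per month?"
print(build_prompt(question, retrieve(question, docs, top_k=1)))
```

Changing the price in `docs` changes the answer on the next call, with no retraining. The numbered sources are also what makes citations possible.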

Fine-tuning changes the model weights. You train on examples of the input and output you want. The model learns the pattern and applies it on every call without needing the examples in context.
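In practice, "examples of the input and output you want" usually means a file of paired records. The sketch below writes two made-up support-ticket examples as JSON Lines, one record per line. The field names `input` and `output` are placeholders, since each provider defines its own training file schema.

```python
import json

# Made-up training pairs: raw ticket text in, structured label out.
# Field names are placeholders; check your provider's required schema.
examples = [
    {
        "input": "My invoice shows two charges for March.",
        "output": '{"category": "billing", "priority": "high"}',
    },
    {
        "input": "How do I export my project as a PDF?",
        "output": '{"category": "how-to", "priority": "low"}',
    },
]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


print(to_jsonl(examples))
```

A real training set needs many more examples than this, and they need to stay valid over time. That is why fine-tuning suits narrow, stable tasks.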

When to use RAG:

  • Your knowledge changes often, like product docs, support tickets, or pricing pages
  • You need citations back to source documents
  • You want users to see what the model based its answer on
  • You have a small team and limited ML infrastructure

When to use fine-tuning:

  • You need a specific output format on every call
  • Your task is narrow and the examples are stable
  • You want lower per-call cost on high-volume features
  • Latency matters and you cannot afford a retrieval step

Many production AI features at startups begin with RAG. Jeff Gothelf and Josh Seiden argue that teams should ship the simplest version that learns something real before investing in heavier infrastructure (Gothelf and Seiden). Fine-tuning is heavier infrastructure. A basic RAG prototype can ship in a week. A fine-tuning project, with its training data and evaluation work, can take a quarter.

You can also combine the two. Fine-tune the model for tone and format, then use RAG to inject fresh facts. This pattern is common in mature AI products.

What this means for your next spec

Four numbers belong in every AI feature spec: average input tokens per call, average output tokens per call, target latency, and temperature. If your engineer cannot answer the first three, you do not have a spec yet. You have a hope.
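One way to make that check mechanical is to write the four numbers down as a structured record and list whatever is still blank. The sketch below is a hypothetical template, not a standard. The class and field names are assumptions chosen to match the four numbers above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AIFeatureSpec:
    """The four numbers from the text; None means 'not answered yet'."""
    avg_input_tokens: Optional[int] = None
    avg_output_tokens: Optional[int] = None
    target_latency_ms: Optional[int] = None
    temperature: Optional[float] = None

    def missing(self):
        """Return the names of the numbers nobody has filled in yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# A draft where only the temperature has been decided.
draft = AIFeatureSpec(temperature=0.2)
print(draft.missing())
# ['avg_input_tokens', 'avg_output_tokens', 'target_latency_ms']
```

An empty list from `missing()` does not mean the numbers are right. It only means someone has committed to them in writing.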

Talk to your engineer about cost per query at expected volume. Decide between RAG and fine-tuning based on how often your data changes and how strict your output format needs to be. Pick a temperature and write down the reason.

PMs who ship strong AI products in 2026 treat tokens, windows, temperature, and retrieval as product decisions. These four ideas will come up in every review for the rest of your career.

Works cited

Cagan, Marty. Inspired: How to Create Tech Products Customers Love. 2nd ed., Wiley, 2017.

Gothelf, Jeff, and Josh Seiden. Lean UX: Designing Great Products with Agile Teams. 3rd ed., O'Reilly, 2021.

Rachitsky, Lenny. "How AI Is Changing Product Management." Lenny's Newsletter, lennysnewsletter.com.
