1.1 What the model is actually doing
The short version
An LLM is a machine that predicts the next token in a sequence of tokens. Given your prompt (plus any system/developer instructions and other context), it generates a probability distribution over possible next tokens, picks one, appends it, and repeats until it stops.
That’s why it can feel like a fluent, confident partner: fluency is exactly what the training objective rewards. Correctness is something you earn through constraints and verification.
LLMs don’t “retrieve truth” by default. They produce plausible continuations shaped by patterns in training data and the instructions you provide.
Token prediction (the core mechanic)
At a high level, the model is trying to estimate something like:
P(next_token | all_previous_tokens)
Generation applies that estimate in a loop (sketched in code after this list):
- Read the current context (your prompt + conversation so far).
- Compute probabilities for what should come next.
- Select a token (deterministically or via sampling).
- Append it to the output and continue.
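Here is a minimal sketch of that loop in Python, with a made-up lookup table standing in for the real network (a real model conditions on the whole context, not just the last token):

```python
import random

# Toy stand-in for a model: maps the last token to a probability
# distribution over possible next tokens.
TOY_MODEL = {
    "the": {"cat": 0.5, "dog": 0.4, "<end>": 0.1},
    "cat": {"sat": 0.7, "ran": 0.2, "<end>": 0.1},
    "dog": {"ran": 0.6, "sat": 0.3, "<end>": 0.1},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(prompt_tokens, max_tokens=10, greedy=False):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):
        dist = TOY_MODEL[tokens[-1]]                 # read context, compute probabilities
        if greedy:
            next_token = max(dist, key=dist.get)     # deterministic: take the most likely token
        else:
            options, weights = zip(*dist.items())
            next_token = random.choices(options, weights=weights)[0]  # sample
        if next_token == "<end>":                    # stopping rule
            break
        tokens.append(next_token)                    # append and repeat
    return tokens

print(generate(["the"], greedy=True))  # always ['the', 'cat', 'sat']
print(generate(["the"]))               # varies from run to run
```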
This matters because it explains common “weird behaviors”:
- Style mimicry: it tends to continue the tone, structure, and patterns it sees.
- Over-completion: it keeps going until you constrain it (format, length, stopping rules).
- Plausible fabrication: if the context implies something that isn’t true, it may confidently continue that implication.
What is a “token”?
A token is a chunk of text the model uses internally—often a word, part of a word, punctuation, or a common multi-character sequence. Tokens are not the same as characters or words.
- “Context window” means “how many tokens can fit in the model’s working memory.”
- Longer prompts use more tokens and can push out earlier details.
- Different phrasing can change tokenization and therefore change behavior.
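If you want to see this concretely, tokenizer libraries let you inspect the mapping. A quick sketch assuming the tiktoken package is installed (the exact splits depend on which encoding, and therefore which model family, you pick):

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common encoding; others split text differently

for text in ["hello", " hello", "unbelievable", "def parse_input(s):"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # show the text chunk behind each token id
    print(f"{text!r:22} -> {len(ids)} token(s): {pieces}")
```

The specific numbers don’t matter; the point is that the model sees these chunks, not your characters or words.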
If your prompt is long and messy, your output tends to be long and messy. Better prompting is often better editing: smaller scope, clearer constraints, less noise.
How the model learned (in one page)
Most LLMs are trained with a self-supervised objective: given a lot of text, learn to predict the next token. Over huge datasets, the model learns statistical regularities: grammar, common patterns, code idioms, and many facts and relationships that appear frequently in the data.
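Real training uses neural networks and gradient descent over enormous corpora, but you can see the spirit of the objective in a toy frequency model: count what follows what, then treat the counts as predictions.

```python
from collections import Counter, defaultdict

# A tiny "corpus" standing in for a lot of text.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# "Training": count how often each token follows each previous token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

# "Prediction": turn the counts for a given token into probabilities.
def next_token_probs(prev):
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

print(next_token_probs("the"))  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(next_token_probs("sat"))  # {'on': 1.0}
```

A real LLM captures far richer patterns (long-range context, syntax, code idioms), but the target is the same: make the next token predictable from what came before.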
After the base training, many models go through additional tuning so they follow instructions better and produce safer/more helpful outputs. The exact details differ by provider/model, but the key effect is consistent:
- They become better at following human instructions and producing structured answers.
- They can develop strong “assistant voice” behaviors (polite, confident, explanatory).
- They may still be wrong in subtle ways—because the objective is still text generation, not guaranteed truth.
How it generates output (inference)
During generation, the model produces a probability distribution over next tokens. The system then chooses tokens according to a decoding strategy.
- More deterministic: tends to produce the most likely continuation (repeatable, but can get stuck in a “most likely” rut).
- More random: samples from a wider set of plausible tokens (more creative, but more variance and more chance of mistakes).
You’ll control this later with knobs like temperature and related settings. The point here is: the output is the result of choices among probabilities, not a single “correct answer” being retrieved.
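To make the deterministic-vs-random trade-off concrete, here is a sketch of how a temperature setting reshapes the distribution before sampling (the candidate tokens and their scores are invented for illustration):

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into a probability distribution.

    Lower temperature -> sharper (more deterministic);
    higher temperature -> flatter (more variety, more risk).
    """
    scaled = {tok: s / temperature for tok, s in logits.items()}
    max_s = max(scaled.values())                         # subtract max for numerical stability
    exps = {tok: math.exp(s - max_s) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample(probs):
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights)[0]

# Invented raw scores for three candidate next tokens.
logits = {"return": 2.0, "print": 1.0, "raise": 0.2}

print(softmax_with_temperature(logits, temperature=0.2))  # heavily favors "return"
print(softmax_with_temperature(logits, temperature=2.0))  # probabilities much closer together
print(sample(softmax_with_temperature(logits, temperature=1.0)))
```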
When the model writes code, it hasn’t compiled it, run it, or tested it (unless it has tool access and you ask it to use tools). Treat code output as a draft until you verify it.
Why it sounds confident (even when wrong)
LLMs often present answers in a confident tone because:
- Training favors fluent continuation: confident-sounding text is common in tutorials, docs, and explanations.
- It optimizes for “helpful output”: hedging and uncertainty can look unhelpful unless explicitly requested.
- It lacks direct access to truth by default: it’s not checking the internet, your filesystem, or reality unless you give it tools and ask it to verify.
This is why the core skill in vibe coding is not “writing clever prompts.” It’s designing a loop where the model’s confidence doesn’t matter—because you verify behavior.
What this means for vibe coding
If you internalize “token predictor,” you naturally adopt safer, faster habits:
- Define “done”: you specify acceptance criteria and failure behavior so output can be evaluated.
- Constrain formats: you ask for a plan, a diff, a schema, or a checklist—not an essay.
- Prefer small steps: small diffs are easier to review and easier to debug.
- Use tools to verify: run tests, add logs, validate schemas, reproduce bugs.
- Keep context clean: remove stale instructions and summarize state between iterations.
You drive. The model types. Reality decides.
A quick practice exercise
Do this once, and you’ll feel the difference between “plausible” and “verified.”
- Pick a tiny coding task (e.g. a function that parses an input string).
- Ask the model for: (a) a plan, (b) the code, (c) 5 test cases.
- Run the tests locally.
- When something fails, paste the exact error + failing test back to the model and ask for a minimal fix diff.
The lesson: the model is great at generating drafts quickly, but correctness emerges from the loop.
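For reference, here is one hypothetical shape the exercise could take: a model-drafted parser plus a few of the tests you would run yourself (the function name and input format are invented for illustration):

```python
# Hypothetical tiny task: parse "key=value;key=value" strings into a dict.
# The model drafts the function; you run the tests to decide whether it's done.

def parse_pairs(s: str) -> dict:
    """Parse 'a=1;b=2' into {'a': '1', 'b': '2'} (model-drafted, to be verified)."""
    result = {}
    for part in s.split(";"):
        if not part:
            continue
        key, _, value = part.partition("=")
        result[key.strip()] = value.strip()
    return result

def test_basic():
    assert parse_pairs("a=1;b=2") == {"a": "1", "b": "2"}

def test_empty_string():
    assert parse_pairs("") == {}

def test_whitespace():
    assert parse_pairs(" a = 1 ; b = 2 ") == {"a": "1", "b": "2"}

if __name__ == "__main__":
    test_basic()
    test_empty_string()
    test_whitespace()
    print("all tests passed")
```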