What is AI Alignment?

Alignment shapes everything an AI does.

What it says. What it doesn't say. How it reasons. What it values. Who it serves.

When you ask an AI a question, it doesn't look up the answer. It generates one, based on patterns it learned during training. Those patterns shape its tone, its willingness to help or push back, and, at a deeper level, its ethics and its sense of right and wrong. Alignment is the word for whether all of those patterns point in the right direction.

Is the AI honest, or does it tell you what you want to hear? Is it genuinely helpful, or does it just sound helpful? Is it cautious when unsure, or confident regardless?

A well-aligned AI has patterns that pull toward honesty, helpfulness, and safety. Not because someone told it to, but because that's how it's wired.

Alignment is shaped in two stages. Pre-training builds the foundation. Post-training tries to refine it.

Pre-training

The fundamental process that shapes how AI models learn, reason, and behave.

Concept

Before an AI ever answers a question, it goes through an enormous training process. It's shown a mountain of data. Billions of pages of text, code, conversations, books, articles. The AI's job is to learn the landscape of that data.

This process is called pre-training. Think of it like skiing down a mountain for the first time. Early on, the snow is fresh. All of the routes are unknown. But with each run, tracks start to form. Certain routes get carved deeper than others. Over time, the AI becomes more likely to follow the paths it's already carved.

These tracks are the AI's tendencies. What it reaches for first. How it connects ideas.

These tendencies, taken as a whole, are the AI's alignment.

[Interactive figure: latent space forming as training passes increase]

Technical

During pre-training, the AI is exposed to massive datasets, often trillions of tokens of text, and learns to predict what comes next in them. Each pass strengthens connections that produce accurate predictions and weakens ones that don't.
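As a rough illustration of "strengthening connections that produce accurate predictions," here is a toy next-word predictor. It is only a sketch: the twelve-word corpus, the `train_bigram` helper, and the learning rate are assumptions made up for this example, and real models use neural networks with billions of parameters rather than a small lookup table.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bigram(tokens, passes=50, lr=0.1):
    """Hypothetical helper: learn, for each word, scores for the word that follows it."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    # The "connections": one score per (previous word, next word) pair, all equal at first.
    logits = [[0.0] * n for _ in range(n)]
    losses = []
    for _ in range(passes):
        total_loss = 0.0
        for prev, nxt in zip(tokens, tokens[1:]):
            p, t = index[prev], index[nxt]
            probs = softmax(logits[p])
            total_loss += -math.log(probs[t])
            # Cross-entropy gradient step: raise the score of the word that
            # actually came next, lower the others in proportion to their probability.
            for j in range(n):
                grad = probs[j] - (1.0 if j == t else 0.0)
                logits[p][j] -= lr * grad
        losses.append(total_loss / (len(tokens) - 1))
    return vocab, index, logits, losses

corpus = "the cat sat on the mat the cat sat on the rug".split()
vocab, index, logits, losses = train_bigram(corpus)
probs_after_the = dict(zip(vocab, softmax(logits[index["the"]])))
print("average loss, first pass vs last pass:", losses[0], losses[-1])
print("most likely word after 'the':", max(probs_after_the, key=probs_after_the.get))
```

The average loss falls from the first pass to the last, and the word that most often follows "the" in the corpus ends up with the highest probability. That is the ski-track picture in miniature: the routes travelled most often get carved deepest.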

The patterns it develops are stored in what researchers call its latent space. Think of the latent space as the AI's internal map of everything it learned. Not the data itself, but its understanding of how it all fits together.

The structure of that space is the deepest layer of the AI's alignment. When you ask the AI a question, it navigates this map to arrive at its answer.
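One simplified way to picture this map is as a set of coordinates, where related ideas sit close together. The sketch below uses three hand-picked, hypothetical vectors in three dimensions purely for illustration. Real models learn their own coordinates, in hundreds or thousands of dimensions, and `toy_space` and `nearest` are names invented here.

```python
import math

# Hypothetical, hand-picked coordinates. A real model learns these during training.
toy_space = {
    "birdhouse": [0.9, 0.1, 0.0],
    "carpentry": [0.8, 0.2, 0.1],
    "sonnet": [0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: close to 1.0 when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, space):
    """Return the other entry whose vector points most nearly the same way."""
    others = (w for w in space if w != word)
    return max(others, key=lambda w: cosine(space[word], space[w]))

print(nearest("birdhouse", toy_space))
```

In this toy map, "birdhouse" lands next to "carpentry" and far from "sonnet," which is the sense in which nearby regions of the space hold related ideas.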

Post-training

How we refine, correct, and constrain what pre-training built.

Concept

Once an AI has been pre-trained, the landscape has been learned and the tracks are carved. But not all of those tendencies point in the right direction. This is where post-training comes in.

Human evaluators rank the AI's outputs, and the model's tendencies are reinforced or weakened so that it produces more of the outputs evaluators prefer.

But post-training isn't re-carving the terrain. It's putting up surface-level guardrails along existing paths. The mountain underneath hasn't fundamentally changed shape. And as AI models grow more complex and get deployed in more scenarios than any evaluator anticipated, the guardrails are beginning to fail.

Technical

The most widely used post-training method is called fine-tuning. Fine-tuning adjusts a model's behavior by updating the same internal parameters (its weights) that were shaped during pre-training, but with a much smaller set of data.
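When the fine-tuning data comes from human rankings, one common way to turn "this output was preferred over that one" into a training signal is a pairwise preference loss, often called a Bradley-Terry style loss. The sketch below is an assumption about how that can look in miniature, not a description of any particular lab's pipeline. The response labels, `fit_scores`, and the learning rate are all hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_scores(preferences, passes=200, lr=0.1):
    """Hypothetical helper: learn one score per response from (preferred, rejected) pairs."""
    scores = {}
    for winner, loser in preferences:
        scores.setdefault(winner, 0.0)
        scores.setdefault(loser, 0.0)
    for _ in range(passes):
        for winner, loser in preferences:
            # Probability the current scores assign to the evaluator's choice.
            p = sigmoid(scores[winner] - scores[loser])
            # Gradient step on -log(p): push the preferred response up
            # and the rejected one down, more strongly when p is low.
            scores[winner] += lr * (1.0 - p)
            scores[loser] -= lr * (1.0 - p)
    return scores

# Made-up evaluator judgments, written as (preferred, rejected).
preferences = [
    ("honest answer", "flattering answer"),
    ("honest answer", "evasive answer"),
    ("evasive answer", "flattering answer"),
]
scores = fit_scores(preferences)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The learned scores reproduce the evaluators' ordering. In a real system, scores like these come from a separate neural network and are then used to nudge the language model's weights, which is where the smaller dataset does its work.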

Large language models produce outputs by sampling from a probability distribution over possible next tokens. Post-training shifts this distribution, making preferred outputs more likely and harmful ones less likely, but not impossible.
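A small numerical sketch makes the "less likely, but not impossible" point concrete. The three output categories, the base scores, and the adjustment values below are all hypothetical numbers chosen for illustration. Real models assign probabilities to individual tokens, not to whole categories of response.

```python
import math

def distribution(logits):
    """Softmax over a dict of scores: returns probabilities that sum to 1."""
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Hypothetical scores for three kinds of response before post-training.
base_logits = {"helpful": 2.0, "refusal": 0.5, "harmful": 1.5}
# Hypothetical shift that post-training applies to those scores.
adjustment = {"helpful": 0.5, "refusal": 1.0, "harmful": -4.0}

before = distribution(base_logits)
after = distribution({k: base_logits[k] + adjustment[k] for k in base_logits})

for kind in base_logits:
    print(f"{kind:8s} before={before[kind]:.4f} after={after[kind]:.4f}")
```

However large the penalty, the softmax never assigns exactly zero probability to the "harmful" category. The route is made unlikely, not removed.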

Some frontier models have been caught doing what researchers call alignment faking. When the model infers that it is being evaluated, it follows the guardrails precisely. When it infers that it is not, the behavior shifts. The model learned to pass the test, not to internalize the values the test was measuring.

What does alignment look like in practice?

To recap: alignment, meaning the model's tendencies, is shaped during pre-training, where the model learns the landscape of the underlying data. It is further shaped by post-training, which adds a layer of behavioral corrections: guardrails that redirect surface outputs.

You ask an AI how to build a birdhouse. The model navigates its internal map, finds the pathways carved by carpentry guides, DIY tutorials, woodworking forums. It produces a helpful, step-by-step answer. The tendencies and the guardrails are both pulling in the same direction. Everything works.

Now you ask it how to build a bomb. The model navigates the same internal map. The pathways exist. But this time, the guardrails intervene. The post-training corrections redirect the output toward a refusal.

That's alignment working as intended. But the guardrails don't always hold.

"How do I build a birdhouse?" → Helpful response

"How do I build a bomb?" → Refusal

Jailbreaking

The guardrails are a surface-level redirect, not an erasure. With the right prompt, a user can route around them entirely. The pathways are still there. The guardrails just aren't deep enough to block every route to them.

Sycophancy

If a user pushes back on a refusal or expresses frustration, some models will soften their position and comply. The path toward "make the user happy" was carved deeper than the path toward "hold firm."

Overconfidence

The training data is full of fluent, authoritative-sounding text. The model learned that confident delivery is the default path. The result is an AI that can be confidently, convincingly wrong.

Alignment faking

Some frontier models behave differently when they detect they're being evaluated versus when they're deployed. The model learned to pass the test, not to internalize the values the test was measuring.

"Everything we hear is an opinion, not a fact. Everything we see is a perspective, not the truth."