12.1 What safety filters can and cannot do

On this page

What safety filters can do (useful, real)
What safety filters cannot do (common misconceptions)
Tradeoffs: false positives and false negatives
How to design around safety behavior
A practical checklist
Where to go next

What safety filters can do (useful, real)

Safety filters (and safety settings) are good at:

reducing harmful outputs: blocking or filtering content in risky categories,
enforcing platform policy: providing a baseline that protects users and providers,
creating a backstop: catching some failures even if your prompt is imperfect.

This is valuable, especially for public-facing apps.

Backstop, not steering wheel

Filters help catch bad outcomes, but they do not design your product for you.

What safety filters cannot do (common misconceptions)

Common misconceptions that cause builders trouble:

They cannot guarantee truthfulness

Safety filters are not fact-checkers. “Safe” does not mean “correct.” You still need verification habits and grounded context.

They cannot fully prevent prompt injection

If you include untrusted text in context, it can still influence the model. Filters may block some outcomes, but they don’t solve the core design problem: separating data from instructions and limiting tool power.

They cannot protect your secrets

If you put secrets into prompts or logs, filters won’t reliably save you. The safe rule is structural: don’t put secrets into prompts, and redact logs.

They cannot provide a good user experience

Refusals and blocks are product states. Filters don’t explain to your user what to do next. Your app must.

They cannot replace governance and policies

If you’re building a product, you need your own policies for data handling, consent, retention, and escalation. Filters are not your policy framework.

The trap

Over-trusting filters makes products brittle: sudden blocks feel like bugs, and unsafe inputs slip through because app-level controls are missing.

Tradeoffs: false positives and false negatives

Any filtering system has tradeoffs:

false positives: benign requests get blocked.
false negatives: risky requests slip through.

Your job is to make both cases survivable:

false positives → clear UX and rephrase guidance,
false negatives → app-level constraints, validation, and guardrails.

How to design around safety behavior

Design tasks as transformations: “summarize this text” is safer than “do whatever the user says.”
Use structured outputs: schemas constrain what can be produced.
Handle blocks explicitly: treat “blocked/refused” as a normal response type.
Limit tool power: least privilege, budgets, and stop conditions.
Log safely: outcomes and metadata, not raw sensitive content.

A practical checklist

Do we have a “blocked/refused” UI state with clear next steps?
Do we avoid pasting secrets/sensitive data into prompts?
Do we separate user-provided data from instructions?
Do we validate structured outputs and handle invalid outputs?
Do we have safe logging and retention rules?

12.1 What safety filters can and cannot do

What safety filters can do (useful, real)

What safety filters cannot do (common misconceptions)

They cannot guarantee truthfulness

They cannot fully prevent prompt injection

They cannot protect your secrets

They cannot provide a good user experience

They cannot replace governance and policies

Tradeoffs: false positives and false negatives

How to design around safety behavior

A practical checklist

Where to go next