12.1 What safety filters can and cannot do
Overview and links for this section of the guide.
On this page
What safety filters can do (useful, real)
Safety filters (and safety settings) are good at:
- reducing harmful outputs: blocking or filtering content in risky categories,
- enforcing platform policy: providing a baseline that protects users and providers,
- creating a backstop: catching some failures even if your prompt is imperfect.
This is valuable, especially for public-facing apps.
Filters help catch bad outcomes, but they do not design your product for you.
What safety filters cannot do (common misconceptions)
Common misconceptions that cause builders trouble:
They cannot guarantee truthfulness
Safety filters are not fact-checkers. “Safe” does not mean “correct.” You still need verification habits and grounded context.
They cannot fully prevent prompt injection
If you include untrusted text in context, it can still influence the model. Filters may block some outcomes, but they don’t solve the core design problem: separating data from instructions and limiting tool power.
They cannot protect your secrets
If you put secrets into prompts or logs, filters won’t reliably save you. The safe rule is structural: don’t put secrets into prompts, and redact logs.
They cannot provide a good user experience
Refusals and blocks are product states. Filters don’t explain to your user what to do next. Your app must.
They cannot replace governance and policies
If you’re building a product, you need your own policies for data handling, consent, retention, and escalation. Filters are not your policy framework.
Over-trusting filters makes products brittle: sudden blocks feel like bugs, and unsafe inputs slip through because app-level controls are missing.
Tradeoffs: false positives and false negatives
Any filtering system has tradeoffs:
- false positives: benign requests get blocked.
- false negatives: risky requests slip through.
Your job is to make both cases survivable:
- false positives → clear UX and rephrase guidance,
- false negatives → app-level constraints, validation, and guardrails.
How to design around safety behavior
- Design tasks as transformations: “summarize this text” is safer than “do whatever the user says.”
- Use structured outputs: schemas constrain what can be produced.
- Handle blocks explicitly: treat “blocked/refused” as a normal response type.
- Limit tool power: least privilege, budgets, and stop conditions.
- Log safely: outcomes and metadata, not raw sensitive content.
A practical checklist
- Do we have a “blocked/refused” UI state with clear next steps?
- Do we avoid pasting secrets/sensitive data into prompts?
- Do we separate user-provided data from instructions?
- Do we validate structured outputs and handle invalid outputs?
- Do we have safe logging and retention rules?