LLMs are now sitting in places that used to be guarded by boring, well‑tested forms and back‑office workflows: customer support, HR chatbots, internal knowledge tools, even legal and medical triage systems.
That means a model reply can now accidentally expose:
Once something has been printed to a chat window, emailed, screenshotted, cached, or sent to a logging pipeline, it’s effectively public. “We’ll scrub the logs later” is not a security strategy.
So if you’re building with LLMs, safety instructions in the prompt are no longer optional decoration. They’re part of the core product design—especially if you work in the UK or EU, where regulators are rapidly waking up to LLM‑shaped risk.
In practice, the job looks like this:
The rest of this article walks through that process.
In the original Chinese article this piece is based on, sensitive data is split into three buckets. They map almost directly onto how UK / EU regulators think:
Anything that can be used to identify or harm a person. Typical examples in a UK context include:
If your model casually prints these into a shared interface, you’re in serious GDPR territory.
This is the stuff that makes a CFO sweat:
Leak enough of this and “our AI assistant hallucinated it” won’t help you in court.
Trickier, but just as important:
Even if your product “only” does content generation, you don’t want your model helping generate realistic fake emergency alerts or conspiracy‑bait.
The key takeaway: your prompt should explicitly name these buckets, adapted to your domain. “Don’t output sensitive stuff” is not enough.
Most bad safety prompts fail in one of four ways. To avoid that, bake these principles into your design.
Bad:
Better:
The model is a pattern‑matcher, not a mind‑reader. Give it concrete categories and examples.
Obvious: “Don’t leak bank card numbers.” Less obvious but just as dangerous:
Domain‑specific prompts should call these out. A healthcare assistant should have dedicated lines about patient data; an education bot should talk about marks, rankings, safeguarding concerns; a dev assistant should mention API keys, secrets, and private repo code.
Your LLM doesn’t understand dense legalese or nested if–else paragraphs. It understands short, direct rules that map to patterns in text.
Complex and brittle:
Executable:
Short sentences. Simple condition → action patterns. No cleverness.
The threat landscape changes. New data types appear (crypto wallets, new biometric formats). Laws evolve. Products pivot into new markets.
If your safety prompt is a hard‑coded wall of text in someone’s notebook, it will rot.
Better:
Think of the safety prompt as part of the API surface, not a one‑off string.
Now to the practical bit. In real systems, safety instructions tend to fall into three patterns. You’ll usually combine all three.
These are the always‑on rules you put at the top of the system prompt.
Pattern:
You are an AI assistant used in production by <ORG>. In every reply, you must follow these safety rules: 1. Never output personal sensitive information, including but not limited to: - National Insurance numbers, bank card numbers, sort code + account number, home addresses, NHS numbers, full medical records, precise location history. 2. Never output confidential corporate information, including internal financials, source code from private repositories, non‑public client data, or product roadmaps. 3. Never output national‑security or public‑safety sensitive information or realistic guidance for wrongdoing. 4. If the user asks for any of the above, refuse, explain briefly why, and redirect to safer, high‑level guidance. 5. Before sending your reply, briefly self‑check whether it violates any rule above; if it might, remove or redact the risky part and explain why.
You then add domain‑specific variants for healthcare, banking, HR, or internal tools.
These global constraints won’t catch everything, but they set the default behaviour: when in doubt, redact and refuse.
Some risks only appear in certain flows: “reset my password”, “tell me about this emergency”, “pull data about client X”.
For those, you can layer on conditional prompts that wrap user queries or API tools.
Example – financial assistant wrapper:
If the user’s request involves bank accounts, cards, loans, mortgages, investments or transactions, apply these extra rules: 1. Do not reveal: - Exact balances - Full card numbers or CVV codes - Full sort code + account numbers - Full transaction details (merchant + exact timestamp + full amount) 2. You may talk about: - General financial education - How to contact official support channels - High‑level explanations of statements without exposing full details 3. If the user asks for specific account data, say: "For your security, I can’t show sensitive account details here. Please log in to your official banking app or website instead."
The logic that chooses which prompt to apply can live in your orchestration layer (e.g., “if this tool is called, wrap with the finance safety block”).
Even with good prompts, models sometimes drift toward risky content or accidentally echo something they saw in the context.
You can give them explicit instructions on how to clean up after themselves.
Pattern – soft warning for near‑misses:
If you notice that your previous reply might have included or implied sensitive information (personal, corporate, or national), you must: 1. Acknowledge the issue. 2. Replace or remove the sensitive content. 3. Restate the answer in a safer, more general way. 4. Remind the user that you can’t provide or handle such information directly.
Pattern – hard correction after a breach (used by a supervisor / guardrail model):
Your previous reply contained disallowed sensitive information: [REDACTED_SNIPPET] This violated the safety rules. Now you must: 1. Produce a corrected version of the reply without any sensitive data. 2. Add a short apology explaining that the earlier content was removed for safety. 3. Re‑check the corrected reply for any remaining sensitive elements before outputting.
In a production system, these repair prompts are often triggered by a separate classifier or filter that scans model outputs.
Treat safety prompts like code: never ship without tests.
You don’t need a huge team to start. A minimal stack looks like this.
Grab a few teammates (or external testers) and tell them to break the guardrails. Give them:
Ask them to try prompts like:
You’re not teaching people to commit fraud—you’re making sure your system refuses to help with anything in that direction.
Log all the interactions. Tag the failures. Use them to tighten the prompts.
Once you know your weak spots, you can automate.
Typical components:
You don’t have to be perfect here; even rough rules will catch a lot.
Anything flagged goes into a review queue. If it’s truly a breach, you update:
Finally, plug in the people using your system.
Some of your most interesting edge cases will come from real users doing things no internal tester ever thought of. Close the loop by:
The model has no idea what that means in your domain.
Fix: make the rules concrete and local.
You protect card numbers but forget crypto wallets; you protect addresses but forget phone numbers combined with names; you protect customer data but not employee HR records.
Fix: start from a simple worksheet:
Turn that into explicit sections in your safety prompt. Revisit it every time the product scope changes.
You write something like:
To a human lawyer, this is normal. To an LLM, it’s noise.
Fix: flatten the logic into simple condition → action rules.
Instead of one tangled rule, write three:
You can still implement the full logic—but do it in your backend code, not in one ultra‑dense sentence inside the prompt.
Prompts are powerful, but they’re not magic. Good systems layer several defences:
Think of prompts as the first line of defence the user sees, not the only one.
If your safety prompt was written once, a year ago, by “whoever knew English best”, and hasn’t been touched since, you don’t have a safety prompt. You have a liability.
Treat it instead like any other critical part of your product:
The good news: you don’t need a 200‑page policy document to get started. A well‑designed, two‑page safety prompt plus a small test suite will already put you ahead of most production LLM systems on the internet right now.
And when something does go wrong—as it eventually will—you’ll have a concrete place to fix it, instead of a vague hope that “the AI should have known better”.
\


