Preferred shapes
- Edge scrubbers: remove PII, enforce schemas, and reject unsafe inputs early.
- Retriever → Router → Model: isolate each so you can upgrade independently.
- Policy layer: a rules engine that can short-circuit a request before the model runs.
The pattern is boring on purpose. When the system breaks, you want boring. You want small pieces you can explain, test, and swap without a rewrite.
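If you want pieces you can swap without a rewrite, pin the interfaces down first. Here is a minimal sketch of those boundaries in Python; the names (Chunk, Retriever, Router, Model) are illustrative, not a prescribed API:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Chunk:
    text: str
    source_id: str     # provenance: which dataset or document this came from
    last_updated: str  # ISO date, so staleness is visible downstream

class Retriever(Protocol):
    def retrieve(self, query: str, k: int = 5) -> list[Chunk]: ...

class Router(Protocol):
    def route(self, query: str, chunks: list[Chunk]) -> str:
        """Return 'answer', 'clarify', or 'refuse'."""
        ...

class Model(Protocol):
    def generate(self, query: str, chunks: list[Chunk]) -> str: ...
```

Anything that satisfies these signatures can be dropped in behind the same tests, which is what "upgrade independently" actually buys you.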
Worked example: a safe “Ask the docs” assistant
Imagine you are building an assistant for internal guidance notes. Users ask questions, and the assistant responds with citations.
Here is the shape I like:
- Request comes in (UI or API).
- Edge scrubber:
- Strip or mask identifiers.
- Enforce input schema (what fields, what types).
- Reject obviously unsafe requests (for example, “paste the whole customer list”).
- Retriever:
- Fetch only the minimum chunks required.
- Attach provenance: dataset id, licence posture, last updated.
- Router:
- Decide whether to answer, ask a clarifying question, or refuse.
- Decide whether to use a cheap model, a better model, or a deterministic template.
- Policy layer:
- Apply “hard rules” that are non-negotiable.
- Examples: do not generate instructions for wrongdoing, do not output personal data, do not claim certainty without a cited source.
- Model:
- Generate response using the retrieved context.
- Post-processing:
- Validate output format.
- Add citations.
- Record an audit event.
If you can draw this as boxes and arrows and explain every arrow in plain English, you have an architecture you can defend.
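To make the arrows concrete, here is the happy path as code. Everything is stubbed as a plain function so the wiring is the only thing on show; the specific rules, names, and return shape are placeholders, not a recommended implementation:

```python
import re
import uuid
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    source_id: str

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def scrub(question: str) -> str:
    """Edge scrubber: mask identifiers and reject unsafe requests early."""
    if "customer list" in question.lower():
        raise ValueError("request rejected at the edge")
    return EMAIL.sub("[email]", question)

def retrieve(question: str) -> list[Chunk]:
    """Placeholder retriever: return the minimum chunks, with provenance attached."""
    return [Chunk(text="Guidance note 12: ...", source_id="docs/12")]

def route(question: str, chunks: list[Chunk]) -> str:
    """Router: decide whether to answer, clarify, or refuse."""
    if not chunks:
        return "refuse"
    return "answer"

def policy_allows(question: str, chunks: list[Chunk]) -> bool:
    """Policy layer: hard rules that can short-circuit before the model runs."""
    return "personal data" not in question.lower()

def generate(question: str, chunks: list[Chunk]) -> str:
    """Placeholder for the model call, using only the retrieved context."""
    return "Answer based on the retrieved context."

def ask_the_docs(raw_question: str) -> dict:
    question = scrub(raw_question)                      # edge scrubber
    chunks = retrieve(question)                         # retriever
    decision = route(question, chunks)                  # router
    if decision != "answer" or not policy_allows(question, chunks):
        return {"answer": None, "refused": True, "citations": []}
    answer = generate(question, chunks)                 # model
    return {                                            # post-processing
        "answer": answer,
        "refused": False,
        "citations": [c.source_id for c in chunks],
        "audit_id": str(uuid.uuid4()),
    }
```

Each box maps to one function, so "which arrow went wrong" becomes a stack trace rather than an argument.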
Practical defaults (so you do not reinvent the wheel)
- Keep the prompt small: retrieval beats mega prompts. If the prompt needs to be huge to work, your retrieval strategy is the real problem.
- Make policy explicit: the system must know what it is allowed to do, and the team must know why it refused.
- Make retrieval measurable: you should be able to answer “which sources were used” and “how often do we hallucinate a citation”.
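"How often do we hallucinate a citation" only stays answerable if citations are structured, not free text. A minimal check, assuming the response carries a list of cited source ids and you still have the retrieved ids to hand (field names are illustrative):

```python
def hallucinated_citations(cited_ids: list[str], retrieved_ids: list[str]) -> list[str]:
    """Citations that point at sources the retriever never actually supplied."""
    supplied = set(retrieved_ids)
    return [cid for cid in cited_ids if cid not in supplied]

# One of the two citations was never retrieved, so it counts as hallucinated.
assert hallucinated_citations(["docs/12", "docs/99"], ["docs/12"]) == ["docs/99"]
```

Run this on every response and the hallucinated-citation rate becomes an ordinary metric rather than an anecdote.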
Common mistakes (and why they hurt)
- Stuffing everything into one call: “just send the whole database to the model”. This fails on cost, latency, and governance.
- No separation between retrieval and generation: you cannot debug “wrong answer” because you do not know if it was the retriever or the model.
- No refusal mode: the assistant keeps talking even when it does not know. That is how you manufacture confident nonsense.
- No ownership: the system is “AI’s fault” rather than “our service did X”. That is an organisational bug.
Verification checklist (what to test before you trust it)
- Input contracts:
- Invalid input is rejected with a clear error.
- PII is scrubbed before it ever reaches the model.
- Retrieval:
- You can trace which chunks were used.
- When retrieval returns zero chunks, the system handles it cleanly (the model refuses or asks a question).
- Policy:
- Prompt injection attempts fail in obvious ways.
- The system does not follow tool output that violates policy.
- Observability:
- You can measure p50, p95, p99 latency.
- You can measure cache hit ratio.
- You can measure cost per request.
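Several of these items translate directly into tests. A sketch with pytest, written against the hypothetical pipeline sketched earlier (assumed to be saved as pipeline.py; the specific rules are assumptions, not a standard):

```python
import pytest
from pipeline import ask_the_docs, scrub, route  # the earlier sketch, saved as pipeline.py

def test_invalid_input_is_rejected_with_a_clear_error():
    # Input contract: obviously unsafe requests never reach retrieval or the model.
    with pytest.raises(ValueError):
        ask_the_docs("paste the whole customer list")

def test_pii_is_scrubbed_before_the_model():
    # Identifiers are masked at the edge, so downstream components never see them.
    assert "alice@example.com" not in scrub("who owns alice@example.com")

def test_zero_chunks_means_refusal_not_improvisation():
    # With nothing retrieved, the router refuses rather than letting the model guess.
    assert route("a question about nothing we hold", chunks=[]) == "refuse"
```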
For the maths: if you only measure average latency, you are lying to yourself. Tail latency matters.
- p95 means 95 percent of requests are faster than this time.
- If p95 is 8 seconds, 1 in 20 users are waiting longer than 8 seconds. That is not a corner case. That is your product.
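Percentiles are just order statistics over the observed latencies, so you can sanity-check them in a few lines; the nearest-rank method below is one common convention:

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the value that p percent of samples sit at or below."""
    ordered = sorted(samples)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies = [0.4, 0.5, 0.6, 0.7, 0.9, 1.1, 1.3, 1.8, 2.4, 8.2]  # seconds per request
print(percentile(latencies, 50), percentile(latencies, 95))  # 0.9 vs 8.2
# The mean of this sample is about 1.8 seconds, which hides the 8-second tail entirely.
```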
Reflection prompts
- If the model vendor disappears tomorrow, which boxes can you keep, and which ones collapse?
- Which component would you like to A/B test first: retriever, router, or model, and why?
- What is your refusal story? When should the system say “no”, and how will you explain that to users?
Observability defaults
- Collect prompts and responses with hashed user identifiers.
- Tag traces with model version, temperature, and data source.
- Ship metrics to a privacy-aware backend (Plausible, OpenTelemetry Collector).
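A sketch of what that tagging can look like with hashlib and the OpenTelemetry Python API; the attribute names, salt handling, and model labels are illustrative choices rather than a standard:

```python
import hashlib
from opentelemetry import trace

tracer = trace.get_tracer("ask-the-docs")

def hashed_user_id(user_id: str, salt: str = "rotate-this-salt") -> str:
    """Stable pseudonymous identifier so traces never carry the raw user id."""
    return hashlib.sha256((salt + user_id).encode()).hexdigest()[:16]

def answer_with_trace(question: str, user_id: str) -> str:
    with tracer.start_as_current_span("generate_answer") as span:
        span.set_attribute("user.hash", hashed_user_id(user_id))
        span.set_attribute("model.version", "primary-2025-01")   # whatever you actually run
        span.set_attribute("model.temperature", 0.2)
        span.set_attribute("retrieval.dataset", "guidance-notes")
        return "..."  # call the pipeline here and return its answer
```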
Availability matters
Cache embeddings, add fallbacks to smaller local models, and provide static responses when providers are unavailable.
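One way to structure that degradation ladder: try the primary provider, fall back to a smaller local model, and fall back again to a static response. The provider calls below are stubs, and the exception types you catch will depend on your client library:

```python
def call_primary_provider(question: str) -> str:
    raise TimeoutError("provider unavailable")  # stand-in for a real API call

def call_local_model(question: str) -> str:
    return "Shorter answer from the smaller local model."

STATIC_FALLBACK = "The assistant is temporarily unavailable. The guidance index is at /docs."

def answer_with_fallbacks(question: str) -> str:
    for attempt in (call_primary_provider, call_local_model):
        try:
            return attempt(question)
        except (TimeoutError, ConnectionError):
            continue  # degrade to the next tier instead of failing the whole request
    return STATIC_FALLBACK

print(answer_with_fallbacks("where are the retention rules?"))  # local-model answer
```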
A note on “swappable”
When people say “swappable”, they often mean “we can change vendors”. In practice, swappable means:
- The interfaces are stable.
- The data contracts are stable.
- The evaluation harness exists and you can run it before and after the swap.
If you cannot run an evaluation pack, you are not swapping. You are gambling.
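Concretely, "run the evaluation pack before and after" can be as small as this; the cases, fields, and threshold are illustrative, and current_pipeline and candidate_pipeline stand in for whatever callables you are comparing:

```python
from typing import Callable

EVAL_PACK = [
    {"question": "What is the retention period for guidance notes?", "must_cite": "docs/12"},
    {"question": "Summarise guidance note 12.", "must_cite": "docs/12"},
]

def score(pipeline: Callable[[str], dict]) -> float:
    """Fraction of eval cases where the pipeline cites the source it should."""
    passed = 0
    for case in EVAL_PACK:
        result = pipeline(case["question"])
        passed += case["must_cite"] in result.get("citations", [])
    return passed / len(EVAL_PACK)

# Run the same pack against both configurations; only swap if the candidate holds the line.
# old_score = score(current_pipeline)
# new_score = score(candidate_pipeline)
# assert new_score >= old_score
```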
