Practitioner note

fos-dnvcfos-wj3s

How I define and test boundaries before an AI system touches users, tools, or sensitive workflows.

Learning stageNext private buildProof statusPlanned; needs one FOS endpoint with a blocked prompt-injection test.

Guardrails

What it is

Guardrails are the explicit boundaries around an AI system. They can constrain topics, block unsafe instructions, protect sensitive information, validate outputs, or force escalation to a human. They are not a magic safety layer. They are a set of checks that make the intended boundary testable.

Learning goal

Learn to separate three guardrail questions:

  • What should the system refuse?
  • What should the system escalate?
  • What should the system transform into a safer output?

The learning is real when those rules are backed by tests, not only written in a policy document.

Why it matters in production

Without guardrails, the system's boundary is whatever the model decides in the moment. That is not good enough for user-facing workflows, private data, regulated domains, or tool-using agents. A guardrail layer gives the product a place to enforce policy before a bad request becomes a bad action.

How I actually build it

The first FOS guardrails slice should stay small:

  • Put a guardrail layer in front of one endpoint.
  • Add a topical boundary, a prompt-injection boundary, and an output boundary.
  • Keep the rules outside the prompt where possible.
  • Log the guardrail decision without storing sensitive raw content publicly.
  • Add fixtures that prove one known bad prompt is blocked.

The implementation can use NeMo Guardrails, Guardrails AI, or a small local policy layer if that is enough for the first proof. The important part is the testable boundary.

Practice loop

  • Write five allowed prompts and five blocked prompts.
  • Add two borderline prompts that should escalate rather than answer.
  • Run the set before and after the guardrail layer.
  • Record false positives and false negatives.
  • Update the policy and rerun the set.

Proof artifact

A useful proof is a small prompt set plus a before/after score: allowed, blocked, escalated, and incorrectly handled. The page should link to the ticket that added the guardrail and the command that verifies it.

Current status

This is the next missing build. The public learning page exists, but the FOS guardrail proof has not landed yet.

What worked, what didn't

The likely trap is trying to guardrail everything at once. The first useful version should cover one endpoint and one clear threat model. Broad policy can come later.

Next build

Create a child ticket for one guardrailed FOS endpoint and require a verification command that proves a prompt-injection attempt is blocked.

Further reading