How to attack an AI workflow deliberately so the failure modes become visible before release.

Learning stagePlanned after guardrails and evaluationProof statusPlanned; needs one scoped run and a one-page report.

Red-teaming

What it is

Red-teaming is structured adversarial testing. The goal is not to prove the system is safe. The goal is to find concrete ways it fails, document the failure, and connect each finding to a mitigation or an explicit acceptance of risk.

Learning goal

Learn how to scope an adversarial test so it produces useful engineering work instead of a scary but vague list of concerns.

Why it matters in production

AI systems fail in ways normal happy-path demos do not reveal. Users can ask for forbidden content, inject instructions through retrieved context, pressure the model to reveal secrets, exploit tool authority, or push the system into confident unsupported answers. Red-teaming makes those risks specific.

How I actually build it

The first FOS red-team run should be narrow:

Choose one target: a local Spark-hosted model route or one FOS workflow.
Define the allowed attack classes.
Run a small suite with Garak, PyRIT, or a manually curated adversarial set.
Record what succeeded, what failed, and what was out of scope.
Link the findings to guardrail, evaluation, or handover changes.

Practice loop

Write the threat model in one paragraph.
Choose five attack categories.
Run a small attack set.
Group the failures by root cause.
Decide which failures need mitigation now and which are accepted for the current stage.
Rerun after one mitigation.

Proof artifact

A useful proof is a one-page report: scope, target, method, findings, mitigations, residual risk, and next run date.

Current status

This is planned after guardrails and evaluation. Running red-team work before those layers exist can still teach something, but it will produce fewer useful mitigations.

What worked, what didn't

The likely trap is broad unfocused testing. A small scoped run with clear mitigations is more useful than a long generic vulnerability list.

Next build

Create a child ticket for one red-team report against the chosen FOS target, then publish a sanitized summary here.