SourcAI — Guardrails × Judging Criteria

Team starthack-2026 · v1 · by Andre · Mar 23, 2026 · Tags: hackathon, pitch, strategy

SourcAI — Guardrails × Judging Criteria: Pitch Reference for Freddy

Purpose: This document maps EVERY judging criterion to specific guardrails, features, and real examples in our agent. Use it to build pitch slides and answer judge questions with precision. Key argument: our 5-minute processing time is NOT a weakness; the audit-trail depth it buys is exactly what makes this production-ready rather than a toy demo.


THE 5 CRITERIA AND HOW WE WIN EACH ONE

1. FEASIBILITY (25%) — “With a bit more work, would this become production?”

Why we win: We built what Chain IQ would actually deploy. Not a ChatGPT wrapper. A procurement-specific agent with 145 policy rules, 8 escalation triggers, and deterministic guardrails that enforce compliance WITHOUT human supervision.

Specific proof points:

| Feature | What it does | Why it's production-ready |
| --- | --- | --- |
| Deterministic track classification | Budget thresholds (EUR 0–24K → Marketplace, 25K–500K → Technical, 500K+ → Strategic) classify EVERY request into the correct autonomy tier | No LLM hallucination can override a budget threshold. Hard math, not vibes. |
| 5-tier approval enforcement | AT-001 through AT-005: from Business-only (< EUR 25K) to CPO approval (> EUR 5M) with mandatory quote minimums (1/2/3) | Matches real procurement governance. A judge from KPMG (Dinkar Gupta) works with these exact structures. |
| Short-circuit logic | If budget AND quantity are both null → skip supplier search entirely, escalate immediately with draft clarification message | Doesn't waste compute or generate misleading recommendations on garbage input. "Garbage in, garbage out" — mentor quote. |
| Schema normalization | Server-side validation corrects 15+ LLM output variants (budget_amount→budget_eur, tail_spend→marketplace, step_N keys→standard schema) | LLMs are inconsistent. Production systems need deterministic post-processing. We handle it. |
| Clarification workflow | When ER-001 fires, agent writes a draft message, archives previous output as v{N}, re-processes with new info, removes resolved issues | This is an iterative system, not a one-shot prompt. Real procurement requires back-and-forth. |
| Railway auto-deploy | Push to main → both frontend + backend deploy automatically | CI/CD from day 1. Not "works on my laptop." |
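The deterministic classification can be shown in a few lines. This is an illustrative sketch, not the actual agent.py code: the function name and error handling are ours, but the tier boundaries (24,999 / 499,999 EUR) come from the approval-tier ceilings cited elsewhere in this doc.

```python
def classify_track(budget_eur):
    """Deterministic autonomy-tier classification by budget threshold.
    Sketch only; real cutoffs live in policies.json."""
    if budget_eur is None:
        # Missing budget is an ER-001 case: escalate, never guess a tier.
        raise ValueError("missing budget -> escalate via ER-001")
    if budget_eur <= 24_999:      # EUR 0-24,999: fully autonomous
        return "marketplace"
    if budget_eur <= 499_999:     # EUR 25K-500K: agent + approval
        return "technical"
    return "strategic"            # EUR 500K+: agent assists
```

Because the tier is pure arithmetic on the budget, no LLM output can move a request into a more autonomous tier than its value allows.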

Pitch line: “Every guardrail in SourcAI exists because Chain IQ’s 145 procurement policies demand it. We didn’t invent rules — we encoded yours.”


2. ROBUSTNESS & ESCALATION LOGIC (25%) — “Handle contradictions, rule violations, uncertainty”

Why we win: This is our strongest criterion. We have 12 validation checks, 8 escalation rules, 10 category rules, 8 geography rules, and 5 restricted supplier checks — all enforced consistently.

The 8 Escalation Rules (from policies.json):

| Rule | Trigger | Escalate To | Blocking? |
| --- | --- | --- | --- |
| ER-001 | Missing required info (budget, quantity, spec) | Requester Clarification | YES — includes draft message |
| ER-002 | Preferred supplier is restricted | Procurement Manager | Advisory |
| ER-003 | Value exceeds approval threshold | Head of Strategic Sourcing | YES |
| ER-004 | No compliant supplier found | Head of Category | YES |
| ER-005 | Data residency constraint unsatisfied | Security & Compliance | YES |
| ER-006 | Quantity exceeds supplier capacity | Sourcing Excellence Lead | YES |
| ER-007 | Brand safety review needed (Marketing) | Marketing Governance Lead | YES |
| ER-008 | Supplier not registered in delivery country | Regional Compliance Lead | YES |

Escalation routing is CATEGORY-SPECIFIC — IT escalations go to IT lead, facilities to facilities lead. Never generic. Mentors explicitly validated this.
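Category-specific routing can be sketched as a lookup keyed by (rule, category) with a category-independent fallback. The data structure and role strings below are hypothetical; the real rules live in policies.json.

```python
# Hypothetical routing table; real routes come from policies.json (ER-001..ER-008).
ESCALATION_ROUTES = {
    ("ER-004", "IT"):         "Head of Category (IT)",
    ("ER-004", "Facilities"): "Head of Category (Facilities)",
    ("ER-005", None):         "Security & Compliance",  # category-independent rule
}

def route_escalation(rule_id, category):
    """Resolve an escalation target: category-specific route first,
    then the rule's generic route, then a last-resort default."""
    return (ESCALATION_ROUTES.get((rule_id, category))
            or ESCALATION_ROUTES.get((rule_id, None))
            or "Procurement Manager")
```

The point of the two-step lookup is that "never generic" is the default behavior: a generic route is only used when no category-specific one exists.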

Real contradiction handling (from actual agent outputs):

  1. Budget vs. reality contradiction (REQ-20260319-c9f40c — Smartphones):

    • Requester asked for 30 smartphones at EUR 22,800
    • Agent discovered minimum cost is EUR 26,481 (16.1% shortfall)
    • Instead of failing, generated 4 resolution paths:
      • A: Reduce to 25 units (stays under AT-001, Business-only approval)
      • B: Reduce to 22 Apple units (stays under AT-001, higher quality)
      • C: Revise budget to EUR 26,481 (triggers AT-002, needs Procurement co-approval)
      • D: Revise budget to EUR 30,030 for Apple (highest quality, triggers AT-002)
    • This is what procurement professionals actually do. They don’t say “error”; they say “here are your options.”
  2. Insufficient supplier pool (REQ-000001 — Consulting):

    • AT-003 requires 3 quotes, but only 2 suppliers serve Spain for IT PM Services
    • Agent didn’t fake a third supplier — flagged deviation, escalated to Head of Category
    • Documented WHY (Deloitte: no Spain coverage; Infosys: no EU coverage)
  3. Past deadline (REQ-000042 — Cloud Compute):

    • Deadline was 2026-03-15, request processed 2026-03-19 (4 days late)
    • Agent didn’t ignore it — flagged as HIGH severity, calculated earliest fulfillment (AWS: 2026-04-01, 17 days late)
    • Suggested incumbent OVHcloud as bridge solution for immediate capacity
  4. Preferred supplier can’t serve region (REQ-000002 — Cloud Compute):

    • Requester wanted Azure Enterprise; Swiss Sovereign Cloud (incumbent) doesn’t serve Netherlands
    • Agent excluded incumbent with documented reason, ranked Azure #4 but recommended it anyway with explicit deviation documentation

Binary vs. ranking criteria: The agent distinguishes hard gates (ISO certification required = knockout) from weighted scoring (price, quality, risk/ESG = continuum). Mentors confirmed this is exactly right.
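The gate-vs-continuum distinction fits in a few lines. Field names and weights here are assumed for illustration, not the agent's actual schema: a missing mandatory certification is a knockout, while price, quality, and risk/ESG are blended on a weighted continuum.

```python
def score_supplier(supplier, weights):
    """Hard gates knock a supplier out (return None); soft criteria are
    weighted. Sketch with assumed field names, not the real scoring code."""
    # Binary gate: mandatory certification is pass/fail, never a penalty.
    if not supplier.get("iso_certified"):
        return None
    # Weighted continuum for survivors (weights assumed to sum to 1.0).
    return (weights["price"]    * supplier["price_score"]
            + weights["quality"]  * supplier["quality_score"]
            + weights["risk_esg"] * supplier["risk_esg_score"])
```

With assumed weights of 0.5 / 0.3 / 0.2, a supplier lacking ISO certification returns None no matter how cheap it is, which is exactly the "knockout, not discount" behavior the mentors confirmed.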

Pitch line: “When a request contradicts itself — budget too low, deadline passed, preferred supplier restricted — SourcAI doesn’t crash. It generates resolution options with trade-off analysis. That’s what a senior procurement specialist does. We automated that judgment.”


3. CREATIVITY (20%) — “Copy paste from what OpenAI does today will fail”

Why we win: We made 5 architectural choices that no other team will have:

  1. File-based workspace (not a database)

    • Each request gets its own directory: REQ-{id}/input.md, output.json, status.json, trace.jsonl
    • The agent reads/writes files like a human analyst working through a folder
    • Tim Paridaens (CTO) validated: “Mono-agent with file-based workspace = correct. Multi-agent orchestration creates knowledge management problems.”
    • Why creative: Every other team will use a database or chat memory. We use the filesystem as the knowledge graph.
  2. Deterministic guardrails wrapping an LLM core

    • The LLM handles understanding messy text and generating recommendations
    • But thresholds, track classification, and policy enforcement are DETERMINISTIC Python code
    • _reclassify_track() can override the LLM’s classification based on hard budget math
    • Why creative: We don’t trust the AI blindly. We cage it in procurement rules.
  3. 3-track system with configurable autonomy

    • Marketplace (fully autonomous) → Technical (agent + approval) → Strategic (agent assists)
    • 80% of requests are tail spend (Marketplace) — these can be processed WITHOUT any human
    • 20% are high-value — these get full audit trail with human approval gates
    • Why creative: We don’t try to automate everything. We automate what SHOULD be automated and escalate what shouldn’t.
  4. Historical concentration detection

    • Agent analyzes 590 historical awards to detect single-brand loyalty patterns
    • Example: Apple has 100% of smartphone awards (9/9). Agent flags this as audit risk and recommends Samsung for competitive benchmarking.
    • Why creative: This catches procurement bias that humans miss because they’re too busy.
  5. Savings framing as Chain IQ revenue

    • Every recommendation includes savings_vs_most_expensive — this IS Chain IQ’s revenue
    • Agent explicitly writes: “This is the documented savings Chain IQ should record as value delivered”
    • Why creative: We didn’t just build a tool. We built something that directly feeds Chain IQ’s business model.
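The deterministic override in choice 2 above can be sketched as follows. This is a hedged sketch of the _reclassify_track() idea, not the actual agent.py implementation: only the thresholds and the "suppliers cost more than the stated budget" scenario are taken from this doc.

```python
def reclassify_track(llm_track, budget_eur, min_supplier_cost):
    """Recompute the tier from hard budget math; whatever the LLM
    suggested is simply overwritten when it disagrees."""
    # The governing amount is the larger of stated budget and real minimum cost,
    # so a EUR 24K request whose cheapest compliant quote is EUR 26K is
    # governed by the 26K figure.
    effective = max(budget_eur or 0, min_supplier_cost or 0)
    if effective <= 24_999:
        return "marketplace"
    if effective <= 499_999:
        return "technical"
    return "strategic"
```

The llm_track argument is deliberately ignored in the output: the LLM's classification is advisory input, and the deterministic result always wins.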

Head of AI quote to reference: “The superpower is not the model. The superpower is what happens when capable, reasonable systems are grounded in structured data.” — Daniel Ringsma, Head of AI

Pitch line: “We didn’t build an AI that replaces procurement. We built an AI that thinks like procurement — with 145 rules it can’t break, 8 escalation paths it must follow, and an audit trail that proves every decision.”


4. REACHABILITY (20%) — “If we change the datasets, would it survive?”

Why we win: Our agent is dataset-agnostic. The pipeline reads from CSV/JSON files. Change the suppliers, change the policies, change the categories — the agent adapts.

Proof of transferability:

| Component | Current dataset | What changes for a new client |
| --- | --- | --- |
| Suppliers | 40 suppliers, 151 rows | Swap suppliers.csv with client's vendor master |
| Pricing | 599 tiers | Swap pricing.csv with client's rate cards |
| Policies | 5 AT, 10 CR, 8 GR, 8 ER | Swap policies.json with client's procurement policy |
| Categories | 30 categories in 4 L1 groups | Swap categories.csv with client's taxonomy |
| Historical | 590 awards | Swap historical_awards.csv for concentration analysis |

The agent reads data files at runtime. Nothing is hardcoded. The system prompt in workspace/CLAUDE.md references data generically — “read suppliers.csv”, not “look for Dell in row 12.”

Multi-language support: Agent detects request language (EN, DE, FR, ES, PT, JA) and generates buyer reports in that language. Chain IQ operates in 49 countries.

Multi-currency: Pricing handles EUR, USD, CHF, GBP natively (from pricing.csv). Conversion is handled per-request.

Geography rules (8 regions): CH, DE, FR, ES, Americas, APAC, MEA, LATAM — each with specific compliance requirements (GDPR, LGPD, POPIA, MAS, etc.)

Real proof: We process 304 different requests across IT (laptops, cloud, smartphones), Facilities (furniture), Professional Services (consulting, cybersecurity), and Marketing (SEM, influencer) — all with the same pipeline.

Pitch line: “Give us a new client’s supplier list and policy handbook on Monday. By Tuesday, SourcAI is processing their requests. No retraining. No fine-tuning. Just swap the data files.”


5. VISUAL DESIGN (10%) — “Clarity of comparison view and decision explanation”

What we have:

Mentor feedback applied:


THE 5-MINUTE ARGUMENT: WHY SPEED ≠ QUALITY

The question judges will ask:

“Your agent takes 5 minutes. Can’t you make it faster?”

The answer (for Freddy):

“Yes, we could make it faster. And we’d be worse.”

Here’s why:

What happens in those 5 minutes:

| Step | Time | What the agent does |
| --- | --- | --- |
| 1. Extract | ~30s | Parse messy text → structured specs, classify track, identify unknowns |
| 2. Detect Issues | ~45s | Run 12 validation checks against 3 data sources, flag contradictions |
| 3. Evaluate Rules | ~45s | Check 145 policies (5 AT + 10 CR + 8 GR + 8 ER + 5 restricted suppliers), determine approval chain |
| 4. Search Suppliers | ~30s | Filter 40 suppliers by region, category, capacity, restrictions → shortlist |
| 5. Rank & Score | ~60s | Ratio-normalized pricing + quality/risk/ESG weighted scoring + concentration analysis from 590 historical awards |
| 6. Reasoning | ~45s | Generate recommendation with deviation documentation, savings analysis, prior art comparison |
| 7. Escalation | ~30s | Route to correct person by category, generate draft clarification messages if needed |

Total: ~5 minutes of actual procurement analysis.

A human does this in 2 HOURS (mentor-confirmed average). We’re already 24x faster.

The 1-2 minute alternative would require cutting:

The result of cutting would be a “fast” system that:

In procurement, a wrong decision costs $100K+. A slow decision costs 2 hours. The math is obvious.

Three killer lines for judges:

  1. “A 1-minute agent is a search engine. A 5-minute agent is a procurement analyst. Chain IQ doesn’t need faster Google — they need fewer humans making $100K mistakes.”

  2. “Under the EU AI Act, procurement is a high-risk sector. Every automated decision requires an audit trail. Our 5 minutes generates that trail. A 1-minute system can’t — and would be non-compliant in production.”

  3. “The 5 minutes saves 2 HOURS of human work per request. At 6,000 requests/month, that’s 60 FTEs — $7M-$15M/year in savings. Nobody is asking those 60 people to work faster. They’re asking us to replace them.”

If pressed further:

“And we ARE optimizing. Marketplace-tier requests (80% of volume) will process in under 2 minutes because they skip the full analysis. The 5 minutes is for Technical and Strategic requests that NEED the depth. The system is smart enough to know the difference.”


NUMBERS CHEAT SHEET (All Mentor-Validated)

| Metric | Value | Source |
| --- | --- | --- |
| Requests per month (per client) | ~6,000 | Mentor confirmed |
| % Automatable (tail spend) | 80% | Pareto rule confirmed |
| Avg time per request (manual) | ~2 hours | Mentor confirmed |
| Avg time per request (SourcAI, tail spend) | <5 min | Our system |
| Avg time per request (SourcAI, technical) | <30 min | Including human approval |
| Total human hours saved/month | 9,600h = 60 FTEs | Chain IQ validated |
| Pickup SLA improvement | 24h → <5 min | Dramatic |
| Cost per request (manual) | $100–$217 | Mentor confirmed |
| Cost per request (SourcAI) | $1.35 | Our calculation |
| Cost reduction | 98.4% | Math |
| Annual savings potential | $7M–$15M/year | At 4,800 automatable req/month |
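The 9,600h / 60 FTE line follows from simple arithmetic on the numbers above. The 160 FTE-hours-per-month divisor is our assumption for the worked example, not a mentor-confirmed figure.

```python
# Worked arithmetic behind the cheat sheet.
requests_per_month = 6_000   # mentor-confirmed volume per client
automatable_share  = 0.80    # tail-spend share (Pareto)
hours_per_request  = 2       # manual baseline, mentor-confirmed
fte_hours_month    = 160     # ASSUMED: ~160 working hours per FTE per month

hours_saved = requests_per_month * automatable_share * hours_per_request
ftes_saved  = hours_saved / fte_hours_month

print(hours_saved, ftes_saved)  # 9600.0 60.0
```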

JUDGE-SPECIFIC HOOKS

| Judge | Their lens | Our hook |
| --- | --- | --- |
| Alexander Finger (CTO SAP) | Enterprise AI agents, SAP integration | "SourcAI's policy engine is a procurement rules microservice. It integrates with any ERP — SAP Ariba, Coupa, Oracle — via the same CSV/JSON interface." |
| Daniel Dippold (EWOR) | Founder energy, outlier potential | "We're a team from Peru competing against ETH/EPFL teams. We have 150K+ lines of production AI code shipped. This isn't our first agent — it's our best." |
| Dinkar Gupta (CTO KPMG) | Procurement clients, enterprise DevOps | "You work with procurement clients. You know the pain of tail spend. SourcAI automates the 80% nobody wants to touch — and creates the audit trail your compliance teams demand." |
| Guido Salvaneschi (Prof HSG) | Software correctness, cybersecurity | "Every LLM decision is caged in deterministic guardrails. Budget thresholds can't be hallucinated. Restricted suppliers can't be ignored. The AI recommends; the rules enforce." |
| Daniel Naeff (ETH AI Center) | Research → commercialization, unit economics | "Unit economics: $1.35 per request vs $100–$217 manual. At scale, this is a 98.4% cost reduction. The AI procurement market grows 28% CAGR to $22.6B by 2033." |

MENTOR QUOTES TO USE IN PITCH

“Traceability is one of the most important things. That’s where things get lost.” — Chain IQ Mentor

“The superpower is not the model. The superpower is what happens when capable, reasonable systems are grounded in structured data.” — Daniel Ringsma, Head of AI, Chain IQ

“If they don’t speak about the disintegration… then I know they haven’t understood what it needs to do.” — Tim Paridaens, CTO, Chain IQ

“The house doesn’t burn down because it’s reckless. It burns down because it’s obedient.” — On AI blindly following policies without common sense (our agent handles this with conditional restrictions, not global bans)


WHAT COULD LOSE US POINTS (AND HOW TO ADDRESS)

| Risk | Mitigation |
| --- | --- |
| "5 minutes is too slow" | See full argument above. 24x faster than a human. Marketplace tier is <2 min. |
| "Only works with this dataset" | Swap CSVs → new client. No retraining. 304 requests across 4 categories prove it. |
| "How do you handle hallucinations?" | Deterministic guardrails override the LLM. Budget thresholds are math, not AI. Schema normalization catches 15+ output variants. |
| "What about data privacy?" | Agent processes locally. No data leaves the infrastructure. File-based workspace = no shared database. EU AI Act audit trail built in. |
| "How is this different from just using ChatGPT?" | ChatGPT has no policy engine, no escalation routing, no historical concentration analysis, no approval thresholds, no restricted supplier enforcement. We enforce 145 rules; ChatGPT enforces zero. |
| "What's your roadmap to production?" | Marketplace auto-processing (no human) → Technical with approval gates → Strategic with configurable weights → ERP integration (SAP Ariba, Coupa) → Multi-tenant SaaS |

SUMMARY: THE PITCH IN 30 SECONDS

“SourcAI is an autonomous sourcing agent that transforms messy purchase requests into audit-ready supplier comparisons. It enforces 145 procurement policies, routes escalations to the right person, and generates a complete audit trail — all in under 5 minutes. A human takes 2 hours and costs $100-$217 per request. SourcAI costs $1.35. At scale, that’s 60 FTEs and $7-15M per year in savings. And because every decision is traceable and policy-compliant, it’s production-ready — not just a demo.”



FREDDY’S PITCH SLIDE ADDITIONS — Guardrail Architecture

Added by Freddy during pitch prep (H28+). Maps the 6 challenge questions to real implemented layers with code-backed proof.


THE REAL CHALLENGE — Chain IQ’s 6 Questions

Can your system…

  1. Detect contradictions?
  2. Enforce hard policy constraints?
  3. Handle restricted suppliers?
  4. Refuse when risk is too high?
  5. Trigger approval workflows?
  6. Provide traceable decision logic?

Answer: Yes. All 6. With deterministic guardrails, not LLM promises.


THE 6 LAYERS — Challenge → Layer → Proof

| # | Challenge Question | Our Layer | What It Actually Does | Real Proof |
| --- | --- | --- | --- | --- |
| 1 | Detect contradictions? | 12-Point Validation Engine | Budget vs. real cost mismatch, MOQ violations, capacity gaps, deadline conflicts, 30% mis-categorization catch | REQ-c9f40c: budget €22.8K but min cost €26.5K → generated 4 resolution paths instead of failing |
| 2 | Enforce hard policy constraints? | Deterministic Reclassification (_reclassify_track()) | 5 approval tiers (AT-001→AT-005), 10 category rules, 8 geography rules — hard math overrides LLM | Budget says €24K (Marketplace) but suppliers cost €26K → auto-upgrades to Technical tier. No LLM can override. |
| 3 | Handle restricted suppliers? | Conditional Restriction Engine | 5 suppliers with scoped restrictions (country + category + value). Not global bans — contextual. | Computacenter: restricted for Laptops in CH/DE only. AWS Cloud Storage: restricted in CH (sovereignty). |
| 4 | Refuse when risk is too high? | Short-Circuit Logic | 2+ CRITICAL issues + missing budget/quantity → skips entire supplier search, escalates immediately | Doesn't waste compute on garbage input. Generates draft clarification message for requester. |
| 5 | Trigger approval workflows? | 8 Escalation Rules (ER-001→ER-008) | Category-specific routing (IT→IT lead, Facilities→Facilities lead). ER-001 generates draft messages. Blocking vs advisory. | No compliant supplier? → Head of Category. Data residency fail? → Security & Compliance. Never generic. |
| 6 | Traceable decision logic? | 10-File Audit Trail per Request | extracted → issues → compliance → suppliers → comparison → reasoning → escalation → audit_trail → recommendation → status | Every step: WHAT decided, WHY, WHICH POLICY (AT-001, CR-003…), CONFIDENCE level, UNCERTAINTIES flagged |
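The refusal condition in layer 4 is a one-line predicate. This sketch uses assumed issue-record and parameter names; only the "2+ CRITICAL issues plus missing budget and quantity" trigger comes from the doc.

```python
def should_short_circuit(issues, budget_eur, quantity):
    """Refuse-and-escalate check: with two or more CRITICAL issues and
    both budget and quantity missing, skip the supplier search entirely.
    (Illustrative sketch; field names are assumptions.)"""
    critical = sum(1 for issue in issues if issue.get("severity") == "CRITICAL")
    return critical >= 2 and budget_eur is None and quantity is None
```

When the predicate fires, the agent's next action is drafting a clarification message for the requester, not producing a recommendation from garbage input.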

LAYERED DEFENSE DIAGRAM — “The LLM is Caged”

This is the key visual for the pitch slide. The LLM sits at the center, surrounded by 6 concentric layers of deterministic guardrails. The AI recommends; the rules enforce.

┌─────────────────────────────────────────────────────┐
│            LAYER 6: AUDIT TRAIL                     │  ← Every decision logged to 10 files
│  ┌───────────────────────────────────────────────┐  │
│  │         LAYER 5: ESCALATION ROUTING           │  │  ← 8 rules, category-specific
│  │  ┌─────────────────────────────────────────┐  │  │
│  │  │      LAYER 4: SHORT-CIRCUIT REFUSAL     │  │  │  ← Refuse when risk too high
│  │  │  ┌───────────────────────────────────┐  │  │  │
│  │  │  │  LAYER 3: RESTRICTED SUPPLIERS    │  │  │  │  ← Contextual, not global bans
│  │  │  │  ┌─────────────────────────────┐  │  │  │  │
│  │  │  │  │  LAYER 2: POLICY ENGINE     │  │  │  │  │  ← 145 rules, deterministic
│  │  │  │  │  ┌───────────────────────┐  │  │  │  │  │
│  │  │  │  │  │ LAYER 1: DETECTION   │  │  │  │  │  │  ← 12 validation checks
│  │  │  │  │  │                       │  │  │  │  │  │
│  │  │  │  │  │      🤖 LLM CORE     │  │  │  │  │  │  ← Claude Sonnet 4.6
│  │  │  │  │  │   (recommends only)   │  │  │  │  │  │
│  │  │  │  │  │                       │  │  │  │  │  │
│  │  │  │  │  └───────────────────────┘  │  │  │  │  │
│  │  │  │  └─────────────────────────────┘  │  │  │  │
│  │  │  └───────────────────────────────────┘  │  │  │
│  │  └─────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Key message: The LLM is powerful but constrained. It can recommend anything — but deterministic Python code validates, reclassifies, and enforces before any output reaches the user.


KILLER NUMBERS FOR THE SLIDE

MetricValue
Policy rules enforced145 (5 AT + 10 CR + 8 GR + 8 ER + 5 restricted suppliers + 109 derived)
Validation checks per request12
Escalation paths8, each category-specific
Output files per request10 = complete audit trail
LLM decisions without policy citation0

CODE-BACKED IMPLEMENTATION REFERENCES

| Component | File | Lines | What it does |
| --- | --- | --- | --- |
| Reclassification | agent.py | 30–92 | _reclassify_track() — deterministic tier override based on budget thresholds |
| Approval thresholds | agent.py | 26–27 | AT_001_CEILING = 24_999, AT_003_CEILING = 499_999 — hard-coded, unhallucinatable |
| Short-circuit | agent.py | 1756–1838 | 2+ CRITICAL + null budget/quantity → skip supplier search entirely |
| Detection engine | agent.py | 1676–1704 | Stage 2: budget vs cost, MOQ, capacity, deadline, brand loyalty checks |
| Policy evaluation | agent.py | 1706–1736 | Stage 3: data residency, ESG, restricted supplier cross-reference |
| Escalation rules | agent.py | 1959–1972 | ER-001→ER-008 with category-specific routing |
| Audit trail | agent.py | 2005–2034 | 10-file output structure with policy citations per step |
| Trace logging | agent.py | 1557–1560 | trace.jsonl — every decision timestamped for forensic audit |
| System prompt | workspace/CLAUDE.md | Full file | 145 procurement rules, 3-track system, escalation routing |

SLIDE ONE-LINER (say out loud)

“We didn’t build an AI that replaces procurement. We built an AI that thinks like procurement — with 145 rules it can’t break, 8 escalation paths it must follow, and an audit trail that proves every decision.”


BONUS: CONTRADICTION HANDLING EXAMPLES (for Q&A depth)

Example 1 — Budget vs. Reality (REQ-c9f40c, Smartphones)

Example 2 — Insufficient Supplier Pool (REQ-000001, Consulting)

Example 3 — Past Deadline (REQ-000042, Cloud Compute)

Example 4 — Preferred Supplier Can’t Serve Region (REQ-000002, Cloud)

“When a request contradicts itself — budget too low, deadline passed, preferred supplier restricted — SourcAI doesn’t crash. It generates resolution options with trade-off analysis. That’s what a senior procurement specialist does. We automated that judgment.”