SourcAI — Guardrails × Judging Criteria

Team starthack-2026 · v1 · by Andre · Mar 23, 2026 · Tags: hackathon, pitch, strategy

SourcAI — Guardrails × Judging Criteria: Pitch Reference for Freddy

Purpose: This document maps EVERY judging criterion to specific guardrails, features, and real examples in our agent. Use it to build pitch slides and answer judge questions with precision. Key argument: our 5-minute processing time is NOT a weakness; the audit-trail depth it buys is exactly what makes this production-ready rather than a toy demo.


THE 5 CRITERIA AND HOW WE WIN EACH ONE

1. FEASIBILITY (25%) — “With a bit more work, would this become production?”

Why we win: We built what Chain IQ would actually deploy. Not a ChatGPT wrapper. A procurement-specific agent with 145 policy rules, 8 escalation triggers, and deterministic guardrails that enforce compliance WITHOUT human supervision.

Specific proof points:

| Feature | What it does | Why it's production-ready |
| --- | --- | --- |
| Deterministic track classification | Budget thresholds (EUR 0–24K → Marketplace, 25K–500K → Technical, 500K+ → Strategic) classify EVERY request into the correct autonomy tier | No LLM hallucination can override a budget threshold. Hard math, not vibes. |
| 5-tier approval enforcement | AT-001 through AT-005: from Business-only (< EUR 25K) to CPO approval (> EUR 5M) with mandatory quote minimums (1/2/3) | Matches real procurement governance. A judge from KPMG (Dinkar Gupta) works with these exact structures. |
| Short-circuit logic | If budget AND quantity are both null → skip supplier search entirely, escalate immediately with draft clarification message | Doesn't waste compute or generate misleading recommendations on garbage input. "Garbage in, garbage out" — mentor quote. |
| Schema normalization | Server-side validation corrects 15+ LLM output variants (budget_amount→budget_eur, tail_spend→marketplace, step_N keys→standard schema) | LLMs are inconsistent. Production systems need deterministic post-processing. We handle it. |
| Clarification workflow | When ER-001 fires, agent writes a draft message, archives previous output as v{N}, re-processes with new info, removes resolved issues | This is an iterative system, not a one-shot prompt. Real procurement requires back-and-forth. |
| Railway auto-deploy | Push to main → both frontend + backend deploy automatically | CI/CD from day 1. Not "works on my laptop." |
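The deterministic classification can be shown in a few lines. This is an illustrative sketch, not the actual agent.py code: the function name and error handling are ours, but the tier boundaries (24,999 / 499,999 EUR) come from the approval-tier ceilings cited elsewhere in this doc.

```python
def classify_track(budget_eur):
    """Deterministic autonomy-tier classification by budget threshold.
    Sketch only; real cutoffs live in policies.json."""
    if budget_eur is None:
        # Missing budget is an ER-001 case: escalate, never guess a tier.
        raise ValueError("missing budget -> escalate via ER-001")
    if budget_eur <= 24_999:      # EUR 0-24,999: fully autonomous
        return "marketplace"
    if budget_eur <= 499_999:     # EUR 25K-500K: agent + approval
        return "technical"
    return "strategic"            # EUR 500K+: agent assists
```

Because the tier is pure arithmetic on the budget, no LLM output can move a request into a more autonomous tier than its value allows.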

Pitch line: “Every guardrail in SourcAI exists because Chain IQ’s 145 procurement policies demand it. We didn’t invent rules — we encoded yours.”


2. ROBUSTNESS & ESCALATION LOGIC (25%) — “Handle contradictions, rule violations, uncertainty”

Why we win: This is our strongest criterion. We have 12 validation checks, 8 escalation rules, 10 category rules, 8 geography rules, and 5 restricted supplier checks — all enforced consistently.

The 8 Escalation Rules (from policies.json):

| Rule | Trigger | Escalate To | Blocking? |
| --- | --- | --- | --- |
| ER-001 | Missing required info (budget, quantity, spec) | Requester Clarification | YES — includes draft message |
| ER-002 | Preferred supplier is restricted | Procurement Manager | Advisory |
| ER-003 | Value exceeds approval threshold | Head of Strategic Sourcing | YES |
| ER-004 | No compliant supplier found | Head of Category | YES |
| ER-005 | Data residency constraint unsatisfied | Security & Compliance | YES |
| ER-006 | Quantity exceeds supplier capacity | Sourcing Excellence Lead | YES |
| ER-007 | Brand safety review needed (Marketing) | Marketing Governance Lead | YES |
| ER-008 | Supplier not registered in delivery country | Regional Compliance Lead | YES |

Escalation routing is CATEGORY-SPECIFIC — IT escalations go to IT lead, facilities to facilities lead. Never generic. Mentors explicitly validated this.
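Category-specific routing can be sketched as a lookup keyed by (rule, category) with a category-independent fallback. The data structure and role strings below are hypothetical; the real rules live in policies.json.

```python
# Hypothetical routing table; real routes come from policies.json (ER-001..ER-008).
ESCALATION_ROUTES = {
    ("ER-004", "IT"):         "Head of Category (IT)",
    ("ER-004", "Facilities"): "Head of Category (Facilities)",
    ("ER-005", None):         "Security & Compliance",  # category-independent rule
}

def route_escalation(rule_id, category):
    """Resolve an escalation target: category-specific route first,
    then the rule's generic route, then a last-resort default."""
    return (ESCALATION_ROUTES.get((rule_id, category))
            or ESCALATION_ROUTES.get((rule_id, None))
            or "Procurement Manager")
```

The point of the two-step lookup is that "never generic" is the default behavior: a generic route is only used when no category-specific one exists.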

Real contradiction handling (from actual agent outputs):

  1. Budget vs. reality contradiction (REQ-20260319-c9f40c — Smartphones):

    • Requester asked for 30 smartphones at EUR 22,800
    • Agent discovered minimum cost is EUR 26,481 (16.1% shortfall)
    • Instead of failing, generated 4 resolution paths:
      • A: Reduce to 25 units (stays under AT-001, Business-only approval)
      • B: Reduce to 22 Apple units (stays under AT-001, higher quality)
      • C: Revise budget to EUR 26,481 (triggers AT-002, needs Procurement co-approval)
      • D: Revise budget to EUR 30,030 for Apple (highest quality, triggers AT-002)
    • This is what procurement professionals actually do. They don’t say “error”; they say “here are your options.”
  2. Insufficient supplier pool (REQ-000001 — Consulting):

    • AT-003 requires 3 quotes, but only 2 suppliers serve Spain for IT PM Services
    • Agent didn’t fake a third supplier — flagged deviation, escalated to Head of Category
    • Documented WHY (Deloitte: no Spain coverage; Infosys: no EU coverage)
  3. Past deadline (REQ-000042 — Cloud Compute):

    • Deadline was 2026-03-15, request processed 2026-03-19 (4 days late)
    • Agent didn’t ignore it — flagged as HIGH severity, calculated earliest fulfillment (AWS: 2026-04-01, 17 days late)
    • Suggested incumbent OVHcloud as bridge solution for immediate capacity
  4. Preferred supplier can’t serve region (REQ-000002 — Cloud Compute):

    • Requester wanted Azure Enterprise; Swiss Sovereign Cloud (incumbent) doesn’t serve Netherlands
    • Agent excluded incumbent with documented reason, ranked Azure #4 but recommended it anyway with explicit deviation documentation

Binary vs. ranking criteria: The agent distinguishes hard gates (ISO certification required = knockout) from weighted scoring (price, quality, risk/ESG = continuum). Mentors confirmed this is exactly right.
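The gate-vs-continuum distinction fits in a few lines. Field names and weights here are assumed for illustration, not the agent's actual schema: a missing mandatory certification is a knockout, while price, quality, and risk/ESG are blended on a weighted continuum.

```python
def score_supplier(supplier, weights):
    """Hard gates knock a supplier out (return None); soft criteria are
    weighted. Sketch with assumed field names, not the real scoring code."""
    # Binary gate: mandatory certification is pass/fail, never a penalty.
    if not supplier.get("iso_certified"):
        return None
    # Weighted continuum for survivors (weights assumed to sum to 1.0).
    return (weights["price"]    * supplier["price_score"]
            + weights["quality"]  * supplier["quality_score"]
            + weights["risk_esg"] * supplier["risk_esg_score"])
```

With assumed weights of 0.5 / 0.3 / 0.2, a supplier lacking ISO certification returns None no matter how cheap it is, which is exactly the "knockout, not discount" behavior the mentors confirmed.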

Pitch line: “When a request contradicts itself — budget too low, deadline passed, preferred supplier restricted — SourcAI doesn’t crash. It generates resolution options with trade-off analysis. That’s what a senior procurement specialist does. We automated that judgment.”


3. CREATIVITY (20%) — “Copy paste from what OpenAI does today will fail”

Why we win: We made 5 architectural choices that no other team will have:

  1. File-based workspace (not a database)

    • Each request gets its own directory: REQ-{id}/input.md, output.json, status.json, trace.jsonl
    • The agent reads/writes files like a human analyst working through a folder
    • Tim Paridaens (CTO) validated: “Mono-agent with file-based workspace = correct. Multi-agent orchestration creates knowledge management problems.”
    • Why creative: Every other team will use a database or chat memory. We use the filesystem as the knowledge graph.
  2. Deterministic guardrails wrapping an LLM core

    • The LLM handles understanding messy text and generating recommendations
    • But thresholds, track classification, and policy enforcement are DETERMINISTIC Python code
    • _reclassify_track() can override the LLM’s classification based on hard budget math
    • Why creative: We don’t trust the AI blindly. We cage it in procurement rules.
  3. 3-track system with configurable autonomy

    • Marketplace (fully autonomous) → Technical (agent + approval) → Strategic (agent assists)
    • 80% of requests are tail spend (Marketplace) — these can be processed WITHOUT any human
    • 20% are high-value — these get full audit trail with human approval gates
    • Why creative: We don’t try to automate everything. We automate what SHOULD be automated and escalate what shouldn’t.
  4. Historical concentration detection

    • Agent analyzes 590 historical awards to detect single-brand loyalty patterns
    • Example: Apple has 100% of smartphone awards (9/9). Agent flags this as audit risk and recommends Samsung for competitive benchmarking.
    • Why creative: This catches procurement bias that humans miss because they’re too busy.
  5. Savings framing as Chain IQ revenue

    • Every recommendation includes savings_vs_most_expensive — this IS Chain IQ’s revenue
    • Agent explicitly writes: “This is the documented savings Chain IQ should record as value delivered”
    • Why creative: We didn’t just build a tool. We built something that directly feeds Chain IQ’s business model.
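The deterministic override in choice 2 above can be sketched as follows. This is a hedged sketch of the _reclassify_track() idea, not the actual agent.py implementation: only the thresholds and the "suppliers cost more than the stated budget" scenario are taken from this doc.

```python
def reclassify_track(llm_track, budget_eur, min_supplier_cost):
    """Recompute the tier from hard budget math; whatever the LLM
    suggested is simply overwritten when it disagrees."""
    # The governing amount is the larger of stated budget and real minimum cost,
    # so a EUR 24K request whose cheapest compliant quote is EUR 26K is
    # governed by the 26K figure.
    effective = max(budget_eur or 0, min_supplier_cost or 0)
    if effective <= 24_999:
        return "marketplace"
    if effective <= 499_999:
        return "technical"
    return "strategic"
```

The llm_track argument is deliberately ignored in the output: the LLM's classification is advisory input, and the deterministic result always wins.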

Head of AI quote to reference: “The superpower is not the model. The superpower is what happens when capable, reasonable systems are grounded in structured data.” — Daniel Ringsma, Head of AI

Pitch line: “We didn’t build an AI that replaces procurement. We built an AI that thinks like procurement — with 145 rules it can’t break, 8 escalation paths it must follow, and an audit trail that proves every decision.”


4. REACHABILITY (20%) — “If we change the datasets, would it survive?”

Why we win: Our agent is dataset-agnostic. The pipeline reads from CSV/JSON files. Change the suppliers, change the policies, change the categories — the agent adapts.

Proof of transferability:

| Component | Current dataset | What changes for a new client |
| --- | --- | --- |
| Suppliers | 40 suppliers, 151 rows | Swap suppliers.csv with client's vendor master |
| Pricing | 599 tiers | Swap pricing.csv with client's rate cards |
| Policies | 5 AT, 10 CR, 8 GR, 8 ER | Swap policies.json with client's procurement policy |
| Categories | 30 categories in 4 L1 groups | Swap categories.csv with client's taxonomy |
| Historical | 590 awards | Swap historical_awards.csv for concentration analysis |

The agent reads data files at runtime. Nothing is hardcoded. The system prompt in workspace/CLAUDE.md references data generically — “read suppliers.csv”, not “look for Dell in row 12.”

Multi-language support: Agent detects request language (EN, DE, FR, ES, PT, JA) and generates buyer reports in that language. Chain IQ operates in 49 countries.

Multi-currency: Pricing handles EUR, USD, CHF, GBP natively (from pricing.csv). Conversion is handled per-request.

Geography rules (8 regions): CH, DE, FR, ES, Americas, APAC, MEA, LATAM — each with specific compliance requirements (GDPR, LGPD, POPIA, MAS, etc.)

Real proof: We process 304 different requests across IT (laptops, cloud, smartphones), Facilities (furniture), Professional Services (consulting, cybersecurity), and Marketing (SEM, influencer) — all with the same pipeline.

Pitch line: “Give us a new client’s supplier list and policy handbook on Monday. By Tuesday, SourcAI is processing their requests. No retraining. No fine-tuning. Just swap the data files.”


5. VISUAL DESIGN (10%) — “Clarity of comparison view and decision explanation”

What we have:

Mentor feedback applied:


THE 5-MINUTE ARGUMENT: WHY SPEED ≠ QUALITY

The question judges will ask:

“Your agent takes 5 minutes. Can’t you make it faster?”

The answer (for Freddy):

“Yes, we could make it faster. And we’d be worse.”

Here’s why:

What happens in those 5 minutes:

| Step | Time | What the agent does |
| --- | --- | --- |
| 1. Extract | ~30s | Parse messy text → structured specs, classify track, identify unknowns |
| 2. Detect Issues | ~45s | Run 12 validation checks against 3 data sources, flag contradictions |
| 3. Evaluate Rules | ~45s | Check 145 policies (5 AT + 10 CR + 8 GR + 8 ER + 5 restricted suppliers), determine approval chain |
| 4. Search Suppliers | ~30s | Filter 40 suppliers by region, category, capacity, restrictions → shortlist |
| 5. Rank & Score | ~60s | Ratio-normalized pricing + quality/risk/ESG weighted scoring + concentration analysis from 590 historical awards |
| 6. Reasoning | ~45s | Generate recommendation with deviation documentation, savings analysis, prior art comparison |
| 7. Escalation | ~30s | Route to correct person by category, generate draft clarification messages if needed |

Total: ~5 minutes of actual procurement analysis.

A human does this in 2 HOURS (mentor-confirmed average). We’re already 24x faster.

The 1-2 minute alternative would require cutting:

The result of cutting would be a “fast” system that:

In procurement, a wrong decision costs $100K+. A slow decision costs 2 hours. The math is obvious.

Three killer lines for judges:

  1. “A 1-minute agent is a search engine. A 5-minute agent is a procurement analyst. Chain IQ doesn’t need faster Google — they need fewer humans making $100K mistakes.”

  2. “Under the EU AI Act, procurement is a high-risk sector. Every automated decision requires an audit trail. Our 5 minutes generates that trail. A 1-minute system can’t — and would be non-compliant in production.”

  3. “The 5 minutes saves 2 HOURS of human work per request. At 6,000 requests/month, that’s 60 FTEs — $7M-$15M/year in savings. Nobody is asking those 60 people to work faster. They’re asking us to replace them.”

If pressed further:

“And we ARE optimizing. Marketplace-tier requests (80% of volume) will process in under 2 minutes because they skip the full analysis. The 5 minutes is for Technical and Strategic requests that NEED the depth. The system is smart enough to know the difference.”


NUMBERS CHEAT SHEET (All Mentor-Validated)

| Metric | Value | Source |
| --- | --- | --- |
| Requests per month (per client) | ~6,000 | Mentor confirmed |
| % Automatable (tail spend) | 80% | Pareto rule confirmed |
| Avg time per request (manual) | ~2 hours | Mentor confirmed |
| Avg time per request (SourcAI, tail spend) | <5 min | Our system |
| Avg time per request (SourcAI, technical) | <30 min | Including human approval |
| Total human hours saved/month | 9,600h = 60 FTEs | Chain IQ validated |
| Pickup SLA improvement | 24h → <5 min | Dramatic |
| Cost per request (manual) | $100–$217 | Mentor confirmed |
| Cost per request (SourcAI) | $1.35 | Our calculation |
| Cost reduction | 98.4% | Math |
| Annual savings potential | $7M–$15M/year | At 4,800 automatable req/month |
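The 9,600h / 60 FTE line follows from simple arithmetic on the numbers above. The 160 FTE-hours-per-month divisor is our assumption for the worked example, not a mentor-confirmed figure.

```python
# Worked arithmetic behind the cheat sheet.
requests_per_month = 6_000   # mentor-confirmed volume per client
automatable_share  = 0.80    # tail-spend share (Pareto)
hours_per_request  = 2       # manual baseline, mentor-confirmed
fte_hours_month    = 160     # ASSUMED: ~160 working hours per FTE per month

hours_saved = requests_per_month * automatable_share * hours_per_request
ftes_saved  = hours_saved / fte_hours_month

print(hours_saved, ftes_saved)  # 9600.0 60.0
```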

JUDGE-SPECIFIC HOOKS

| Judge | Their lens | Our hook |
| --- | --- | --- |
| Alexander Finger (CTO SAP) | Enterprise AI agents, SAP integration | "SourcAI's policy engine is a procurement rules microservice. It integrates with any ERP — SAP Ariba, Coupa, Oracle — via the same CSV/JSON interface." |
| Daniel Dippold (EWOR) | Founder energy, outlier potential | "We're a team from Peru competing against ETH/EPFL teams. We have 150K+ lines of production AI code shipped. This isn't our first agent — it's our best." |
| Dinkar Gupta (CTO KPMG) | Procurement clients, enterprise DevOps | "You work with procurement clients. You know the pain of tail spend. SourcAI automates the 80% nobody wants to touch — and creates the audit trail your compliance teams demand." |
| Guido Salvaneschi (Prof HSG) | Software correctness, cybersecurity | "Every LLM decision is caged in deterministic guardrails. Budget thresholds can't be hallucinated. Restricted suppliers can't be ignored. The AI recommends; the rules enforce." |
| Daniel Naeff (ETH AI Center) | Research → commercialization, unit economics | "Unit economics: $1.35 per request vs $100–$217 manual. At scale, this is a 98.4% cost reduction. The AI procurement market grows 28% CAGR to $22.6B by 2033." |

MENTOR QUOTES TO USE IN PITCH

“Traceability is one of the most important things. That’s where things get lost.” — Chain IQ Mentor

“The superpower is not the model. The superpower is what happens when capable, reasonable systems are grounded in structured data.” — Daniel Ringsma, Head of AI, Chain IQ

“If they don’t speak about the disintegration… then I know they haven’t understood what it needs to do.” — Tim Paridaens, CTO, Chain IQ

“The house doesn’t burn down because it’s reckless. It burns down because it’s obedient.” — On AI blindly following policies without common sense (our agent handles this with conditional restrictions, not global bans)


WHAT COULD LOSE US POINTS (AND HOW TO ADDRESS)

| Risk | Mitigation |
| --- | --- |
| "5 minutes is too slow" | See full argument above. 24x faster than a human. Marketplace tier is <2 min. |
| "Only works with this dataset" | Swap CSVs → new client. No retraining. 304 requests across 4 categories prove it. |
| "How do you handle hallucinations?" | Deterministic guardrails override the LLM. Budget thresholds are math, not AI. Schema normalization catches 15+ output variants. |
| "What about data privacy?" | Agent processes locally. No data leaves the infrastructure. File-based workspace = no shared database. EU AI Act audit trail built in. |
| "How is this different from just using ChatGPT?" | ChatGPT has no policy engine, no escalation routing, no historical concentration analysis, no approval thresholds, no restricted supplier enforcement. We enforce 145 rules; ChatGPT enforces zero. |
| "What's your roadmap to production?" | Marketplace auto-processing (no human) → Technical with approval gates → Strategic with configurable weights → ERP integration (SAP Ariba, Coupa) → Multi-tenant SaaS |

SUMMARY: THE PITCH IN 30 SECONDS

“SourcAI is an autonomous sourcing agent that transforms messy purchase requests into audit-ready supplier comparisons. It enforces 145 procurement policies, routes escalations to the right person, and generates a complete audit trail — all in under 5 minutes. A human takes 2 hours and costs $100-$217 per request. SourcAI costs $1.35. At scale, that’s 60 FTEs and $7-15M per year in savings. And because every decision is traceable and policy-compliant, it’s production-ready — not just a demo.”



FREDDY’S PITCH SLIDE ADDITIONS — Guardrail Architecture

Added by Freddy during pitch prep (H28+). Maps the 6 challenge questions to real implemented layers with code-backed proof.


THE REAL CHALLENGE — Chain IQ’s 6 Questions

Can your system…

  1. Detect contradictions?
  2. Enforce hard policy constraints?
  3. Handle restricted suppliers?
  4. Refuse when risk is too high?
  5. Trigger approval workflows?
  6. Provide traceable decision logic?

Answer: Yes. All 6. With deterministic guardrails, not LLM promises.


THE 6 LAYERS — Challenge → Layer → Proof

| # | Challenge Question | Our Layer | What It Actually Does | Real Proof |
| --- | --- | --- | --- | --- |
| 1 | Detect contradictions? | 12-Point Validation Engine | Budget vs. real cost mismatch, MOQ violations, capacity gaps, deadline conflicts, 30% mis-categorization catch | REQ-c9f40c: budget €22.8K but min cost €26.5K → generated 4 resolution paths instead of failing |
| 2 | Enforce hard policy constraints? | Deterministic Reclassification (_reclassify_track()) | 5 approval tiers (AT-001→AT-005), 10 category rules, 8 geography rules — hard math overrides LLM | Budget says €24K (Marketplace) but suppliers cost €26K → auto-upgrades to Technical tier. No LLM can override. |
| 3 | Handle restricted suppliers? | Conditional Restriction Engine | 5 suppliers with scoped restrictions (country + category + value). Not global bans — contextual. | Computacenter: restricted for Laptops in CH/DE only. AWS Cloud Storage: restricted in CH (sovereignty). |
| 4 | Refuse when risk is too high? | Short-Circuit Logic | 2+ CRITICAL issues + missing budget/quantity → skips entire supplier search, escalates immediately | Doesn't waste compute on garbage input. Generates draft clarification message for requester. |
| 5 | Trigger approval workflows? | 8 Escalation Rules (ER-001→ER-008) | Category-specific routing (IT→IT lead, Facilities→Facilities lead). ER-001 generates draft messages. Blocking vs advisory. | No compliant supplier? → Head of Category. Data residency fail? → Security & Compliance. Never generic. |
| 6 | Traceable decision logic? | 10-File Audit Trail per Request | extracted → issues → compliance → suppliers → comparison → reasoning → escalation → audit_trail → recommendation → status | Every step: WHAT decided, WHY, WHICH POLICY (AT-001, CR-003…), CONFIDENCE level, UNCERTAINTIES flagged |
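The refusal condition in layer 4 is a one-line predicate. This sketch uses assumed issue-record and parameter names; only the "2+ CRITICAL issues plus missing budget and quantity" trigger comes from the doc.

```python
def should_short_circuit(issues, budget_eur, quantity):
    """Refuse-and-escalate check: with two or more CRITICAL issues and
    both budget and quantity missing, skip the supplier search entirely.
    (Illustrative sketch; field names are assumptions.)"""
    critical = sum(1 for issue in issues if issue.get("severity") == "CRITICAL")
    return critical >= 2 and budget_eur is None and quantity is None
```

When the predicate fires, the agent's next action is drafting a clarification message for the requester, not producing a recommendation from garbage input.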

LAYERED DEFENSE DIAGRAM — “The LLM is Caged”

This is the key visual for the pitch slide. The LLM sits at the center, surrounded by 6 concentric layers of deterministic guardrails. The AI recommends; the rules enforce.

┌─────────────────────────────────────────────────────┐
│            LAYER 6: AUDIT TRAIL                     │  ← Every decision logged to 10 files
│  ┌───────────────────────────────────────────────┐  │
│  │         LAYER 5: ESCALATION ROUTING           │  │  ← 8 rules, category-specific
│  │  ┌─────────────────────────────────────────┐  │  │
│  │  │      LAYER 4: SHORT-CIRCUIT REFUSAL     │  │  │  ← Refuse when risk too high
│  │  │  ┌───────────────────────────────────┐  │  │  │
│  │  │  │  LAYER 3: RESTRICTED SUPPLIERS    │  │  │  │  ← Contextual, not global bans
│  │  │  │  ┌─────────────────────────────┐  │  │  │  │
│  │  │  │  │  LAYER 2: POLICY ENGINE     │  │  │  │  │  ← 145 rules, deterministic
│  │  │  │  │  ┌───────────────────────┐  │  │  │  │  │
│  │  │  │  │  │ LAYER 1: DETECTION   │  │  │  │  │  │  ← 12 validation checks
│  │  │  │  │  │                       │  │  │  │  │  │
│  │  │  │  │  │      🤖 LLM CORE     │  │  │  │  │  │  ← Claude Sonnet 4.6
│  │  │  │  │  │   (recommends only)   │  │  │  │  │  │
│  │  │  │  │  │                       │  │  │  │  │  │
│  │  │  │  │  └───────────────────────┘  │  │  │  │  │
│  │  │  │  └─────────────────────────────┘  │  │  │  │
│  │  │  └───────────────────────────────────┘  │  │  │
│  │  └─────────────────────────────────────────┘  │  │
│  └───────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────┘

Key message: The LLM is powerful but constrained. It can recommend anything — but deterministic Python code validates, reclassifies, and enforces before any output reaches the user.


KILLER NUMBERS FOR THE SLIDE

MetricValue
Policy rules enforced145 (5 AT + 10 CR + 8 GR + 8 ER + 5 restricted suppliers + 109 derived)
Validation checks per request12
Escalation paths8, each category-specific
Output files per request10 = complete audit trail
LLM decisions without policy citation0

CODE-BACKED IMPLEMENTATION REFERENCES

| Component | File | Lines | What it does |
| --- | --- | --- | --- |
| Reclassification | agent.py | 30–92 | _reclassify_track() — deterministic tier override based on budget thresholds |
| Approval thresholds | agent.py | 26–27 | AT_001_CEILING = 24_999, AT_003_CEILING = 499_999 — hard-coded, unhallucinatable |
| Short-circuit | agent.py | 1756–1838 | 2+ CRITICAL + null budget/quantity → skip supplier search entirely |
| Detection engine | agent.py | 1676–1704 | Stage 2: budget vs cost, MOQ, capacity, deadline, brand loyalty checks |
| Policy evaluation | agent.py | 1706–1736 | Stage 3: data residency, ESG, restricted supplier cross-reference |
| Escalation rules | agent.py | 1959–1972 | ER-001→ER-008 with category-specific routing |
| Audit trail | agent.py | 2005–2034 | 10-file output structure with policy citations per step |
| Trace logging | agent.py | 1557–1560 | trace.jsonl — every decision timestamped for forensic audit |
| System prompt | workspace/CLAUDE.md | Full file | 145 procurement rules, 3-track system, escalation routing |

SLIDE ONE-LINER (say out loud)

“We didn’t build an AI that replaces procurement. We built an AI that thinks like procurement — with 145 rules it can’t break, 8 escalation paths it must follow, and an audit trail that proves every decision.”


BONUS: CONTRADICTION HANDLING EXAMPLES (for Q&A depth)

Example 1 — Budget vs. Reality (REQ-c9f40c, Smartphones)

Example 2 — Insufficient Supplier Pool (REQ-000001, Consulting)

Example 3 — Past Deadline (REQ-000042, Cloud Compute)

Example 4 — Preferred Supplier Can’t Serve Region (REQ-000002, Cloud)

“When a request contradicts itself — budget too low, deadline passed, preferred supplier restricted — SourcAI doesn’t crash. It generates resolution options with trade-off analysis. That’s what a senior procurement specialist does. We automated that judgment.”