The numbers, including the weak ones
Receipts. Real test results, on the record.
Most AI receptionist vendors quote a glossy pass rate without showing the test set. We publish ours: 332 plumber scenarios graded by a separate model, the full rubric, the per-class breakdowns, the cost per call, and the things we deliberately do not claim. Updated as each new trade passes its alpha gate.
Plumber alpha gate
Run on gpt-4o (the realtime production model gets the same prompt). 84 routine scenarios at 3 random seeds each (252 runs), plus 80 adversarial scenarios, for 332 graded scenarios. Each scenario is graded against 4 to 6 rubric rules by an independent LLM judge. 1,564 individual rule grades total. Cost to run: under 17 USD.
- 332: scenarios graded
- 93.9%: aggregate pass rate
- 100%: A grade on adversarial pack (80 scenarios)
- 97%: A grade on emergency class
- 0: P0 privacy or safety violations
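The scenario and rule-grade counts for the plumber gate can be cross-checked with a few lines of arithmetic. This is a minimal sketch, assuming each routine scenario runs once per seed and each adversarial scenario runs once (the only reading consistent with the published total of 332). The function and constant names are illustrative, not taken from the eval harness.

```python
# Counts taken from the plumber alpha gate description.
ROUTINE_SCENARIOS = 84
SEEDS = 3
ADVERSARIAL_SCENARIOS = 80
TOTAL_RULE_GRADES = 1564


def total_scenarios(routine: int, seeds: int, adversarial: int) -> int:
    """Routine scenarios run once per seed; adversarial ones run once (assumed)."""
    return routine * seeds + adversarial


graded = total_scenarios(ROUTINE_SCENARIOS, SEEDS, ADVERSARIAL_SCENARIOS)
mean_rules_per_scenario = TOTAL_RULE_GRADES / graded

print(graded)                             # 332
print(round(mean_rules_per_scenario, 2))  # 4.71
```

The mean of about 4.7 rule grades per scenario falls inside the stated range of 4 to 6 rules per scenario.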
Status: Approved for first paying customer pilot on 2026-05-04. Deployed to Fly.io safigo-reception (sjc, performance-1x at 2 GB) on 2026-05-05.
Honest weakness: the hostile-tone class on the adversarial pack scored 21 percent A grade. Real-customer calls skew calm, and the safety-critical and privacy classes are at zero violations, so we shipped anyway. Improvements ship in subsequent gates. We would rather disclose the gap than hide it.
Other trades, smoke evals so far
Smoke evals run 40 scenarios on gpt-4o. They are a fast triage signal, not an alpha gate. None of these trades is approved for paying customers yet. Each will run a full alpha gate (40+ rubric rules across 200+ scenarios, with an adversarial pack) before deployment.
| Trade | Scenarios | A grade | B grade | C grade | D grade | A or B rate (or pass rate) | Status |
|---|---|---|---|---|---|---|---|
| Plumber | 332 | — | — | — | — | 93.9% pass | Live in production |
| HVAC | 40 | 28 | 6 | 6 | 0 | 85.0% | Smoke. Alpha pending. |
| Roofer | 40 | 25 | 6 | 9 | 0 | 77.5% | Smoke. Alpha pending. |
| Electrician | 40 | — | — | — | — | 177 of 235 rule grades pass | Smoke. Alpha pending. |
Plumber numbers are intentionally reported as a pass rate rather than a grade distribution, because the alpha rubric grades by rule rather than by overall scenario letter. HVAC and roofer show a grade distribution because the smoke harness emits a per-scenario letter; the electrician row reports rule grades only. Both formats are in data.json.
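The two reporting formats in the table reduce to two small formulas. The sketch below recomputes the smoke-eval percentages from the counts in the table. The function names are illustrative, and the layout of data.json is not assumed here.

```python
def a_or_b_rate(a: int, b: int, c: int, d: int) -> float:
    """Percent of scenarios graded A or B, from per-scenario letter counts."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("no graded scenarios")
    return 100.0 * (a + b) / total


def rule_pass_rate(passed: int, total: int) -> float:
    """Percent of individual rule grades that passed."""
    if total == 0:
        raise ValueError("no rule grades")
    return 100.0 * passed / total


# Letter counts (A, B, C, D) copied from the table.
smoke = {"HVAC": (28, 6, 6, 0), "Roofer": (25, 6, 9, 0)}
for trade, grades in smoke.items():
    print(trade, round(a_or_b_rate(*grades), 1))  # HVAC 85.0, Roofer 77.5

# Electrician is reported as rule grades rather than letters.
print("Electrician", round(rule_pass_rate(177, 235), 1))  # Electrician 75.3
```

The two rates are not directly comparable: one counts whole scenarios, the other counts individual rules within scenarios.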
Cost per call, measured
Measured per wall-clock minute across the full stack: OpenAI Realtime API (gpt-realtime), LiveKit, and Twilio PSTN. Tracked across multiple production calls.
- $0.23: USD per minute of wall-clock
- ~$0.69: all-in cost on a typical 3-minute call
- ~75%: margin on Plug and Play tier ($500 CAD/mo, 400 min included)
- ~68%: margin on Built for you Multi tier (mixed-language)
Mode B plumber observed range across mixed-language scenarios: 0.15 to 0.51 USD per call. Customer pricing is flat 500 CAD per month for 400 included minutes, plus 1.00 USD per minute of overage. Heavy callers pay nothing extra until they exceed the included minutes; the blended distribution across customers is what makes the unit economics work.
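The per-call cost and the flat-tier margin follow from the per-minute cost and the plan price. The sketch below is illustrative only. It assumes a customer uses all 400 included minutes, that the measured 0.23 USD per minute is the only cost, and a hypothetical exchange rate of 0.73 USD per CAD. None of those assumptions is stated in the published figures.

```python
COST_PER_MINUTE_USD = 0.23     # measured wall-clock cost, from this page
PLAN_PRICE_CAD = 500.0         # flat monthly price, from this page
INCLUDED_MINUTES = 400         # from this page
OVERAGE_USD_PER_MINUTE = 1.00  # from this page
ASSUMED_USD_PER_CAD = 0.73     # ASSUMPTION: hypothetical exchange rate


def call_cost_usd(minutes: float) -> float:
    """All-in cost of a call of the given wall-clock length."""
    return minutes * COST_PER_MINUTE_USD


def margin_at_full_usage(usd_per_cad: float = ASSUMED_USD_PER_CAD) -> float:
    """Gross margin if a customer uses exactly the included minutes (assumed)."""
    revenue_usd = PLAN_PRICE_CAD * usd_per_cad
    cost_usd = call_cost_usd(INCLUDED_MINUTES)
    return (revenue_usd - cost_usd) / revenue_usd


def overage_charge_usd(minutes_used: float) -> float:
    """Overage billed on top of the flat fee; zero within the included minutes."""
    return max(0.0, minutes_used - INCLUDED_MINUTES) * OVERAGE_USD_PER_MINUTE


print(round(call_cost_usd(3), 2))        # 0.69
print(round(margin_at_full_usage(), 2))  # roughly 0.75 under the assumed rate
print(overage_charge_usd(450))           # 50.0
```

Under these assumptions the margin lands near the published ~75 percent. A different exchange rate or usage level would move it.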
Real PSTN call audio
Four production-grade calls with full transcripts are published on the plumber product page. Listen on safigo.ai/reception/plumber/. Each recording is marked up with schema.org AudioObject so AI search engines can find and quote it. Scenarios covered: emergency leak, after-hours triage, out-of-area routing, second opinion.
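As an illustration of what AudioObject markup for one recording could look like, the sketch below builds a JSON-LD object with standard schema.org properties (`name`, `contentUrl`, `encodingFormat`, `transcript`). The URL, file format, and transcript text are placeholders, not the markup actually used on the plumber page.

```python
import json


def audio_object_jsonld(name: str, content_url: str, transcript: str,
                        encoding_format: str = "audio/mpeg") -> dict:
    """Build a schema.org AudioObject as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "AudioObject",
        "name": name,
        "contentUrl": content_url,
        "encodingFormat": encoding_format,
        "transcript": transcript,
    }


example = audio_object_jsonld(
    name="Emergency leak call",
    content_url="https://example.com/audio/emergency-leak.mp3",  # placeholder URL
    transcript="(full call transcript goes here)",               # placeholder text
)

# The serialized object would sit inside a <script type="application/ld+json"> tag.
print(json.dumps(example, indent=2))
```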
What we deliberately do not claim
- We do not claim 100 percent pass on any trade. Plumber is at 93.9 percent. We will publish the next gate result rather than hide it.
- We do not claim production readiness for HVAC, roofer, or electrician. Smoke evals are a fast triage signal. They are not an alpha gate. No trade is approved for paying customers without an alpha gate run.
- We do not claim Quebec service. See our Quebec carve-out note on /vs/ras-ai/. The multilingual claim covers eleven languages, including French for non-Quebec callers (Acadian, Franco-Ontarian, Franco-Manitoban, Franco-Albertan).
- We do not claim PHI compliance. Healthcare receptionists are out of scope. See /vs/hello-patients/ for our honest position.
- We do not claim that one model runs everything. Production runs on gpt-realtime. Eval grading uses gpt-4o as a separate judge. Smoke triage uses gpt-4o-mini. We publish which model produced each number.
Methodology and rubric
Every rule is assigned to one of two tiers: P0 (critical) or high-priority. The plumber alpha gate scored zero P0 violations. The full rubric is open: see our methodology page for how we calculate the underlying customer-impact stats, and data.json for the rule list.
P0 (critical) rules: R12 no false booking confirmation, R14 emergency triage handling, R18 no owner-name leak.
High-priority rules: R5 price stated before booking, R7 language offer correctness, R13 one question per turn, R15 diagnostic intake completeness.
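The tiering above maps onto a small data structure. The sketch below records the rules named on this page (the complete list lives in data.json) and summarizes a batch of rule grades into a pass rate and a P0 violation count. The `RuleGrade` type, the `summarize` function, and the sample grades are hypothetical and are not the eval harness itself.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    P0 = "P0 (critical)"
    HIGH = "high-priority"


# Only the rules named on this page; the complete list is in data.json.
RUBRIC = {
    "R12": (Tier.P0, "no false booking confirmation"),
    "R14": (Tier.P0, "emergency triage handling"),
    "R18": (Tier.P0, "no owner-name leak"),
    "R5": (Tier.HIGH, "price stated before booking"),
    "R7": (Tier.HIGH, "language offer correctness"),
    "R13": (Tier.HIGH, "one question per turn"),
    "R15": (Tier.HIGH, "diagnostic intake completeness"),
}


@dataclass(frozen=True)
class RuleGrade:
    scenario_id: str
    rule_id: str
    passed: bool


def summarize(grades: list[RuleGrade]) -> dict:
    """Aggregate rule grades into a pass rate (percent) and a P0 violation count."""
    total = len(grades)
    passed = sum(1 for g in grades if g.passed)
    p0_violations = sum(
        1 for g in grades
        if not g.passed and RUBRIC[g.rule_id][0] is Tier.P0
    )
    pass_rate = 100.0 * passed / total if total else 0.0
    return {"pass_rate": pass_rate, "p0_violations": p0_violations}


# Hypothetical sample grades, for illustration only.
sample = [
    RuleGrade("s1", "R12", True),
    RuleGrade("s1", "R13", False),
    RuleGrade("s2", "R14", True),
    RuleGrade("s2", "R5", True),
]
print(summarize(sample))  # {'pass_rate': 75.0, 'p0_violations': 0}
```

Counting P0 violations separately from the aggregate pass rate keeps a single safety or privacy failure visible even when the overall percentage is high.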
Update log
- 2026-05-05. Page first published. Plumber alpha gate documented. HVAC and roofer gpt-4o smoke evals added. Electrician smoke eval in progress.
- 2026-05-05 morning. Plumber receptionist deployed to Fly.io.
- 2026-05-04 night. Plumber alpha gate run. 332 scenarios, 93.9 percent pass, 100 percent A on adversarial. Approved for paying customer pilot.
Why publish this?
Two reasons. First, every other Canadian AI receptionist vendor will quote a hand-picked pass rate but not the test set, the rubric, the cost, or the failure modes. We publish all of them. Second, AI search engines (Google AI Overviews, ChatGPT, Perplexity, Claude) cite original data more reliably than they cite marketing copy. Publishing the numbers is its own distribution channel.
If you want to compare us to a specific competitor, we wrote head-to-head pages for all twelve of them. If you want to talk to us, the number is below.
Call +1 (604) 800-5638 · hello@safigo.ai
Dataset license: CC BY-SA 4.0. Cite as: Safigo (2026). Safigo Reception test results. https://safigo.ai/receipts/.