Smart routing
Every request to Prism names a model. What happens next depends on what
you put there.
auto, cascade, or an explicit upstream
Set model: "auto" and Prism decides, per request and before forwarding,
which of your upstreams handles it; this applies to streaming responses
too. Set model: "cascade" to walk your upstreams in a strict, explicit
order you control (see "Cascade mode" below). Set model to the name you
gave an upstream when you added it (e.g. "ds" or "deepseek"), exactly as
listed by /v1/models, and Prism sends the request straight to that
upstream, substitutes the provider's real model name, and skips the
routing decision entirely. Use an explicit name when you need a specific
model's behaviour; use auto when you just want the cheapest upstream that
can do the job; use cascade when you want to define the fallback order
yourself.
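All three options are just different values of the same model field in the request body. The following is a minimal sketch using only the Python standard library. It assumes the OpenAI-style chat completions body from the curl example under "Cascade mode", and "ds" is a placeholder for whatever name your own upstream has. The requests are built but not sent.

```python
import json
import urllib.request

# Endpoint and key format taken from the curl example on this page;
# substitute your own key.
PRISM_URL = "https://prism-api.tyo.com.au/v1/chat/completions"
API_KEY = "tyr-your-key-here"


def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request.

    `model` is "auto", "cascade", or an upstream name exactly as listed
    by /v1/models ("ds" below is a placeholder).
    """
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        PRISM_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# One request per routing mode; pass any of them to urllib.request.urlopen to send it.
requests_by_mode = {m: build_chat_request(m, "Hello!") for m in ("auto", "cascade", "ds")}
```

Only the model value changes between the three requests. Everything else about the call stays the same.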
How auto decides
Routing weighs two things: difficulty and cost. Prism looks at the request — its length, complexity signals, and requested capabilities — and estimates how demanding it is, then picks the cheapest upstream in your account that can handle it. A short factual question can go to your cheapest tier; a long, reasoning-heavy prompt gets bumped to a stronger model. You don't have to configure this per request — it's decided from what you send.
To make that call, Prism may issue a tiny (≤5-token) classification request
to your cheapest tier-1 upstream, using your key, to rate the difficulty of
what you sent — this happens only for model: "auto". That means your
prompt text reaches that tier-1 provider even on requests where a different
upstream ends up serving the actual response, and you may see a small
number of extra calls show up on that provider's bill.
Tiers
Each upstream you add gets a tier:
- Tier 1 — cheap. Your default for everyday, low-difficulty requests.
- Tier 2 — mid. A step up in capability and cost, used when tier 1 isn't a good fit.
- Tier 3 — strong. Your most capable (and usually most expensive) upstream, reserved for genuinely hard requests.
You set the tier when adding an upstream (or accept the preset's default).
auto only reaches for a higher tier when the request needs it.
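Here is a hypothetical Python sketch that makes "the cheapest upstream that can handle it" concrete for tiers. It assumes the difficulty estimate has already been reduced to a minimum tier (required_tier) and that each upstream has a single blended price. The Upstream fields and all prices are invented for illustration and do not describe Prism's internals.

```python
from dataclasses import dataclass


@dataclass
class Upstream:
    name: str
    tier: int  # 1 = cheap, 2 = mid, 3 = strong
    cost_per_mtok: float  # hypothetical blended USD price per million tokens


def pick_upstream(upstreams: list[Upstream], required_tier: int) -> Upstream:
    """Return the cheapest upstream whose tier is at least required_tier.

    required_tier is a hypothetical stand-in for the difficulty estimate.
    """
    candidates = [u for u in upstreams if u.tier >= required_tier]
    if not candidates:
        # This sketch simply gives up; no fallback behaviour is assumed.
        raise LookupError(f"no upstream at tier {required_tier} or above")
    return min(candidates, key=lambda u: u.cost_per_mtok)


pool = [
    Upstream("ds", tier=1, cost_per_mtok=0.3),
    Upstream("mid-model", tier=2, cost_per_mtok=2.0),
    Upstream("strong-model", tier=3, cost_per_mtok=10.0),
]
print(pick_upstream(pool, required_tier=1).name)  # ds
print(pick_upstream(pool, required_tier=3).name)  # strong-model
```

An easy request can be served by any tier, so the tier-1 upstream wins on price. A hard one leaves only tier 3 in the candidate list.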
Capability filtering
Before auto considers an upstream, it checks whether the upstream can
actually serve the request:
- Native tool/function calling — if your request includes tool definitions, only upstreams that support tool calling are eligible.
- Vision — image inputs route only to upstreams that accept them.
- Context length — Prism checks your request's token count against each upstream's context window and skips upstreams that can't fit it.
An upstream that's cheap but lacks a needed capability is simply excluded from consideration for that request — it won't be picked even if it's your lowest tier.
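Capability filtering happens before any tier or cost comparison. The hypothetical sketch below shows that check. The field names (supports_tools, supports_vision, context_window, token_count) are illustrative assumptions. As described above, the context check compares only the request's own token count against each window.

```python
from dataclasses import dataclass


@dataclass
class UpstreamCaps:
    name: str
    supports_tools: bool
    supports_vision: bool
    context_window: int  # max tokens the upstream accepts


@dataclass
class RequestNeeds:
    uses_tools: bool  # request includes tool definitions
    has_images: bool  # request includes image inputs
    token_count: int  # tokens in the request


def eligible(upstreams: list[UpstreamCaps], needs: RequestNeeds) -> list[UpstreamCaps]:
    """Keep only upstreams that can serve the request, regardless of tier or price."""
    return [
        u
        for u in upstreams
        if (u.supports_tools or not needs.uses_tools)
        and (u.supports_vision or not needs.has_images)
        and needs.token_count <= u.context_window
    ]


pool = [
    UpstreamCaps("cheap", supports_tools=False, supports_vision=False, context_window=32_000),
    UpstreamCaps("strong", supports_tools=True, supports_vision=True, context_window=128_000),
]
needs = RequestNeeds(uses_tools=True, has_images=False, token_count=1_000)
print([u.name for u in eligible(pool, needs)])  # ['strong']
```

The cheap upstream drops out because it lacks tool calling, even though the request is short.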
Cascade mode
Set model: "cascade" when you want explicit, deterministic control over
try-order instead of letting auto decide. Cascade skips the difficulty
classifier entirely and walks your upstreams strictly in layer order —
the priority number you assign to each upstream on the
Upstreams page (lower numbers are tried first; unnumbered
upstreams are used last, after all numbered layers).
curl https://prism-api.tyo.com.au/v1/chat/completions \
-H "Authorization: Bearer tyr-your-key-here" \
-H "Content-Type: application/json" \
-d '{"model": "cascade", "messages": [{"role": "user", "content": "Hello!"}]}'
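The cascade try-order rule can be expressed as a sort key. This is a hypothetical sketch. Layer and its priority field (None meaning no number is set) are assumptions. This page does not say how Prism breaks ties between equal priorities, so this version keeps the input order for ties because Python's sort is stable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    name: str
    priority: Optional[int]  # None = no number set on the Upstreams page


def cascade_order(layers: list[Layer]) -> list[Layer]:
    """Numbered layers first (lowest number first), then unnumbered ones."""
    # False sorts before True, so unnumbered layers land after every numbered one.
    return sorted(layers, key=lambda layer: (layer.priority is None, layer.priority or 0))


layers = [Layer("fallback", None), Layer("mid", 2), Layer("cheap", 1)]
print([layer.name for layer in cascade_order(layers)])  # ['cheap', 'mid', 'fallback']
```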
A few things to know about v1 cascade behaviour:
- Escalation happens on errors only. A layer is skipped and the next one tried when it returns a retryable error — 5xx, timeouts, or rate limits. A deterministic 4xx (bad request, auth failure) does not cascade, since it would fail the same way at every layer.
- Capability filtering still applies. Layers that can't serve the request (missing tool support, no vision, too small a context window) are skipped, same as in auto.
- Streaming requests escalate only before the first byte. Once a layer has started streaming bytes back to you, Prism can't hand the request to the next layer mid-stream.
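The error-only escalation rule can be sketched as a loop over layers in cascade order. All of the names here are hypothetical. UpstreamError stands in for whatever failure your transport raises, and a status of None means a timeout. Each call is treated as all-or-nothing, so this sketch does not model the pre-first-byte streaming rule.

```python
from typing import Callable, Optional


class UpstreamError(Exception):
    """Hypothetical failure from one layer; status None means a timeout."""

    def __init__(self, status: Optional[int] = None):
        super().__init__(f"upstream failed (status={status})")
        self.status = status


def is_retryable(err: UpstreamError) -> bool:
    # Timeouts, rate limits (429) and 5xx escalate; other 4xx would fail everywhere.
    return err.status is None or err.status == 429 or err.status >= 500


def run_cascade(layers: list[str], call: Callable[[str], str]) -> str:
    """Try layers in cascade order, moving on only after a retryable failure."""
    last_error: Optional[UpstreamError] = None
    for layer in layers:
        try:
            return call(layer)
        except UpstreamError as err:
            if not is_retryable(err):
                raise  # deterministic 4xx: stop here rather than cascade
            last_error = err
    raise RuntimeError("every layer failed with a retryable error") from last_error


def fake_call(layer: str) -> str:
    # Simulated outcomes: "a" is down, "b" is rate limited, "c" answers.
    failures = {"a": UpstreamError(503), "b": UpstreamError(429)}
    if layer in failures:
        raise failures[layer]
    return f"answered by {layer}"


print(run_cascade(["a", "b", "c"], fake_call))  # answered by c
```

A 400 or 401 is re-raised at the first layer. Trying further layers would only repeat the same failure.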
Cascade is deliberately simple in v1: it reacts to transport/HTTP failures, not to answer quality. Quality-based escalation — detecting a weak or truncated answer and retrying the next layer, eventually backed by a judge score — is on the roadmap as an opt-in feature, not the default.
How savings are measured
Your dashboard shows actual_cost (what you were charged, in aggregate,
across your upstreams) alongside savings. Savings are calculated against
what the same request would have cost had it gone to your most expensive
tier-3 upstream — the baseline is your own most expensive model, not a
generic market rate. If you've only added one upstream, savings will be
zero, since there's nothing cheaper to route to.