Smart routing

Every request to Prism names a model. What happens next depends on what you put there.

auto, cascade, or an explicit upstream

Set model: "auto" and Prism decides, per request and before forwarding, which of your upstreams handles it; this applies to streaming responses too. Set model: "cascade" to walk your upstreams in a strict order you control (see "Cascade mode" below). Set model to the name you gave an upstream when you added it (e.g. "ds" or "deepseek"), exactly as /v1/models lists it, and Prism skips the routing decision entirely: it sends the request straight to that upstream and substitutes the provider's real model name for you. Use an explicit name when you need a specific model's behaviour, auto when you just want the cheapest upstream that can do the job, and cascade when you want to define the fallback order yourself.
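As a rough illustration, the three choices differ only in the model field of the request body. The sketch below builds those bodies without sending them. The helper name build_body is made up for this example, and "deepseek" stands in for whatever name /v1/models lists for your own upstream.

```python
import json

# Endpoint taken from the curl example in this document.
PRISM_URL = "https://prism-api.tyo.com.au/v1/chat/completions"


def build_body(model: str, prompt: str) -> str:
    """Serialize a chat-completions request body for the given model value.

    build_body is a hypothetical helper for illustration, not part of Prism.
    """
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })


# "auto"     -> Prism picks the upstream for this request.
# "cascade"  -> Prism walks your upstreams in the order you set.
# "deepseek" -> an explicit upstream name (an assumed example name).
for model in ("auto", "cascade", "deepseek"):
    print(build_body(model, "Hello!"))
```

Only the model string changes between the three requests; the messages payload is identical.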

How auto decides

Routing weighs two things: difficulty and cost. Prism looks at the request — its length, complexity signals, and requested capabilities — and estimates how demanding it is, then picks the cheapest upstream in your account that can handle it. A short factual question can go to your cheapest tier; a long, reasoning-heavy prompt gets bumped to a stronger model. You don't have to configure this per request — it's decided from what you send.

To make that decision, Prism may issue a tiny (≤5-token) classification request to your cheapest tier-1 upstream, using your key, to rate the difficulty of what you sent. This happens only for model: "auto". As a result, your prompt text reaches that tier-1 provider even when a different upstream ends up serving the actual response, and you may see a small number of extra calls on that provider's bill.

Tiers

Each upstream you add gets a tier:

  • Tier 1 — cheap. Your default for everyday, low-difficulty requests.
  • Tier 2 — mid. A step up in capability and cost, used when tier 1 isn't a good fit.
  • Tier 3 — strong. Your most capable (and usually most expensive) upstream, reserved for genuinely hard requests.

You set the tier when adding an upstream (or accept the preset's default). auto reaches for a higher tier only when the request needs it.
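One way to picture tier selection is the sketch below. It is not Prism's implementation: it assumes the difficulty estimate has already been reduced to a minimum tier (min_tier), and that each upstream carries a single price (cost_per_mtok). Both are hypothetical names introduced for illustration, as are the upstream names and prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    name: str             # the name you gave the upstream
    tier: int             # 1 = cheap, 2 = mid, 3 = strong
    cost_per_mtok: float  # assumed price field, USD per million tokens


def pick_upstream(upstreams, min_tier):
    """Return the lowest-tier, then cheapest, upstream at or above min_tier."""
    candidates = [u for u in upstreams if u.tier >= min_tier]
    if not candidates:
        raise LookupError(f"no upstream at tier {min_tier} or above")
    # Prefer the lowest sufficient tier, and break ties on price.
    return min(candidates, key=lambda u: (u.tier, u.cost_per_mtok))


# Made-up account with one upstream per tier.
ACCOUNT = [
    Upstream("ds", tier=1, cost_per_mtok=0.3),
    Upstream("mid", tier=2, cost_per_mtok=3.0),
    Upstream("big", tier=3, cost_per_mtok=15.0),
]

print(pick_upstream(ACCOUNT, min_tier=1).name)  # easy request
print(pick_upstream(ACCOUNT, min_tier=3).name)  # hard request
```

If no upstream exists at the needed tier, the sketch moves up to the next tier that has one; whether Prism does the same is not stated here.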

Capability filtering

Before auto considers an upstream, it checks whether the upstream can actually serve the request:

  • Native tool/function calling — if your request includes tool definitions, only upstreams that support tool calling are eligible.
  • Vision — image inputs route only to upstreams that accept them.
  • Context length — Prism checks your request's token count against each upstream's context window and skips upstreams that can't fit it.

An upstream that's cheap but lacks a needed capability is simply excluded from consideration for that request — it won't be picked even if it's your lowest tier.
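The three checks above amount to a filter applied before any cost comparison. A minimal sketch follows, assuming each upstream exposes two capability flags and a context window size. The field names are invented for this example and are not Prism's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upstream:
    name: str
    supports_tools: bool   # native tool/function calling
    supports_vision: bool  # accepts image inputs
    context_window: int    # maximum tokens the upstream can fit


@dataclass(frozen=True)
class RequestNeeds:
    has_tools: bool   # request includes tool definitions
    has_images: bool  # request includes image inputs
    token_count: int  # request size in tokens


def eligible(upstreams, needs):
    """Keep only the upstreams that can actually serve the request."""
    return [
        u for u in upstreams
        if (u.supports_tools or not needs.has_tools)
        and (u.supports_vision or not needs.has_images)
        and needs.token_count <= u.context_window
    ]


# Made-up upstreams: a cheap text-only one and a fully capable one.
cheap = Upstream("cheap", supports_tools=False, supports_vision=False, context_window=8_000)
strong = Upstream("strong", supports_tools=True, supports_vision=True, context_window=128_000)

needs = RequestNeeds(has_tools=True, has_images=False, token_count=2_000)
print([u.name for u in eligible([cheap, strong], needs)])
```

Here the cheap upstream drops out because the request carries tool definitions, even though it is the lower-cost option.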

Cascade mode

Set model: "cascade" when you want explicit, deterministic control over try-order instead of letting auto decide. Cascade skips the difficulty classifier entirely and walks your upstreams strictly in layer order — the priority number you set on each upstream from the Upstreams page (lower number tried first; unnumbered upstreams are used last, after all numbered layers).

curl https://prism-api.tyo.com.au/v1/chat/completions \
  -H "Authorization: Bearer tyr-your-key-here" \
  -H "Content-Type: application/json" \
  -d '{"model": "cascade", "messages": [{"role": "user", "content": "Hello!"}]}'
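The layer ordering described above (lower number first, unnumbered upstreams last) can be sketched as a sort. This is an illustration only: upstreams are modelled as (name, layer) pairs, with None for an unnumbered upstream. Keeping the input order among equal or unnumbered layers is an assumption of the sketch, not documented behaviour.

```python
def cascade_order(upstreams):
    """Order (name, layer) pairs: numbered layers ascending, then unnumbered.

    layer is an int priority, or None when the upstream has no number.
    sorted() is stable, so ties keep their original relative order.
    """
    return sorted(
        upstreams,
        key=lambda u: (u[1] is None, u[1] if u[1] is not None else 0),
    )


# Made-up upstream names and layer numbers.
configured = [("spare", None), ("backup", 2), ("primary", 1)]
print([name for name, _ in cascade_order(configured)])
```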

A few things to know about v1 cascade behaviour:

  • Escalation happens on errors only. A layer is skipped and the next one tried when it returns a retryable error — 5xx, timeouts, or rate limits. A deterministic 4xx (bad request, auth failure) does not cascade, since it would fail the same way at every layer.
  • Capability filtering still applies. Layers that can't serve the request — missing tool support, no vision, too small a context window — are skipped, same as in auto.
  • Streaming requests escalate pre-first-byte only. Once a layer has started streaming bytes back to you, Prism can't hand the request to the next layer mid-stream.
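The error rule in the first bullet can be sketched as a loop over the ordered layers. This is a simplified model, not Prism's code: send is a hypothetical callable returning (status, body), a status of None stands for a timeout, and rate limits are assumed to surface as HTTP 429. Streaming is left out.

```python
def is_retryable(status):
    """True for timeouts (None), rate limits (assumed 429), and 5xx errors."""
    return status is None or status == 429 or 500 <= status <= 599


def run_cascade(layers, send):
    """Try each layer in order and return (name, status, body).

    layers: upstream names, already in cascade order.
    send:   hypothetical callable, send(name) -> (status, body).
    """
    if not layers:
        raise ValueError("no layers to try")
    result = None
    for name in layers:
        status, body = send(name)
        result = (name, status, body)
        if status is not None and 200 <= status < 300:
            return result  # success: stop here
        if not is_retryable(status):
            break  # deterministic failure: later layers would fail the same way
    return result  # the last failure seen


# Fake transport for demonstration: the first layer is down, the second works.
fake = {"primary": (503, "unavailable"), "backup": (200, "ok")}
print(run_cascade(["primary", "backup"], lambda name: fake[name]))
```

A 400 from the first layer ends the loop immediately, while a 503 or a timeout moves on to the next layer.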

Cascade is deliberately simple in v1: it reacts to transport/HTTP failures, not to answer quality. Quality-based escalation — detecting a weak or truncated answer and retrying the next layer, eventually backed by a judge score — is on the roadmap as an opt-in feature, not the default.

How savings are measured

Your dashboard shows actual_cost (what you were charged, in aggregate, across your upstreams) alongside savings. Savings are calculated against what the same request would have cost had it gone to your most expensive tier-3 upstream: the baseline is your own most expensive model, not a generic market rate. If you've added only one upstream, savings will be zero, since there's nothing cheaper to route to.
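As a worked illustration of that baseline, the sketch below compares one request's cost on the upstream that served it against the cost on the most expensive upstream. The flat per-million-token price and every number in it are assumptions for the example. Real billing usually distinguishes input and output tokens, and the exact formula Prism uses is not shown here.

```python
def request_savings(tokens, served_price, baseline_price):
    """Savings for one request, in the same currency as the prices.

    tokens:         total tokens billed for the request
    served_price:   assumed price per million tokens of the upstream that served it
    baseline_price: assumed price per million tokens of your most expensive upstream
    """
    actual_cost = tokens * served_price / 1_000_000
    baseline_cost = tokens * baseline_price / 1_000_000
    return baseline_cost - actual_cost


# A 2,000-token request served at 0.50 per million, against a 15.00 baseline.
print(round(request_savings(2_000, 0.50, 15.00), 6))

# With a single upstream, the served price is the baseline, so savings are zero.
print(request_savings(2_000, 15.00, 15.00))
```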