If you've been reading conference decks, you'd think every SaaS team in 2026 is running multi-agent orchestration over a serverless graph database with edge inference at every PoP. Reality is dirtier. Most teams have one model provider, a Postgres extension for vectors, a single agent loop with three tools, and a Datadog bill that quietly doubled this year.
This is the actual stack we see day-to-day at EBITA — what's stabilised, what's still duct tape, and where the spend is going.
The four-tier mental model
Stop thinking about "the AI stack" as a flat list. It separates cleanly into four tiers, and the boundaries between them are where the 2024-era hacks have either consolidated into products or quietly died.
Tier 1 — Inference. The model itself. Foundation models from Anthropic, OpenAI, Google, plus open-weight options from Meta, Mistral, and Alibaba. This tier is now a commodity-with-personality. You pick one as default, keep adapters for the others, and route by task.
Tier 2 — Retrieval. Where your facts live. Vector embeddings, plus the dirty secret that lexical search still wins for half of queries. Hybrid search has won.
Tier 3 — Orchestration. Where agents, tools, and workflows execute. This is the tier with the most chaos and the most opinion. LangChain, LangGraph, Mastra, Inngest, plus a long tail of custom code.
Tier 4 — Application. Your product UI and the user-facing behaviour. Streaming responses, tool-use feedback, evals running in the background, the cost meter on every call.
Read each tier separately when you're picking tools. Mistakes happen when teams pick "an AI framework" and end up with one vendor's opinion on all four tiers fused together.
Tier 1: Inference — stop running the bake-off forever
The model question is mostly settled. Pick one default: Anthropic's Claude family or OpenAI's GPT line for most production work, Gemini if you need long context cheaply, and an open-weight model on your own GPUs only if you have a privacy requirement that a BAA with the provider can't satisfy.
The pattern that keeps biting teams: running the bake-off every quarter and rewriting prompts each time. The model benchmarks have compressed. Differences on the metrics that matter to your product (your evals, not MMLU) are often inside the noise floor. Pick, ship, switch only when your evals tell you to — not when a competitor ships a new model.
What's actually changed in 2026: prompt caching is now first-class on all the big providers, which has knocked 40-70% off the cost of any RAG-style workload that reuses a system prompt. If you're not using it, that's a single afternoon's work for a fat margin win.
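To see where a saving of that size can come from, here is a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: the function name, the token counts, and especially the two price multipliers (a surcharge for writing the shared prefix to the cache, a discount for reading it back). It also assumes the cache never expires between calls and ignores output tokens, so check your provider's current pricing before trusting the number.

```python
def cached_cost_ratio(prefix_tokens: int, suffix_tokens: int, calls: int,
                      write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input-token cost with prompt caching divided by cost without it.

    prefix_tokens: the reused part of the prompt (system prompt, shared context).
    suffix_tokens: the part that changes on every call (user query, retrieved docs).
    write_mult / read_mult: hypothetical price multipliers relative to the
    normal input price; substitute your provider's real figures.
    """
    uncached = calls * (prefix_tokens + suffix_tokens)
    cached = (
        prefix_tokens * write_mult                   # first call writes the prefix to cache
        + (calls - 1) * prefix_tokens * read_mult    # later calls read it at a discount
        + calls * suffix_tokens                      # the variable part is never cached
    )
    return cached / uncached


# Hypothetical workload: 4,000-token shared prefix, 1,000-token variable part, 100 calls.
ratio = cached_cost_ratio(prefix_tokens=4000, suffix_tokens=1000, calls=100)
print(f"input cost with caching: {ratio:.0%} of uncached")
```

The saving depends almost entirely on how much of each prompt is the reused prefix, so a workload with a short system prompt and long retrieved context will land well below the headline range.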
“We re-ran our model bake-off three times in 2025. Each time the winner moved by less than our prompt's own week-to-week variance. We finally accepted the cost-saving plan was 'pick one and stop benchmarking.'”
Staff Platform Engineer, Series B vertical SaaS
Tier 2: Retrieval — Postgres won, mostly
The vector-database wars are over and the winner is pgvector. Dedicated vector DBs (Pinecone, Weaviate, Qdrant) still have niches at very large scale, but for the 90% case — under 50 million vectors, predictable query patterns, mixed with relational data — running pgvector inside the same Postgres that holds the rest of your application data is the right answer. You skip an entire data-sync surface and your retrieval queries can join against users, documents, permissions in one shot.
Two things to actually do here:
Hybrid retrieval is the default. Pure vector search loses to BM25 on exact-match queries (product codes, names, error strings) every time. Combine: do BM25 + vector in parallel, fuse with reciprocal rank fusion or a simple linear combination, take top-k. The combined result beats either alone by a wide margin on most real corpora.
Chunking is still where most teams underinvest. The fashionable advice in 2024 was 512-token chunks with 50-token overlap. The fashionable advice in 2026 is structured chunking — by section, by sentence, by semantic break — with parent-document retrieval so the model sees enough context. Test it on your data; the right answer is corpus-specific.
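The structured-chunking idea can be sketched in a few lines of Python. This is a minimal illustration, not a recommended splitter: it assumes documents use `#`-style headings, uses a naive regex for sentence boundaries, and the names `Chunk`, `chunk_by_section`, and `expand_to_parent` are hypothetical. Small child chunks are what you would embed and search; the parent section is what you would hand to the model.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str
    parent_id: str   # the section this chunk was cut from
    text: str


def chunk_by_section(doc: str, max_sentences: int = 2):
    """Split a document into parent sections (by heading) and small child chunks."""
    parents: dict[str, str] = {}
    children: list[Chunk] = []
    # Zero-width split before every line that starts with '#'.
    sections = [s.strip() for s in re.split(r"^(?=#)", doc, flags=re.MULTILINE) if s.strip()]
    for i, section in enumerate(sections):
        parent_id = f"sec-{i}"
        parents[parent_id] = section
        # Naive sentence split: whitespace that follows ., ! or ?
        sentences = re.split(r"(?<=[.!?])\s+", section)
        for j in range(0, len(sentences), max_sentences):
            text = " ".join(sentences[j:j + max_sentences])
            children.append(Chunk(f"{parent_id}-c{j // max_sentences}", parent_id, text))
    return parents, children


def expand_to_parent(hit: Chunk, parents: dict[str, str]) -> str:
    """Parent-document retrieval: match on the small chunk, return the whole section."""
    return parents[hit.parent_id]


doc = (
    "# Install\nRun the installer. Restart the service. Check the logs. Done.\n"
    "# Errors\nE42 means the disk is full."
)
parents, children = chunk_by_section(doc)
hit = next(c for c in children if "E42" in c.text)
print(expand_to_parent(hit, parents))
```

Swapping the heading split for a semantic-break detector changes only the first step; the parent lookup stays the same, which makes it cheap to compare chunking strategies on your own corpus.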
```sql
-- Hybrid retrieval: Postgres full-text rank (a stand-in for BM25) + pgvector,
-- fused with reciprocal rank fusion (RRF).
-- ts_rank_cd handles exact-match queries that pure vector search butchers;
-- the embedding handles paraphrase and intent. RRF (k=60) blends them.
-- $1 = query text, $2 = query embedding.
WITH lexical AS (
  SELECT id,
         ts_rank_cd(search_tsv, plainto_tsquery($1)) AS score,
         row_number() OVER (
           ORDER BY ts_rank_cd(search_tsv, plainto_tsquery($1)) DESC
         ) AS rank
  FROM documents
  WHERE search_tsv @@ plainto_tsquery($1)
  ORDER BY score DESC  -- without this, LIMIT keeps an arbitrary 50 matches
  LIMIT 50
),
semantic AS (
  SELECT id,
         1 - (embedding <=> $2) AS score,
         row_number() OVER (ORDER BY embedding <=> $2) AS rank
  FROM documents
  ORDER BY embedding <=> $2
  LIMIT 50
)
SELECT id, SUM(1.0 / (60 + rank)) AS rrf_score
FROM (
  SELECT id, rank FROM lexical
  UNION ALL
  SELECT id, rank FROM semantic
) u
GROUP BY id
ORDER BY rrf_score DESC
LIMIT 10;
```

Tier 3: Orchestration — where everyone is still arguing
This is the tier where teams burn the most engineering hours and where the tooling is least settled.
For workflows that are mostly deterministic — "fetch this, transform that, call this tool, return result" — you don't need an agent framework. You need a workflow engine. Inngest, Temporal, or a plain Postgres-backed job queue does the job, and your reliability story is dramatically better than "I hope the LLM picks the right tool."
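As a rough picture of what "deterministic" buys you, here is an in-process Python sketch of fixed-order steps with per-step retries. It is only an illustration: `run_workflow` and the step names are hypothetical, and a real engine such as Inngest or Temporal additionally persists state so a run survives a crash, which this sketch does not attempt.

```python
import time
from typing import Any, Callable

Step = Callable[[Any], Any]


def run_workflow(steps: list[tuple[str, Step]], payload: Any,
                 max_attempts: int = 3, backoff_s: float = 0.0):
    """Run named steps in a fixed order, retrying each failed step.

    Returns the final payload and a log of (step, attempt, outcome) entries.
    Re-raises the last error if a step fails max_attempts times.
    """
    log = []
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                payload = step(payload)
                log.append((name, attempt, "ok"))
                break
            except Exception as exc:
                log.append((name, attempt, f"error: {exc}"))
                if attempt == max_attempts:
                    raise
                time.sleep(backoff_s * attempt)  # linear backoff between retries
    return payload, log


# Hypothetical three-step flow: fetch, transform, call one tool.
steps = [
    ("fetch", lambda ticket_id: {"id": ticket_id, "body": "printer on fire"}),
    ("transform", lambda t: {**t, "body": t["body"].upper()}),
    ("tool_call", lambda t: f"ticket {t['id']} routed: {t['body']}"),
]
result, log = run_workflow(steps, 42)
print(result)
```

The order of steps is decided by code, not by a model, so every failure shows up as a named step and an attempt number in the log.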
For workflows that genuinely need agentic behaviour — open-ended planning, branching, dynamic tool selection — you need something with a typed graph and a debugger. LangGraph and Mastra are the two we see most often. Both are fine. Pick the one your team's TypeScript-vs-Python preference matches.
Model Context Protocol (MCP), introduced in late 2024, is genuinely production-ready in 2026. The "every tool I want my agent to use is its own integration nightmare" problem has eased. MCP servers for filesystem, browsers, Slack, Linear, GitHub, and most major SaaS are now stable enough to use in production. The win isn't just less glue code; it's that you can swap models without rewriting your tool layer.
Tier 4: Application — the streaming-and-evals layer
The user-facing tier is where the polish lives, and it's the part most teams underbuild. The cost of getting it wrong is a product that feels janky even when the model is fine.
- Stream tokens, always. No exceptions. Users tolerate slow if they see motion; they bounce on stalls.
- Stream tool calls. Show the agent's intermediate steps. Users trust progress they can see.
- Run evals in CI. Production AI features without evals are production code without tests — the failure mode is just slower to notice.
- Cost meter every call. Per-feature, per-tenant. The teams who skip this end up with a single Anthropic invoice that destroys margin for a quarter before anyone notices.
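The cost-meter item in the list above is small enough to sketch in Python. Treat this as a minimal in-memory illustration: the `CostMeter` class, the model name, and the prices per million tokens are all hypothetical, and a production version would write to your metrics store rather than a dict.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; replace with your provider's rates.
PRICES = {"default-model": {"input": 3.00, "output": 15.00}}


class CostMeter:
    """Accumulates spend per (tenant, feature) pair."""

    def __init__(self) -> None:
        self.totals: dict[tuple[str, str], float] = defaultdict(float)

    def record(self, tenant: str, feature: str, model: str,
               input_tokens: int, output_tokens: int) -> float:
        """Price one call from its token counts and add it to the running total."""
        price = PRICES[model]
        cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        self.totals[(tenant, feature)] += cost
        return cost

    def by_feature(self) -> dict[str, float]:
        """Roll tenant-level totals up to one number per feature."""
        out: dict[str, float] = defaultdict(float)
        for (_, feature), cost in self.totals.items():
            out[feature] += cost
        return dict(out)


meter = CostMeter()
meter.record("acme", "summarise", "default-model", input_tokens=12_000, output_tokens=800)
meter.record("globex", "summarise", "default-model", input_tokens=9_000, output_tokens=600)
meter.record("acme", "search", "default-model", input_tokens=2_000, output_tokens=100)
print(meter.by_feature())
```

Calling `record` from the same wrapper that makes the model request is usually enough; the token counts come back in the provider's response metadata.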
The eval part is where 2026 has matured most. Tools like Braintrust, LangSmith, Helicone, and Phoenix have made it possible to score outputs continuously, regression-test prompts, and ship changes with confidence instead of vibes.
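If you have no eval tooling yet, the core loop is simple enough to start with by hand. The Python sketch below is an assumption-heavy illustration, not the API of any of the tools named above: `run_evals`, `fake_model`, and the two cases are hypothetical, and the stand-in model returns canned strings so the example runs offline.

```python
from typing import Callable


def run_evals(generate: Callable[[str], str],
              cases: list[tuple[str, Callable[[str], bool]]],
              min_pass_rate: float = 0.9) -> dict:
    """Run every (prompt, check) case and report whether the pass rate clears the bar."""
    results = [(prompt, bool(check(generate(prompt)))) for prompt, check in cases]
    failures = [prompt for prompt, ok in results if not ok]
    pass_rate = (len(results) - len(failures)) / len(results)
    return {"pass_rate": pass_rate, "failures": failures, "passed": pass_rate >= min_pass_rate}


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without network access."""
    canned = {
        "Refund policy?": "Refunds are available within 30 days.",
        "Reply in JSON": "sure, here you go",
    }
    return canned.get(prompt, "")


CASES = [
    ("Refund policy?", lambda out: "30 days" in out),
    ("Reply in JSON", lambda out: out.strip().startswith("{")),
]

report = run_evals(fake_model, CASES)
print(report)
```

In CI you would swap `fake_model` for the real call and fail the build when `report["passed"]` is false, which turns a prompt change into something a reviewer can block.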
“The team that taught me the most about AI products didn't build the smartest agent — they built the best per-feature cost dashboard. The day they could see margin per call, the architecture decisions wrote themselves.”
VP Engineering, AI-native fintech, ~80 engineers
What's *not* on the list
A short list of things that show up in pitch decks but not in the real production stacks of the mid-stage SaaS startups we work with:
- Multi-agent swarms for typical SaaS workflows. They mostly add latency, cost, and failure modes for marginal gains. Single-agent + good tools wins.
- Fine-tuning for general behaviour. It's still useful for narrow tasks (classification, structured extraction at scale, format-conformance), but tuning a model for "follows our brand voice" is almost always worse than a good system prompt + a few-shot block.
- Self-hosted inference unless you have a real privacy or cost trigger. Run the math; for most teams the cost of GPU operations exceeds API costs by 3-5x once you include reliability, scaling, and on-call.
- Bespoke embedding models. The off-the-shelf options (OpenAI's text-embedding-3-large, Cohere's embed-english-v3.0, Voyage AI) cover almost every use case. Save the engineering for retrieval logic and chunking, not embeddings.
The shortest stack that works
Frequently asked questions
Is pgvector really enough at scale?
For most SaaS workloads — yes, up to tens of millions of vectors with sensible indexing (HNSW or IVFFlat depending on access patterns). The break point comes when you need sustained sub-50ms p99 over hundreds of millions of vectors with frequent updates. Most companies never see that.
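A quick sizing estimate helps make "tens of millions" concrete. The Python sketch below assumes single-precision vectors at 4 bytes per dimension plus 8 bytes of per-vector overhead, which is how pgvector describes its `vector` type; the 1,536-dimension figure is a hypothetical embedding size, and the result covers raw vector data only, not the HNSW or IVFFlat index or the rest of the row.

```python
def vector_storage_gb(num_vectors: int, dims: int,
                      bytes_per_dim: int = 4, overhead_bytes: int = 8) -> float:
    """Approximate raw storage for the vector column, in decimal gigabytes.

    Assumes 4-byte floats plus a fixed per-vector overhead; excludes indexes,
    other columns, and Postgres page overhead.
    """
    return num_vectors * (dims * bytes_per_dim + overhead_bytes) / 1e9


# Hypothetical corpus sizes at 1,536 dimensions.
for n in (1_000_000, 10_000_000, 50_000_000):
    print(f"{n:>11,} vectors: {vector_storage_gb(n, 1536):.1f} GB")
```

Whether the index fits in memory is what usually decides query latency, so run this with your real dimension count before choosing an instance size.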
Do we need fine-tuning to compete?
Almost certainly not. For 95% of product features, a well-crafted prompt with the latest frontier model outperforms a fine-tune of a smaller model on cost, accuracy, and maintenance. Fine-tune when you have a narrow, well-defined task with consistent inputs and a clear win you can measure.
LangChain or LangGraph?
LangGraph for new work. The graph model is the right abstraction for agentic loops; the original LangChain has too much accumulated surface area.
Multi-model routing — useful or premature optimisation?
Useful at scale, premature for most teams. The savings only materialise when you're spending >$10k/month on inference. Before that, the cost of building and maintaining the router exceeds the savings.
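The break-even argument is easy to check against your own numbers. The Python sketch below is illustrative arithmetic only: the function name and every input (the share of traffic a cheaper model could handle, the cheaper model's relative price, and the monthly cost of maintaining the router) are hypothetical assumptions, not measurements.

```python
def router_net_savings(monthly_spend: float, routable_share: float,
                       cheap_price_ratio: float, monthly_router_cost: float) -> float:
    """Monthly savings from routing minus the cost of owning the router.

    routable_share: fraction of spend that could move to a cheaper model.
    cheap_price_ratio: cheaper model's price as a fraction of the default's.
    monthly_router_cost: engineering and maintenance time, in the same currency.
    """
    gross_savings = monthly_spend * routable_share * (1 - cheap_price_ratio)
    return gross_savings - monthly_router_cost


# Hypothetical inputs: half of traffic is routable, the cheap model costs 20% as
# much, and the router takes about $3,000 a month of engineering time to maintain.
for spend in (3_000, 10_000, 50_000):
    net = router_net_savings(spend, routable_share=0.5, cheap_price_ratio=0.2,
                             monthly_router_cost=3_000)
    print(f"${spend:>6,}/month spend -> net ${net:,.0f}/month")
```

With inputs like these the router only pays for itself once inference spend is in five figures a month; if your routable share is smaller, the threshold moves higher still.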
Where this leaves you
The right move for most product teams in 2026 is to resist novelty. The boring stack — one model provider, Postgres for everything including vectors, a workflow engine for deterministic flows, an agent framework only where genuinely needed, evals everywhere — is faster, cheaper, more reliable, and easier to hire for than the cutting edge.
The teams shipping the best AI features right now are not the ones with the most exotic stacks. They're the ones who picked early, kept moving, and let everyone else burn cycles on the latest demo.
Talk to us if you're rebuilding a product stack to be AI-native and want a sanity check on your choices. The number of decisions that genuinely matter is smaller than the LinkedIn timeline suggests.



