The GenAI Stack: What Enterprises Are Building On
Foundation model, orchestration, vector database, eval framework — a map of the GenAI stack choices enterprises made in 2024 and what held up into 2026.
In the first half of 2023, every conversation about enterprise generative AI bottlenecked on the same question: which foundation model should we use? By mid-2024, that question had largely been resolved — not because a single winner had emerged, but because the ecosystem had matured enough that the foundation model selection was no longer the hardest decision. The harder decisions were architectural: what does the full stack look like, how do the layers compose, and which choices would prove durable as the technology continued to evolve at pace?
We worked with clients across financial services, insurance, and enterprise software through 2024 as they made these architectural decisions. Here is a clear-eyed account of what the GenAI stack looked like by the time it had stabilised, and what has held up into 2026.
Layer 1: Foundation Models
The foundation model market in 2024 was characterised by rapid capability improvement and genuine differentiation across providers — not the commodity market that some analysts predicted.
GPT-4o (OpenAI, released May 2024) became the default choice for general enterprise use cases: strong across language tasks, multimodal (text, image, audio in the same model), relatively fast, and with a pricing structure that made production deployment tractable. The enterprise tier offered data processing agreements that satisfied most corporate data governance requirements.
Claude 3 Opus and Claude 3.5 Sonnet (Anthropic) differentiated primarily on context window length (200,000 tokens at launch) and performance on long-document tasks. For clients doing regulatory document analysis, contract review, or anything requiring sustained coherence across large inputs, Claude became a clear recommendation. Anthropic’s Constitutional AI training approach also produced measurably fewer of the false-positive refusals that create friction in enterprise deployments.
Gemini 1.5 Pro (Google DeepMind) introduced a 1 million token context window and native multimodal capability that included video understanding. For clients working with mixed document types — scanned PDFs, embedded images, tables — Gemini’s architecture offered advantages that were not available elsewhere.
Llama 3 (Meta, released April 2024) changed the calculus for clients with data sovereignty requirements or the engineering capacity to run their own inference. Released in 8B and 70B parameter sizes and available for download and self-hosting, with even the 8B model outperforming GPT-3.5 on most benchmarks, Llama 3 made on-premise LLM deployment viable for the first time at useful capability levels. For Southeast Asian clients subject to data residency requirements, Llama 3 opened a path to LLM deployment that did not require sending data to US-hosted cloud APIs.
Selection criteria that held up in practice:
- Latency: the difference between a 500ms and a 2,000ms first-token response time is significant for user-facing applications.
- Cost per token: input and output token costs vary by a factor of 10x or more across models, and this dominates total infrastructure cost at scale.
- Context window: determines how much content can be processed in a single call, which directly affects RAG architecture choices.
- Fine-tuning availability: not all models offer fine-tuning through the API.
- Data residency: which regions process the API calls, and what contractual commitments exist around data handling.
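The cost-per-token criterion is worth making concrete, because a 10x price gap compounds directly with volume. A minimal sketch, using hypothetical prices and a hypothetical call profile rather than any provider's actual rate card:

```python
# Illustrative monthly cost estimate. The model names and prices below are
# placeholders, not real rate cards -- substitute your provider's pricing.
PRICES_PER_1M_TOKENS = {            # (input_usd, output_usd) per 1M tokens
    "model-a": (5.00, 15.00),
    "model-b": (0.50, 1.50),        # 10x cheaper on both sides
}

def monthly_cost(model: str, calls_per_day: int,
                 tokens_in: int, tokens_out: int) -> float:
    """Estimated monthly API spend for a fixed per-call token profile."""
    price_in, price_out = PRICES_PER_1M_TOKENS[model]
    per_call = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return per_call * calls_per_day * 30

# At 50,000 calls/day with 2,000 input and 500 output tokens per call,
# the same workload costs 10x more on model-a than on model-b.
```

Running the numbers like this per candidate model, against your actual traffic profile, is a more defensible basis for selection than benchmark leaderboards alone.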
Layer 2: Orchestration
The orchestration layer — the code that coordinates calls between foundation models, retrieval systems, tools, and downstream consumers — saw the most turbulence of any layer in 2024.
LangChain entered 2024 as the dominant framework, with extensive documentation, a large community, and integrations covering virtually every model and vector store. It also entered 2024 carrying significant technical debt: an abstraction layer that had grown complex enough to obscure what was actually happening in LLM calls, and a versioning history that made upgrading painful. Enterprise teams that had built heavily on LangChain in 2023 frequently spent significant engineering time in early 2024 debugging framework behaviour rather than application logic.
LlamaIndex took a different architectural direction: tighter focus on the data indexing and retrieval problem, with less emphasis on agent orchestration. For RAG-centric applications — document Q&A, knowledge base search, enterprise search — LlamaIndex’s data connectors and index types provided more direct tools than LangChain’s broader-but-shallower coverage.
By late 2024, a significant portion of enterprise engineering teams had moved away from heavy orchestration frameworks entirely. The pattern we observed most frequently among sophisticated teams: a thin custom orchestration layer built on the raw SDK of the foundation model provider (OpenAI Python SDK, Anthropic Python SDK), with hand-rolled logic for prompt management, context assembly, and retry handling. This is more code to write and maintain, but it is code the team understands completely — which matters when something breaks in production at 2am.
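The retry handling such a thin layer needs is small enough to own outright. A minimal sketch, with the actual SDK call (e.g. a chat completion via the provider's Python SDK) passed in as a callable so the retry logic stays provider-agnostic; the exception types listed as retryable are illustrative:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def call_with_retries(fn: Callable[[], T], max_attempts: int = 3,
                      base_delay: float = 1.0,
                      retryable: tuple = (TimeoutError, ConnectionError)) -> T:
    """Run fn, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the caller
            # Exponential backoff with jitter to avoid synchronised retries.
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))

# Usage: call_with_retries(lambda: client.chat.completions.create(...))
```

Thirty lines like these, plus equally small modules for prompt templates and context assembly, replace a framework dependency with code the on-call engineer can read end to end.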
The lesson: orchestration frameworks are useful for exploration and prototyping. At production scale, the abstraction cost often outweighs the convenience benefit. Build on the primitives you understand.
Layer 3: Vector Databases
RAG requires a retrieval system. In 2024, that almost universally meant a vector database — a data store optimised for storing and querying high-dimensional vector embeddings, enabling semantic similarity search over large document corpora.
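Under the hood, "semantic similarity search" is a nearest-neighbour query over embedding vectors, almost always scored by cosine similarity. Production databases use approximate-nearest-neighbour indexes rather than the brute-force scan below, but the scoring itself looks like this:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product of the vectors over their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k document vectors most similar to the query."""
    scored = sorted(enumerate(doc_vecs),
                    key=lambda iv: cosine(query_vec, iv[1]), reverse=True)
    return [i for i, _ in scored[:k]]
```

Every product in this layer is, at its core, an engineered version of `top_k` that stays fast when the corpus holds millions of vectors.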
Pinecone was the managed default: straightforward API, no infrastructure to operate, and a pricing model that made it easy to start. For teams that wanted to ship quickly and were willing to pay for managed infrastructure, Pinecone was the path of least resistance. Its limitations — less flexibility on index configuration, vendor lock-in, and cost at high query volumes — became apparent at scale but rarely blocked initial deployment.
pgvector (a PostgreSQL extension) became the choice for teams with existing PostgreSQL infrastructure and the operational experience to manage their own databases. The appeal was clear: add vector similarity search to the database you already operate, query it with standard SQL, and avoid introducing a new infrastructure dependency. By late 2024, pgvector’s performance had improved substantially and it was handling production workloads at meaningful scale.
Weaviate offered hybrid search — combining vector similarity with keyword-based BM25 search — out of the box. For enterprise search applications where users might query with either conceptual language (“documents about customer refund policies”) or specific keywords (“refund policy section 4.2”), the hybrid approach consistently outperformed pure vector search. Weaviate was also available as a managed cloud service or self-hosted.
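One common way a hybrid engine fuses the keyword and vector result lists is reciprocal rank fusion (RRF): each document's score is the sum of 1/(k + rank) over the lists it appears in, so documents ranked well by either method surface. Engines differ in their exact fusion method, so treat this as an illustration of the idea rather than any product's implementation:

```python
def rrf(keyword_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of document ids via reciprocal rank fusion."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            # k dampens the advantage of a single very high rank.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that tops both lists beats one that tops only the vector list, which is exactly the behaviour the mixed-query users described above need.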
Qdrant positioned on performance: filtering capability (the ability to pre-filter a vector search by metadata before computing similarity, rather than computing similarity and then filtering) and throughput. For high-query-volume applications where latency mattered, Qdrant’s performance characteristics made it worth the additional operational complexity of self-hosting.
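The pre-filter vs. post-filter distinction is easy to see in miniature. Post-filtering ranks the whole corpus, keeps the top k, and only then applies the metadata predicate, so it can silently return fewer than k results; pre-filtering narrows the candidate set first. A toy sketch with a simple dot-product scorer:

```python
from typing import Callable

def search_prefilter(docs: list[dict], query_vec: list[float],
                     predicate: Callable[[dict], bool], k: int,
                     score: Callable[[list[float], list[float]], float]) -> list[dict]:
    """Restrict by metadata first, then rank only the matching candidates."""
    candidates = [d for d in docs if predicate(d)]
    return sorted(candidates, key=lambda d: score(query_vec, d["vec"]), reverse=True)[:k]

def search_postfilter(docs: list[dict], query_vec: list[float],
                      predicate: Callable[[dict], bool], k: int,
                      score: Callable[[list[float], list[float]], float]) -> list[dict]:
    """Rank everything, truncate to k, then filter -- may come back short."""
    ranked = sorted(docs, key=lambda d: score(query_vec, d["vec"]), reverse=True)[:k]
    return [d for d in ranked if predicate(d)]
```

When the metadata match rate is low, the post-filter variant starves the caller of results; this is why pre-filter support is architecturally significant, not a nicety.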
Selection criteria:
- Query latency at your query volume: test under realistic load before committing.
- Filtering capability: pre-filter vs. post-filter is architecturally significant.
- Hybrid search support: often more important than clients expect.
- Hosted vs. self-managed: a function of your operational capacity and data residency requirements.
Layer 4: Evaluation
Evaluation was the most underinvested layer of the 2024 enterprise GenAI stack. Teams that shipped without it paid for the omission later, in incident response, eroded user trust, and quality degradation that went undetected.
RAGAS (Retrieval-Augmented Generation Assessment) provides a framework for evaluating RAG pipelines on metrics including faithfulness (does the generated answer accurately reflect the retrieved context?), answer relevancy (does the answer address the question?), and context precision (did the retrieval system return relevant passages?). Running RAGAS evaluations on a golden test set — a curated collection of question-answer pairs with known correct answers — is the minimum credible evaluation baseline for a RAG system going to production.
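The shape of a golden-set run is simple, whatever scorer you plug in. RAGAS metrics and LLM judges both require model access, so this sketch keeps the answering and scoring functions pluggable; the structure, not the stub scorer, is the point:

```python
from typing import Callable

# A golden case pairs a question with its known-correct expected answer.
GoldenCase = dict  # e.g. {"question": str, "expected": str}

def evaluate_golden_set(cases: list[GoldenCase],
                        answer_fn: Callable[[str], str],
                        score_fn: Callable[[str, str], float]) -> dict:
    """Run the system under test over every golden case and aggregate scores.

    answer_fn: the RAG pipeline under test (question -> generated answer).
    score_fn: any 0..1 scorer -- exact match, a RAGAS metric, an LLM judge.
    """
    scores = [score_fn(answer_fn(c["question"]), c["expected"]) for c in cases]
    return {
        "mean_score": sum(scores) / len(scores),
        "failures": [c["question"] for c, s in zip(cases, scores) if s < 0.5],
    }
```

Wiring a harness like this into CI, so a retrieval or prompt change that regresses the golden set blocks the deploy, is what separates evaluation as a practice from evaluation as a one-off report.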
LLM-as-judge patterns use a capable LLM (typically GPT-4 or Claude) to evaluate the outputs of another LLM call, scoring for quality, relevance, and safety. This is not a substitute for human evaluation, but it scales where human evaluation does not — and with appropriate calibration against human raters, it produces useful signal.
Human evaluation protocols remain essential for high-stakes output categories. Structured spot-check programmes — where domain experts review a random sample of production outputs weekly — catch degradation that automated metrics miss and build the institutional knowledge needed to improve the system over time.
Layer 5: Observability
LLM calls are opaque by default. Tracing them — understanding which prompts were sent, what was retrieved, what the model returned, and how long each step took — is not optional in production.
LangSmith (built by the LangChain team) provided native tracing for LangChain-based applications. If you were using LangChain, LangSmith was the natural companion.
Langfuse emerged as the framework-agnostic alternative: an open-source LLM observability platform that could be self-hosted or used as a managed service, with integrations for most major models and orchestration approaches. For teams that had moved away from LangChain, Langfuse became the dominant observability choice.
Helicone took a proxy-based approach: route all LLM API calls through Helicone’s proxy, and get logging, cost tracking, and rate limiting without any SDK integration. Operationally simple and effective for teams that prioritised time-to-visibility over depth of tracing.
The minimum observability baseline for production: log every LLM call with its prompt, response, latency, token count, and cost. Alert on latency spikes and error rates. Retain logs long enough to diagnose issues reported after the fact.
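That baseline fits in one wrapper around the model call. A minimal sketch, assuming the call returns its text and token count and that cost is a flat per-token rate; real SDK response shapes and tiered pricing will differ:

```python
import json
import time
from typing import Callable

def logged_call(model: str, prompt: str,
                call_fn: Callable[[str], tuple[str, int]],
                cost_per_token: float,
                sink: Callable[[str], None] = print) -> str:
    """Invoke an LLM call and emit one structured log record per call.

    call_fn is assumed to return (response_text, total_tokens); sink is any
    line consumer -- stdout, a file, or a shipper to your log platform.
    """
    start = time.monotonic()
    response, tokens = call_fn(prompt)
    sink(json.dumps({
        "model": model,
        "prompt": prompt,
        "response": response,
        "latency_ms": round((time.monotonic() - start) * 1000, 1),
        "tokens": tokens,
        "cost_usd": round(tokens * cost_per_token, 6),
    }))
    return response
```

With records in this shape, the alerting half of the baseline (latency spikes, error rates) reduces to queries over structured logs you already retain.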
What Has Held Up vs. What Changed
Looking back from 2026: the foundation model layer became more capable and cheaper, but the leading providers remained the same. Context windows grew dramatically — Claude 3.5 Sonnet’s 200K token context made certain retrieval patterns obsolete for smaller document sets, because you could simply include more content directly rather than retrieving it. Open-source models improved enough that Llama-based deployments became genuinely competitive with closed models on many tasks, accelerating their adoption among clients with data sovereignty requirements.
The orchestration layer thinned out further. Most serious production teams write their own orchestration, treating frameworks as sources of patterns to learn from rather than dependencies to build on.
The evaluation and observability layers remain the most underinvested, and the correlation between teams that invested heavily in them and teams with stable, reliable production systems is not coincidental.
The Durable Architectural Decisions
The GenAI stack stabilised enough in 2024 that the core architectural decisions — which layers to compose, where to build vs. buy, how to handle evaluation and observability — are now choices that will be durable. The specific tools within each layer will continue to evolve, but the structure of the stack is unlikely to change fundamentally.
For organisations making these decisions now: pick your foundation model based on the specific capability requirements of your use case, not brand or hype. Build your orchestration thin and on primitives you control. Invest in evaluation before you ship, not after. Instrument everything from day one. The teams that did this in 2024 are the ones running stable production systems in 2026.
Related Reading
- Building a Production RAG System — A deep dive into the retrieval layer that sits at the core of most enterprise GenAI stacks.
- LLM Gateway Architecture for Production — Understand the API gateway layer that manages cost, routing, and resilience across the stack.
- Nematix Generative AI Services — How Nematix helps organisations make the right stack choices and implement them in production.
Learn how Nematix’s Innovation Engineering services help businesses build production-ready AI systems.