Large language models don't do much on their own. A model sitting behind an API is potential, not capability. What converts that potential into something useful—something that drafts, classifies, summarizes, retrieves, routes, and monitors at production scale—is the tooling layer wrapped around it. Choose that layer well and you multiply what your team can build. Choose it poorly and you spend months debugging integrations, managing cost overruns, or rebuilding from scratch when requirements change.
This survey covers the major categories of large language models tools, what each category actually does, how to evaluate options within it, and where teams typically go wrong. It's organized by function rather than by vendor, because the honest problem most teams face is that they don't yet know what kind of tool they need—they just know they need something. The goal here is to close that gap.
If you're earlier in the process—still deciding which model to build on, or whether to build at all—Getting Started with Large Language Models is the right place to begin. This article assumes you've cleared that hurdle and you're now choosing your stack.
The Five Functional Layers of LLM Tooling
Every serious LLM deployment involves roughly five layers: the model access layer, orchestration, retrieval and memory, evaluation, and observability. Teams that treat these as one undifferentiated "AI stack" end up with fragile systems. Teams that consciously choose and own each layer build things that last.
Understanding the layers doesn't mean you need a separate tool for each one. Some platforms compress multiple layers into one interface. That compression is sometimes worth the convenience and sometimes a trap. The rest of this article walks through each layer so you can make that call deliberately.
Model Access and API Management
The first tooling decision is how your applications talk to models. The options are direct API calls, managed inference platforms, and self-hosted models.
Direct API Access
OpenAI, Anthropic, Google (Gemini), Mistral, and Cohere all offer API access with per-token pricing. Direct access is the right starting point for most teams: fast to integrate, no infrastructure to manage, and easy to swap models when you want to run comparisons. The trade-offs are cost unpredictability at scale, latency that depends on shared infrastructure, and data-handling agreements that some regulated industries can't accept.
Managed Inference Platforms
Platforms like AWS Bedrock, Azure AI Studio, and Google Vertex AI let you access multiple models through a single cloud interface with enterprise SLAs, private networking, and billing tied to your existing cloud contract. They add latency configuration, reserved capacity options, and compliance controls. The cost per token is often similar to or slightly above direct API pricing, but the operational overhead drops substantially for teams already in those cloud ecosystems.
Self-Hosted and Open-Weight Models
Running models like Llama 3, Mistral Large, or Qwen2.5 on your own infrastructure (via vLLM, Ollama, or Together AI's inference API) gives you full data control, fixed-cost economics at high volume, and the ability to fine-tune without sending proprietary data to a third party. The real cost is engineering time: serving, scaling, and updating open-weight models requires infrastructure expertise that most agency teams don't have or don't want to maintain.
A practical rule: start with direct API access, migrate to managed cloud when compliance or volume demands it, and evaluate self-hosting only when you're processing enough tokens that the economics clearly favor it—typically somewhere north of several hundred million tokens per month.
Orchestration Frameworks
Orchestration tools manage the logic that happens between a user request and a model response: routing to the right model, chaining prompts, calling tools, managing state, and handling errors. This is where most of the actual application logic lives.
LangChain and LangGraph
LangChain is the most widely adopted orchestration framework and has the largest library of pre-built integrations (vector stores, document loaders, output parsers, tool connectors). LangGraph, built on top of it, adds graph-based state management for complex multi-step agents. The honest critique of LangChain is that its abstraction layers can obscure what's actually happening—useful for rapid prototyping, occasionally frustrating when you need to debug production failures or optimize for cost. Teams with strong Python backgrounds often hit its ceilings and reach for more explicit control.
LlamaIndex
LlamaIndex is specialized for retrieval-augmented generation (RAG) pipelines. Where LangChain is broad, LlamaIndex is deep—it has more sophisticated document indexing, retrieval strategies, and query pipeline primitives. If your application is primarily about making a model "know things" from your documents, LlamaIndex is worth serious consideration.
Direct SDK and Custom Orchestration
For production systems with unusual requirements, some teams skip frameworks entirely and orchestrate directly against model SDKs with custom routing logic. This trades ecosystem convenience for full visibility and control. It's more work upfront and harder to onboard new developers, but teams that do it well typically cite better reliability, easier debugging, and more predictable costs. The Anthropic SDK, OpenAI Python library, and Instructor (a library for structured outputs) are common building blocks for this approach.
Retrieval-Augmented Generation and Memory
RAG is the most broadly useful pattern in production LLM applications: rather than expecting the model to know everything, you retrieve relevant context at runtime and include it in the prompt. The tooling here spans vector databases, embedding models, and memory systems.
Vector Databases
The main options are Pinecone (fully managed, simple to start), Weaviate and Qdrant (open-source with self-hosting options and richer filtering), pgvector (a PostgreSQL extension that lets teams avoid a separate database), and Chroma (lightweight, often used for local development). The selection criteria that actually matter: latency at your query volume, metadata filtering capabilities, ease of updating and deleting vectors, and whether your team wants to manage additional infrastructure.
Embedding Models
The embedding model converts text into vectors for retrieval. OpenAI's text-embedding-3-small and text-embedding-3-large are the most common starting points. Cohere and Voyage AI offer embeddings with strong multilingual and domain-specific performance. For self-hosted retrieval, models like nomic-embed-text or BGE (from BAAI) are widely used. Embedding quality matters more than people expect: a better embedding model often improves retrieval more than a better generative model.
Memory and Context Management
For applications that need to remember across conversations or sessions, tools like Mem0 and Zep provide persistent memory layers that slot into LLM pipelines. For simpler cases, many teams manage conversation history directly—but as context windows grow (128K and longer is now common), thoughtful context management becomes a first-class engineering concern, not an afterthought.
Evaluation Frameworks
Evaluation is where most teams underinvest. Building without evaluation infrastructure means you're flying blind on quality, and you'll discover regressions in production rather than in testing. See How to Measure Large Language Models: Metrics That Matter for a full treatment of what to measure—this section focuses on the tools.
Reference-Based vs. LLM-as-Judge Evaluation
Reference-based evaluation compares outputs to known-good answers using exact match, BLEU, ROUGE, or semantic similarity. It's reliable and cheap but requires labeled data. LLM-as-judge evaluation uses a strong model (often GPT-4o or Claude 3.5 Sonnet) to score outputs against criteria—useful when you don't have ground-truth labels, but it introduces its own biases and cost.
Evaluation Platforms
Braintrust and Confident AI offer end-to-end evaluation pipelines with dataset management, automated scoring, and human-review workflows. Ragas is purpose-built for RAG evaluation, measuring retrieval precision, answer faithfulness, and context relevance. PromptFoo is a lightweight open-source tool that runs prompt comparisons and regression tests in CI—well-suited for teams that want eval integrated into their deployment pipeline without a heavyweight platform.
The right choice depends on team maturity: PromptFoo is a good entry point; Braintrust or Confident AI make sense once you're running systematic experiments or managing multiple applications.
Observability and Cost Management
Once you're in production, you need visibility into what the model is actually doing, what it's costing, and where it's failing. Generic application monitoring (Datadog, Grafana) captures infrastructure metrics but not LLM-specific telemetry like prompt tokens, completion tokens, latency by model, tool call success rates, or user satisfaction signals.
LLM Observability Platforms
LangSmith (from the LangChain team) provides tracing for LangChain applications with prompt inspection, run trees, and cost tracking. Helicone is model-agnostic, lightweight, and popular with teams that want observability without framework lock-in—it sits as a proxy between your application and the model API. Phoenix (from Arize) and Langfuse are strong alternatives with active open-source communities. Portkey adds cost routing and fallback logic on top of observability, letting you automatically switch to a cheaper or faster model when conditions are met.
Cost management deserves its own emphasis. LLM costs scale non-linearly as you add context, run longer chains, or add agentic loops. Teams routinely discover 3–10× cost surprises between prototype and production. Implementing caching (semantic caching with tools like GPTCache or Momento), prompt compression, and model-routing rules (use a cheaper model for classification, a stronger model only for generation) typically reduces production costs by 40–70% without meaningful quality loss. Understanding those economics is central to The ROI of Large Language Models: Building the Business Case.
Prompt Management and Experimentation
Prompts are code. They version, they break, and they need to be tested before deployment. Despite this, many teams manage prompts as strings scattered across application code, which makes systematic improvement nearly impossible.
Langfuse, PromptLayer, and Braintrust all offer prompt management features: versioned prompt storage, A/B testing, and production deployment without code changes. For smaller teams, a structured prompt registry in a Git repo with a lightweight review process is often enough. The key discipline is that prompt changes should go through the same review and testing cycle as any other code change—not be shipped live because they're "just text."
Selection Criteria: How to Choose Your Stack
The tool landscape changes faster than any survey can track, so here are the durable selection criteria rather than a ranked list.
Integration surface: Does the tool connect cleanly to what you already use? A great tool with a painful integration adds more risk than it removes.
Visibility and debuggability: Can you see what the tool is actually doing? Black-box convenience is a liability in production.
Vendor risk: Is the tool open-source, or does a pricing change or shutdown break your application? Weight this more heavily for tools that sit in the critical path.
Team capability match: A tool your team can operate well beats an architecturally superior tool your team treats as a mystery. See the broader Large Language Models: Trade-offs, Options, and How to Decide for a framework for making these judgment calls across your whole stack.
Cost structure: Understand not just the current price but how cost scales with volume and with feature usage. Many platforms charge per seat, per API call, and for premium features simultaneously.
The trend worth watching: as model providers build more native tooling (OpenAI's Assistants API, Anthropic's tool use, Google's Vertex AI Agent Builder), the boundary between model and orchestration layer is blurring. That's covered in more depth in Large Language Models: Trends and What to Expect in 2026. The practical implication now is to avoid building deep dependencies on orchestration abstractions that model providers might commoditize within your product's lifetime.
Frequently Asked Questions
Do I need all five tooling layers from day one?
No. Start with model access, a minimal orchestration approach (even direct SDK calls), and basic logging. Add retrieval when your application needs external knowledge, add formal evaluation when you're iterating on quality, and add full observability when you're in production with real users. Premature tooling complexity is a common failure mode.
Is LangChain still worth using in 2025?
Yes, with caveats. LangChain's integrations library is genuinely useful, and LangGraph has become a serious option for complex agent workflows. The abstraction layers can create debugging headaches in production, so teams often use LangChain to prototype, then decide whether to keep it or replace it with more explicit orchestration once they understand the system's real requirements.
What's the difference between an orchestration framework and a model API?
The model API is the endpoint that generates text. The orchestration framework manages everything else: which model to call, in what sequence, with what context, using what tools, with what error handling. Most real applications need both.
How much should I budget for LLM tooling beyond model costs?
For early-stage applications, tooling costs are often negligible—most evaluation and observability tools have free tiers. At production scale, plan for tooling to add 10–25% on top of model inference costs. That ratio improves as volume grows, since model costs scale with usage while many tooling costs have flat or seat-based components.
When does self-hosting a model make economic sense?
When your inference volume is high enough and consistent enough that dedicated GPU capacity costs less than per-token API pricing, and when your team has the infrastructure expertise to operate it reliably. For most agency teams, this threshold is higher than it appears—factor in engineering time, availability SLAs, and the cost of model updates before concluding that self-hosting saves money.
How do I avoid vendor lock-in in this tooling layer?
Prefer tools with open standards or export capabilities, keep your prompt logic and business logic decoupled from any single framework's abstractions, and design your integration points so that swapping a tool doesn't require rebuilding the whole application. Abstraction layers help here when done deliberately; they hurt when they obscure the underlying mechanics.
Key Takeaways
- LLM tooling falls into five functional layers: model access, orchestration, retrieval/memory, evaluation, and observability. Understand each before choosing tools.
- Start simple: direct API access, minimal orchestration, and basic logging cover most early-stage needs. Add layers as requirements become clear.
- Retrieval-augmented generation is the most broadly applicable production pattern; investing in good embedding models and retrieval infrastructure pays off quickly.
- Evaluation is systematically underinvested. Build it early with lightweight tools; upgrade to platforms as your application matures.
- Cost surprises are common between prototype and production. Semantic caching, model routing, and prompt compression typically reduce costs 40–70% without quality loss.
- Select tools by integration fit, debuggability, vendor risk, and team capability—not by feature lists alone.
- The model-orchestration boundary is blurring as providers build native tooling. Avoid deep framework dependencies that providers are likely to commoditize.