What Foundation Models Actually Are, for People Who Build

Foundation models are the infrastructure layer of modern AI. They are trained once, at enormous scale, and then adapted to thousands of downstream tasks — which makes understanding them one of the highest-leverage things a professional or agency operator can do right now. Whether you are evaluating vendors, building AI-assisted workflows, or advising clients on adoption, you need a clear mental model of what these systems are, how they work, and where they break.

This guide covers the essentials: the architecture, the training process, the major model families, the risks, and the practical decisions you will face when putting foundation models to work. It is structured so you can read it end to end or jump to the section most relevant to your current problem.

The term "foundation model" was coined by researchers at Stanford in 2021 to describe a new paradigm: instead of training a bespoke model for each task, you train one massive model on broad data and then adapt it. That framing changed how the industry thinks about AI development, procurement, and risk — and it is still the dominant paradigm today.

What a Foundation Model Actually Is

A foundation model is a large machine learning model trained on a wide, general dataset, designed to be adapted — through fine-tuning, prompting, or additional training — to a range of specific tasks. The defining characteristics are scale, generality, and adaptability.

Scale means billions (sometimes trillions) of parameters and training datasets measured in terabytes or petabytes. Generality means the model was not built for one narrow application; it learned broadly enough to transfer. Adaptability means the same base model can power a customer service chatbot, a code assistant, a document summarizer, and a legal research tool.

The Key Distinction from Traditional ML

In traditional machine learning, you collect labeled data for a specific task, train a model on that data, and deploy it for that task only. Change the task, start over. Foundation models invert this. You train once at massive cost and then adapt relatively cheaply. This changes the economics of AI dramatically — which is why it matters to anyone making build-versus-buy decisions or selecting vendors.

If you are building foundational knowledge about ML before diving into this layer, "Machine Learning Basics as a Career Skill: Why It Matters and How to Build It" is a useful primer.

The Architecture Behind the Power

Almost every major foundation model today is built on the transformer architecture, introduced by Google researchers in 2017. Understanding transformers at a conceptual level is worth the effort.

Self-Attention

Transformers process input by letting every token (word, subword, or pixel patch) attend to every other token in a sequence. This "self-attention" mechanism allows the model to capture long-range dependencies — the connection between a pronoun and the noun it refers to three paragraphs earlier, for instance. Earlier architectures like recurrent neural networks had to process tokens sequentially and struggled with long contexts. Transformers process them in parallel, which is why they can be trained at scale.
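
To make the mechanism concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. The vectors are made-up numbers, and the same vectors are reused as queries, keys, and values for simplicity; a real transformer derives each from learned projection matrices and runs many attention heads in parallel, none of which is shown here.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Scaled dot-product attention over one sequence.

    Each argument is a list of vectors, one per token.
    Returns (outputs, weights); weights[i][j] is how strongly
    token i attends to token j.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, weights = [], []
    for q in queries:
        # Every token scores itself against every token in the sequence.
        row = softmax([dot(q, k) / scale for k in keys])
        weights.append(row)
        # The output is the attention-weighted average of the value vectors.
        mixed = [
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs, weights

# Toy 3-token sequence with 2-dimensional vectors (illustrative values only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each token's loop iteration depends only on the inputs, not on any other token's result, which is the property that lets real implementations compute all rows at once as a matrix operation.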

Pretraining Objectives

During pretraining, models learn to predict. For language models, the most common objectives are:

Causal language modeling: Predict the next token given all previous tokens. GPT-style models use this.
Masked language modeling: Randomly mask tokens and predict what they are. BERT-style models use this.
Sequence-to-sequence: Encode an input sequence and decode an output sequence. T5 and related models use this.
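
The three objectives above differ mainly in how training examples are built from raw text. The sketch below shows one way to construct them from a token list. The function names, the `[MASK]` placeholder string, and the default masking rate are illustrative assumptions, not any particular library's API.

```python
import random

def causal_lm_examples(tokens):
    """Next-token prediction: pair each prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_lm_example(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Masked prediction: hide a random subset of tokens.

    Returns the masked sequence and a dict mapping each hidden
    position to the original token the model must recover.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets

def seq2seq_example(source_tokens, target_tokens):
    """Sequence-to-sequence: the whole input maps to a whole output."""
    return {"encoder_input": source_tokens, "decoder_target": target_tokens}

sentence = ["the", "cat", "sat"]
print(causal_lm_examples(sentence))
# [(['the'], 'cat'), (['the', 'cat'], 'sat')]

masked, targets = masked_lm_example(sentence * 4, mask_rate=0.3, seed=1)
print(masked, targets)

print(seq2seq_example(["summarize:", "long", "text"], ["short", "text"]))
```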

Vision models use analogous objectives, such as predicting masked image patches or contrastive learning across image-text pairs. Multimodal models like GPT-4V handle more than one modality, though vendors do not always publish how their training objectives are combined.

Major Foundation Model Families

The landscape is large but navigable once you understand the main branches.

Large Language Models (LLMs)

LLMs are the most commercially prominent category. Key families include:

GPT series (OpenAI): GPT-3, GPT-4, GPT-4o. General-purpose, strong at instruction-following and reasoning.
Claude series (Anthropic): Known for longer context windows and a strong focus on safety alignment.
Gemini (Google DeepMind): Natively multimodal from training, integrated tightly with Google infrastructure.
Llama series (Meta): Open-weight models that can be downloaded, run locally, and fine-tuned under Meta's license terms, which matters for organizations with privacy or cost constraints.
Mistral / Mixtral: Efficient open-weight models; Mixtral uses a mixture-of-experts architecture to achieve high performance at lower inference cost.

Vision and Multimodal Models

CLIP (OpenAI): Trained on image-text pairs, used for zero-shot image classification and search.
DALL-E, Midjourney, Stable Diffusion: Image generation models, increasingly used in content and creative workflows.
Whisper (OpenAI): Speech-to-text foundation model, open-weight, extremely practical for transcription tasks.

Code Models

Tools like GitHub Copilot (a product powered by OpenAI models such as Codex and GPT-4) and models like Code Llama specialize in code generation, completion, and explanation. For agency operators, these are often the fastest path to measurable productivity gains.

How Adaptation Works

Pretraining is only the first step. The base model is general; your use case is specific. Adaptation closes that gap.

Prompting and In-Context Learning

The cheapest form of adaptation. You provide instructions, examples, or context in the prompt itself, and the model generalizes from them. Few-shot prompting — giving 3–5 examples of the desired input-output format — typically outperforms zero-shot on structured tasks. No training required, no model weights changed.
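
A few-shot prompt is just a carefully assembled string. The sketch below builds one from an instruction, a handful of worked examples, and the new input. The `Input:`/`Output:` labels and the function name are illustrative choices; any consistent format should work, and the resulting string would be sent to whichever model API you use.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Assemble an instruction, worked examples, and the new case into one prompt."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")  # blank line between examples
    # End with an open "Output:" so the model completes the pattern.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)

examples = [
    ("Invoice total does not match my order", "billing"),
    ("App crashes when I upload a photo", "technical"),
    ("Do you offer volume discounts?", "sales"),
]
prompt = build_few_shot_prompt(
    "Classify each support message as billing, technical, or sales.",
    examples,
    "I was charged twice this month",
)
print(prompt)
```

Keeping the examples in a list rather than hard-coding them into a string makes it easy to swap them out when you test which examples help most.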

Fine-Tuning

You take a pretrained model and continue training it on a smaller, task-specific dataset. The model's weights update to better reflect your domain or style. Full fine-tuning is expensive; parameter-efficient methods like LoRA (Low-Rank Adaptation) make fine-tuning feasible on modest hardware by freezing the original weights and training small low-rank matrices added alongside them, so only a small fraction of the parameter count is trained. Typical use cases: adapting tone, learning proprietary terminology, improving performance on a narrow task class.
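
A quick calculation shows why LoRA is cheap. For a single weight matrix of shape d_out by d_in, LoRA leaves the original frozen and trains two small matrices of shapes d_out by r and r by d_in, where r is the rank. The layer size and rank below are illustrative assumptions, not figures from any specific model.

```python
def lora_parameter_counts(d_out, d_in, rank):
    """Compare trainable parameters: full fine-tuning vs. LoRA for one weight matrix."""
    full = d_out * d_in             # every weight in the matrix is trainable
    lora = rank * (d_out + d_in)    # two low-rank factors: (d_out x r) and (r x d_in)
    return full, lora

full, lora = lora_parameter_counts(d_out=4096, d_in=4096, rank=8)
print(f"full fine-tuning: {full:,} trainable parameters")   # 16,777,216
print(f"LoRA (rank 8):    {lora:,} trainable parameters")   # 65,536
print(f"fraction trained: {lora / full:.2%}")               # 0.39%
```

Fewer trainable parameters means less optimizer state and gradient memory, which is why modest hardware becomes viable.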

Retrieval-Augmented Generation (RAG)

Rather than baking knowledge into weights, RAG retrieves relevant documents at inference time and injects them into the prompt. This is the right choice when your data changes frequently or when you need cited, traceable outputs. RAG does not require retraining, making it fast to implement and update.
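
The retrieve-then-inject flow can be sketched in a few lines. This toy version ranks documents by word-overlap cosine similarity; production systems typically use embedding models and a vector index instead, so treat the scoring here as a stand-in. The document ids, texts, and function names are all hypothetical.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    numerator = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return numerator / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, documents, top_k=2):
    """Return the top_k documents most similar to the question."""
    query = bag_of_words(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query, bag_of_words(doc["text"])),
        reverse=True,
    )
    return ranked[:top_k]

def build_rag_prompt(question, documents, top_k=2):
    """Inject retrieved sources into the prompt, with ids so answers can cite them."""
    hits = retrieve(question, documents, top_k)
    context = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in hits)
    return (
        "Answer using only the sources below and cite their ids. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

documents = [
    {"id": "policy-1", "text": "A refund is available within 30 days of purchase."},
    {"id": "policy-2", "text": "Support hours are 9am to 5pm on weekdays."},
    {"id": "policy-3", "text": "Enterprise plans include a dedicated account manager."},
]
question = "How many days do I have to request a refund?"
print(build_rag_prompt(question, documents, top_k=1))
```

Updating the knowledge base means editing `documents`, not retraining anything, which is the practical advantage the paragraph above describes.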

Reinforcement Learning from Human Feedback (RLHF)

This is how models like ChatGPT and Claude were aligned to be helpful and safer. Human raters rank model outputs; those preferences train a reward model; the reward model guides further policy training. RLHF is expensive and requires significant human annotation, making it the province of model developers rather than downstream users — but understanding it helps you reason about model behavior and its failure modes.
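
The step where preferences train a reward model can be illustrated with a pairwise loss. The sketch below uses one commonly used formulation, the negative log-sigmoid of the reward margin; this is an assumption for illustration, since the exact loss any given vendor uses is not stated here. The reward values are made-up scalars standing in for a reward model's outputs.

```python
import math

def pairwise_preference_loss(reward_preferred, reward_rejected):
    """Loss for one human comparison: -log(sigmoid(r_preferred - r_rejected)).

    The loss is small when the reward model scores the human-preferred
    output higher than the rejected one, and large when it gets the
    ordering wrong.
    """
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log(1.0 + math.exp(-margin))

# Reward model agrees with the rater: low loss.
print(round(pairwise_preference_loss(2.0, 0.0), 3))
# Reward model is indifferent: loss equals log(2).
print(round(pairwise_preference_loss(0.0, 0.0), 3))
# Reward model disagrees with the rater: high loss.
print(round(pairwise_preference_loss(0.0, 2.0), 3))
```

Training pushes this loss down across many comparisons, so the reward model learns to rank outputs the way raters did. Its blind spots then carry over into the policy it guides, which is one source of the failure modes mentioned above.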

Evaluating Foundation Models

Choosing a model is a purchasing and engineering decision, not just a technical one. Evaluate on dimensions that actually matter to your use case.

Performance on Your Task

General benchmarks (MMLU, HumanEval, HellaSwag) are useful for rough comparisons but rarely predictive of production performance on specific tasks. Build a small, representative evaluation set from your own data. Test 50–100 representative inputs with defined quality criteria. This is the only reliable signal.
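
A task-specific evaluation set does not need heavy tooling. The sketch below runs a model function over a list of cases and reports a pass rate. `stub_model` is a hypothetical placeholder for a real model call, and exact-match scoring is only one possible quality criterion; swap in whatever check fits your task.

```python
def evaluate(model_fn, eval_set, passes):
    """Run model_fn over every case and score it with the passes(output, expected) check."""
    results = []
    for case in eval_set:
        output = model_fn(case["input"])
        results.append({
            "input": case["input"],
            "output": output,
            "passed": passes(output, case["expected"]),
        })
    pass_rate = sum(r["passed"] for r in results) / len(results)
    return pass_rate, results

def stub_model(text):
    """Hypothetical stand-in for a real model call: a naive keyword rule."""
    return "billing" if "invoice" in text.lower() else "technical"

def exact_match(output, expected):
    return output.strip().lower() == expected

eval_set = [
    {"input": "My invoice is wrong", "expected": "billing"},
    {"input": "The app crashes on login", "expected": "technical"},
    {"input": "I was charged twice", "expected": "billing"},
    {"input": "Invoice PDF will not download", "expected": "billing"},
]

pass_rate, results = evaluate(stub_model, eval_set, exact_match)
print(f"pass rate: {pass_rate:.0%}")  # 75%
for r in results:
    if not r["passed"]:
        print("FAILED:", r["input"], "->", r["output"])
```

Printing the failures matters as much as the score: the failing cases tell you whether to fix the prompt, the retrieval step, or the model choice.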

Context Window

Context window size determines how much text the model can process in one call. Windows range from 4K tokens (older models) to 1M tokens (Gemini 1.5 Pro). Larger windows enable processing of long documents, multi-turn conversations, and complex structured inputs — but larger inputs cost more per call.

Cost and Latency

API pricing is typically per million input and output tokens. At scale, these costs compound quickly. A workflow calling a GPT-4-class model for every customer email at an illustrative $15 per million output tokens looks very different at 10,000 emails/month than at 10 million. Check current vendor pricing before budgeting, since rates change often. Latency matters for real-time applications; batch workloads have more flexibility.
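
The arithmetic behind that comparison is worth running yourself. The sketch below uses the $15 per million output tokens figure from the text and assumes 300 output tokens per email, which is an illustrative guess; input-token costs are left out to keep the example short, so real bills would be higher.

```python
def monthly_output_cost(emails_per_month, output_tokens_per_email, price_per_million_output_tokens):
    """Monthly spend on output tokens only (input tokens are not counted here)."""
    total_tokens = emails_per_month * output_tokens_per_email
    return total_tokens / 1_000_000 * price_per_million_output_tokens

for volume in (10_000, 10_000_000):
    cost = monthly_output_cost(
        emails_per_month=volume,
        output_tokens_per_email=300,          # assumed average reply length
        price_per_million_output_tokens=15.0, # figure used in the text
    )
    print(f"{volume:>10,} emails/month -> ${cost:,.2f}")
# 10,000 emails/month     -> $45.00
# 10,000,000 emails/month -> $45,000.00
```

Cost scales linearly with volume, so a thousandfold increase in emails is a thousandfold increase in spend. That is the point at which smaller models or shorter outputs start to matter.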

Open vs. Closed Weights

Closed-weight models (GPT-4, Claude) offer strong performance and ongoing improvement but require API access, which means data leaves your infrastructure. Open-weight models (Llama 3, Mistral) can be deployed on-premises, giving you data sovereignty and predictable costs — at the expense of infrastructure overhead.

Risks You Need to Understand

Foundation models inherit and amplify certain failure modes. Understanding them is not optional for anyone deploying these systems professionally. A fuller treatment of these failure patterns is in "The Hidden Risks of Machine Learning Basics (and How to Manage Them)".

Hallucination

Models generate plausible-sounding but false content. This is a structural property of how they work, not a bug that will be fully patched. Mitigation: RAG for factual tasks, human review for high-stakes outputs, confidence estimation, and explicit prompting to say "I don't know."

Bias and Representation

Training data reflects historical distributions. Models trained on internet text reproduce the biases embedded in that text. For applications touching hiring, credit, content moderation, or healthcare, this requires active evaluation and mitigation, not just awareness.

Prompt Injection

Malicious input — in a document the model reads, a web page it browses, or a user message — can hijack model behavior by overriding instructions. This is an active security concern for any agentic or document-processing workflow.

Data Privacy

Sending data to third-party APIs means that data is transmitted to and processed by external infrastructure. Review provider data retention and training policies carefully. For regulated industries or sensitive client data, open-weight on-premise deployment is often the right answer.

Putting Foundation Models to Work in an Agency Context

For agency operators, the highest-value applications typically fall into three categories: content production, client-facing analysis, and internal process automation.

Content production — drafts, summaries, repurposing, translation — is the easiest entry point because quality is easy to evaluate and errors are low-stakes. Analysis tasks — extracting insights from documents, classifying support tickets, synthesizing research — benefit most from RAG architectures. Process automation — routing, classification, form filling — benefits most from fine-tuned or carefully prompted smaller models with deterministic guardrails.
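
For process automation, a deterministic guardrail can be as simple as refusing any model output that is not on an approved list. The sketch below shows that pattern for ticket routing. The queue names and the fallback queue are hypothetical, and the model output is passed in as a plain string so the check stays independent of any particular API.

```python
ALLOWED_QUEUES = {"billing", "technical", "sales"}
FALLBACK_QUEUE = "human_review"

def route_ticket(model_output, allowed=ALLOWED_QUEUES, fallback=FALLBACK_QUEUE):
    """Accept the model's label only if it is exactly one of the allowed queues."""
    label = model_output.strip().lower()
    # Anything unexpected (extra words, a new label, empty text) goes to a person.
    return label if label in allowed else fallback

print(route_ticket(" Billing\n"))                # billing
print(route_ticket("I think this is billing"))   # human_review
print(route_ticket("refunds"))                   # human_review
```

The model can still pick the wrong allowed queue, so this does not replace evaluation. It only guarantees that downstream systems never receive a value they cannot handle.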

When rolling out these capabilities across a team, start with constrained, high-volume tasks where the cost of errors is low and the benefit is measurable. "Rolling Out Machine Learning Basics Across a Team" covers the change-management dimension of this in detail.

The common mistake is starting with the model and working backward to the use case. Start with a concrete, measurable workflow problem. Then select the model and adaptation strategy that fits the cost, latency, privacy, and quality requirements of that specific problem.

Frequently Asked Questions

What is the difference between a foundation model and a large language model?

A large language model (LLM) is a type of foundation model specialized for text. "Foundation model" is the broader category that includes language, vision, audio, and multimodal models — any large model trained on broad data and designed for adaptation. All LLMs are foundation models, but not all foundation models are LLMs.

Are open-weight models as good as closed models like GPT-4?

On many tasks, recent open-weight models like Llama 3 70B or Mistral Large are competitive with GPT-3.5-class models and close the gap with GPT-4 on structured, well-defined tasks. For complex reasoning, long-context understanding, and instruction-following on ambiguous tasks, frontier closed models still tend to lead. The right choice depends on your task, budget, and data privacy requirements — not a universal ranking.

How much does it cost to fine-tune a foundation model?

Parameter-efficient fine-tuning with LoRA on a 7B-parameter model can run under $100 on a cloud GPU for a moderate dataset. Full fine-tuning of a 70B model on proprietary hardware can run into tens of thousands of dollars. For most agency use cases, prompt engineering and RAG deliver better ROI than fine-tuning, which should be reserved for situations where those approaches have clearly hit a ceiling.

Will foundation models become obsolete quickly?

Individual model versions do iterate rapidly. By some rough estimates, frontier capabilities on common benchmarks improve substantially every 12–18 months, though such figures depend heavily on which benchmarks you pick. But the architectural paradigm (transformer-based, large-scale pretraining, adaptation) has been stable for several years and shows no imminent signs of displacement. Investing in understanding how these systems work is durable knowledge; betting on any single provider or model version is not.

What is a "model" versus a "system" built on a model?

A model is the trained artifact — a set of weights that transforms inputs to outputs. A system is the model plus everything around it: the prompts, retrieval pipelines, guardrails, APIs, user interfaces, and orchestration logic. Most production AI products are systems. The model is one component. Understanding this distinction helps you debug failures and allocate improvement effort correctly.

How do I know if a foundation model is appropriate for my use case?

A foundation model fits well when the task requires language understanding, generation, or classification across variable inputs; when labeled training data is scarce; and when the use case would benefit from general world knowledge. It fits poorly when you need guaranteed deterministic outputs, when inference cost at scale is prohibitive, or when a simpler rule-based or classical ML solution already works reliably. If you are still calibrating your intuitions about when ML applies, "Machine Learning Basics: The Questions Everyone Asks, Answered" provides a useful framework.

Key Takeaways

Foundation models are large, general-purpose models trained at scale and adapted to specific tasks — they are the infrastructure layer of modern AI.
The transformer architecture, specifically self-attention, is what makes scale and generality possible.
The main model families are LLMs, vision models, multimodal models, and code models — each with open-weight and closed-weight variants.
Adaptation methods range from nearly free (prompting, which needs no training) to moderately expensive (RAG, LoRA fine-tuning) to very expensive (full fine-tuning, RLHF). Match the method to the business case.
Hallucination, bias, prompt injection, and data privacy are structural risks, not edge cases — build mitigation into your workflows from the start.
For agency operators, the right entry point is a concrete, measurable workflow problem — not a model — and the evaluation should be grounded in your own representative data, not generic benchmarks.
Open-weight models offer data sovereignty and cost predictability; closed-weight frontier models offer peak performance and ongoing improvement. Know which trade-off matters more for each use case.

Agency Script Editorial
