Foundation Models (FMs), such as GPT-3 and Stable Diffusion, mark the beginning of a new era in machine learning and artificial intelligence. What are they and how will they impact your business? Find out in the guide below.
What are foundation models?
Foundation models are large AI models trained on enormous quantities of unlabeled data—usually through self-supervised learning. This process results in generalized models capable of a wide variety of tasks, such as image classification, natural language processing, and question-answering, with remarkable accuracy.
FMs excel at generative, human-in-the-loop tasks, such as writing marketing copy or creating detailed art from a simple prompt. But adapting and deploying them for enterprise use cases can pose challenges. When a task exceeds a foundation model’s capabilities, it can return an incorrect and fabricated “hallucination” that appears as plausible as a correct response.
That’s where the “foundation” in foundation models comes in. Data scientists can build upon generalized FMs and fine-tune custom versions with domain-specific or task-specific training data. This approach greatly enhances their domain- or task-specific performance, and could open new worlds of capabilities for organizations spanning many industries.
How do foundation models generate responses?
Foundation models underpin generative AI capabilities, from text-generation to music creation to image generation. To do this, FMs use learned patterns and relationships to predict the next item or items in a sequence. In the case of text-generating models, that’s the next word or phrase. For image-generation models, that’s the next, less blurry version of an image. In either case, the model starts with a seed vector derived from a prompt.
Due to the way FMs choose the next word, phrase or image feature, foundation models can generate an enormous number of unique responses from a single prompt. The models generate a probability distribution over all items that could follow the input and then choose the next output randomly from that distribution. This randomization is amplified by the models’ use of context; each time the model generates a probability distribution, it considers the last generated item—which means each prediction impacts every prediction that follows.
For example, let’s say we start a foundation model with a single token: “Where.” Any number of words could follow that token—from “dancing” to “pizza” to “excavators”—but variations on the verb “is” will be more common. Let’s say “is”, “are”, “was”, and “were” each has a probability of 0.1, and they’re stacked at the beginning of the distribution. Our model will randomly pick a value between zero and one. If that value is less than 0.4, it will select a variation of “is.” Assuming it picks “is,” the model will now generate a new probability distribution for what words could follow “Where is”—which will likely lean heavily toward possessive pronouns like “your” and “my.”
The model continues this way until it generates a response that it predicts to be complete. In this case, that might be “Where is my sweater? I left it here yesterday.”
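The selection step in the example above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model is implemented. Only the four 0.1 probabilities come from the example; the probabilities assigned to the other words, and the names `NEXT_TOKEN_PROBS`, `pick_token`, and `sample_token`, are assumptions made for the sketch.

```python
import random

# Illustrative distribution for the token that follows "Where".
# Only the four 0.1 entries come from the example in the text; the
# remaining values are made up so that the probabilities sum to 1.
NEXT_TOKEN_PROBS = [
    ("is", 0.1),
    ("are", 0.1),
    ("was", 0.1),
    ("were", 0.1),
    ("dancing", 0.2),
    ("pizza", 0.2),
    ("excavators", 0.2),
]


def pick_token(distribution, draw):
    """Return the token whose cumulative probability range contains `draw`."""
    cumulative = 0.0
    for token, probability in distribution:
        cumulative += probability
        if draw < cumulative:
            return token
    # Guard against floating-point round-off when `draw` is very close to 1.
    return distribution[-1][0]


def sample_token(distribution, rng=random):
    """Draw a value in [0, 1) and map it to a token."""
    return pick_token(distribution, rng.random())
```

Calling `pick_token(NEXT_TOKEN_PROBS, 0.35)` lands in the fourth 0.1-wide slot and returns “were,” while any draw of 0.4 or more falls through to one of the other words. A real model would repeat this step with a freshly computed distribution after every generated token.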
What is self-supervised learning?
Self-supervised learning is a kind of machine learning that creates labels directly from the input data. For example, some large language models generate embedding values for words by showing the model a sentence with a missing word and asking the model to predict the missing word. Some image models use a similar approach, masking a portion of the image and then asking the model to predict what exists within the mask. In either case, the masked portion of the original data becomes the “label” and the non-masked portion becomes the input.
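As a rough sketch of how labels can come straight from the input, the Python snippet below turns one sentence into several (input, label) pairs by hiding one word at a time. The function name `make_masked_examples` and the `[MASK]` placeholder are assumptions for illustration. Production systems typically work on sub-word tokens and mask a random subset of them rather than every word in turn.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked_sentence, hidden_word) pairs from a plain sentence.

    The hidden word acts as the label; the rest of the sentence is the input.
    """
    words = sentence.split()
    examples = []
    for position, word in enumerate(words):
        # Replace exactly one word with the mask placeholder.
        masked = words[:position] + [mask_token] + words[position + 1:]
        examples.append((" ".join(masked), word))
    return examples
```

For the sentence “the cat sat,” this yields three training pairs, one of which is the input “the [MASK] sat” with the label “cat.” No human annotation is involved at any point.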
This differs from previous generations of machine learning architectures, which fell into two categories: supervised and unsupervised. Unsupervised learning requires no labels and identifies underlying patterns in the data. Model architectures that qualify as “supervised learning”—from traditional regression models to random forests to most neural networks—require labeled data for training.
FAQs (frequently asked questions) about foundation models
Foundation models are a new and fast-moving field, and many people have questions about them. We’ve collected and answered some of the most common ones here.
What are some examples of Foundation Models?
The field of foundation models is developing fast, but here are some of the most noteworthy entries as of this page’s most recent update.
BERT, an acronym that stands for “Bidirectional Encoder Representations from Transformers,” was one of the first foundation models and pre-dated the term by several years. The open-source model, the first to be trained using only a plain-text corpus, quickly became an essential tool for natural language processing researchers.
BERT models proved useful in several ways, including quantifying sentiment and predicting the words likely to follow in unfinished sentences.
ChatGPT elevated foundation models into the public consciousness by letting anyone interact with a large language model through a user-friendly interface. The service also maintains a state that stretches back over many requests and responses, imbuing the session with conversational continuity. The technology demonstrated the potential of foundation models as well as the effort required to bring them to a production use-case; while an LLM serves as ChatGPT’s backbone, OpenAI built several layers of additional software to enable the interface.
The “GPT” in GPT-3 stands for “Generative Pre-trained Transformer.” GPT-3 is best known as the original backbone of ChatGPT. This model debuted in June 2020, but remained a tool for researchers and ML practitioners until its creator, OpenAI, debuted a consumer-friendly chat interface in November 2022.
GPT—with or without its “Chat” wrapper—proved useful for generating text on demand from human-readable prompts.
DALL-E, produced by OpenAI, is a multi-modal implementation of GPT-3 trained on text/image pairs. The resulting model allows users to describe a scene, and DALL-E will generate several digital images based on the instructions.
DALL-E can also accept images as an input and create variations on them.
Stable Diffusion, released in 2022, offers capabilities similar to DALL-E: it can create images from descriptions, in-paint missing portions of pictures, and extend an image beyond its original borders.
It differs from DALL-E by using a U-net architecture. This approach uses successive neural network layers to convert the original visual information describing colors and light levels into increasing levels of abstraction until it reaches the middle of the “U.” The second half of the U-net expands this abstraction back into an image.
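The “U” shape is easier to picture by tracing how the data’s dimensions change from layer to layer. The Python sketch below does only that bookkeeping and performs no image processing. It assumes, as a common convention rather than a detail of Stable Diffusion itself, that each step down halves the spatial size and doubles the channel count and that each step up reverses this. The function name `unet_shapes` and its parameters are hypothetical.

```python
def unet_shapes(size, channels, depth):
    """Trace (spatial_size, channels) through a U-shaped network.

    Assumes each contracting step halves the size and doubles the channels,
    and each expanding step does the opposite.
    """
    shapes = [(size, channels)]
    for _ in range(depth):  # first half of the "U": toward abstraction
        size, channels = size // 2, channels * 2
        shapes.append((size, channels))
    for _ in range(depth):  # second half of the "U": back toward an image
        size, channels = size * 2, channels // 2
        shapes.append((size, channels))
    return shapes
```

With `unet_shapes(64, 32, 3)`, the trace shrinks from (64, 32) to (8, 256) at the middle of the “U” and then grows back to (64, 32). Real U-nets also pass information directly between matching levels of the two halves, which this sketch leaves out.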
Applying Foundation Models
Foundation Models have the potential to make business contributions through a number of applications, including the following non-exhaustive list.
Foundation Models such as BERT can analyze customer feedback, reviews, and social media posts to determine the sentiment towards products or services. This can help provide valuable insights for product development and marketing strategies. While these models struggle on edge-cases, such as sarcasm or irony, they achieve high accuracy rates in aggregate.
Chatbots and virtual assistants
ChatGPT demonstrated that foundation models can serve as the seed for competent chatbots and virtual assistants that may help businesses provide customer support and answer common questions. However, building a chatbot or virtual assistant on top of something as generalized as ChatGPT could lead to embarrassing moments where the bot answers questions in ways that are out of line with business priorities. Further constraints would be necessary.
Content generation (written)
Foundation models can help businesses generate content, such as product descriptions or marketing copy. However, the models may struggle with generating text that could be called “creative” or that captures the unique voice and tone of the business. Additionally, some generated content may be repetitive or nonsensical.
Content generation (visual)
Multi-modal foundation models could help businesses—particularly design-focussed businesses—generate rough drafts of visual ideas. While the images created by the current generation of foundation models are unlikely to meet a business’s high standards, they can serve as a rapid brainstorming tool that allows a human designer to identify the most promising design and create a final version of it.
A business operating in a multilingual environment could use foundation models to translate product descriptions, marketing materials, and customer support content into different languages. However, the models may struggle with translating idiomatic expressions, cultural references, or other language-specific nuances that human translators would likely handle better.
Foundation models can help businesses summarize and extract relevant information from customer feedback, such as support requests or online reviews. However, foundation models may struggle with accurately identifying the relevant information, particularly if the information is presented in a non-standard format. Additionally, the models may need to be fine-tuned for specific domains or types of data to achieve optimal performance.
Exciting business applications
Bing + ChatGPT
In 2023, Microsoft began experimenting with a closed beta version of Bing that incorporated a chat interface powered by ChatGPT. The interface allowed users to make complex requests and receive a human-readable response annotated with web links. For example, a New York Times reporter asked Bing to recommend e-bikes that might fit in the back of his Toyota Highlander. The interface recommended two such bikes based on their small size.
GitHub Copilot is an AI-powered code assistant. It is based on GPT and uses machine learning algorithms to generate code suggestions as developers write. Copilot can suggest entire code blocks, comments, and function calls based on the context of the code and the developer’s previous coding patterns. Copilot has the potential to significantly improve the speed and efficiency of coding.
Dog and Boy
In January 2023, Netflix announced the release of an animated short film called “Dog and Boy” that used artificial intelligence to assist with the creation of background images. A walkthrough at the end of the film demonstrated the process for one particular image. A human submitted a rough sketch to an AI, which returned a high-resolution rendering. The human then requested alterations, yielding a second AI-generated draft, which the human hand-revised into its final form.
Challenges to FM adoption in enterprises
While many businesses are excited and eager about foundation models, there are a number of reasons why an enterprise may not want to deploy them in production—now or in the future.
Foundation models are extremely complex and require significant computational resources to develop, train, and deploy. For narrowly defined use cases, that cost may not be justifiable when a smaller model may achieve similar (or better) results for a much lower price.
Foundation models are often described as “black boxes.” The humans using them may never understand how the models arrive at their predictions or recommendations. This can make it challenging for businesses to explain or justify their decisions to customers or regulators.
Privacy and security
Foundation models often require access to sensitive data, such as customer information or proprietary business data. This can raise concerns about privacy and security, particularly if the model is deployed in the cloud or accessed by third-party providers.
Legal and ethical considerations
The deployment of foundation models may raise legal and ethical considerations related to bias, discrimination, and other potential harms. The models are trained on an enormous quantity of data from the wild, and not all of that data will align with your business’s values. Businesses must ensure that their models are developed and deployed in a responsible and ethical manner, which may require additional oversight, testing, and validation.
Foundation models come pre-trained on massive amounts of wide-ranging data and may not be well-suited to a specific business’s needs. Out-of-the-box, FMs often fall well short of a business’s required accuracy rate for a production application. Fine-tuning the model with domain-specific training data may push the FM over the bar, but a business may struggle to justify the time and cost required to do so.
What’s necessary for enterprises to adopt FMs?
Organizations determined to adopt Foundation Models must clear several hurdles to properly and safely use them for production use-cases.
For most enterprise use-cases, using a foundation model via API is not an option. OpenAI is very open about incorporating user data into their model training, which means a business using the service could find its proprietary information popping up as a ChatGPT response. Even if a vendor doesn’t use your data in their model, sending sensitive information to an API adds one more opportunity for malicious actors to access your data.
Out-of-the-box foundation models trained on general knowledge will struggle on domain-specific tasks. To improve the model’s performance to the point where business leaders feel comfortable using it, data scientists will have to gather and prepare data for fine tuning.
Foundation models are computationally complex and expensive to run. A report in Ars Technica noted that a ChatGPT-style search interface would cost roughly 10 times as much as Google’s standard keyword search. But organizations need not use the largest foundation models for their end use cases. They may instead use them to help train smaller, more focused models that can achieve the same (or better) performance for a fraction of the price.
Researchers have published hundreds of papers relevant to the advancement of foundation models and large language models, but the following papers roughly sketch the trajectory of the field.
Radford et al. (2016)
This paper introduced DCGANs, a type of generative model that uses convolutional neural networks to generate images with high fidelity.
Vaswani et al. (2017)
This paper introduced the Transformer architecture, which revolutionized natural language processing by enabling parallel training and inference on long sequences of text.
Devlin et al. (2018)
This paper introduced BERT, a language model that uses bidirectional context to better understand the meaning of words in a sentence. BERT has become a widely used pretraining model in natural language processing.
Brown et al. (2020)
This paper introduced GPT-3, a language model that can perform a wide range of natural language tasks with little or no task-specific training. GPT-3 is notable for its large size (175 billion parameters) and its ability to generate coherent and convincing text.
Ramesh et al. (2021)
This paper introduced DALL-E, a generative model that can create images from textual descriptions. DALL-E has demonstrated impressive capabilities in generating realistic and imaginative images from natural language input.
Rishi Bommasani, Percy Liang, et al. (2021)
This paper highlights progress made in the field of foundation models, while also acknowledging their risks—particularly the potential ethical and societal concerns, the impact on job displacement, and the potential for misuse by bad actors.
Foundation Model Landscape
The foundation model landscape is vast and varied. Academic institutions, open-source projects, exciting startups and legacy tech companies all contribute to the advancement of the field. This technology has moved fast and continues to do so. Compiling a complete snapshot of current FM resources is an enormous task beyond the scope of this document, but the following non-exhaustive list sketches some important contours in the landscape.
The Stanford Center for Research on Foundation Models
Founded in 2021, the Stanford Center for Research on Foundation Models (CRFM) focuses on advancing the development and understanding of robust, secure, and ethical foundation models. CRFM aims to address the technical, social, and ethical challenges foundation models present and to develop solutions that can benefit society through research and partnerships with government agencies and corporations.
Founded in 2015, OpenAI conducts cutting-edge research in machine learning, natural language processing, computer vision, and robotics, and shares its findings with the scientific community through publications and open-source software. OpenAI is responsible for debuting GPT-3, DALL-E and ChatGPT. The company has partnered with Microsoft since 2019. In early 2023, Microsoft began integrating ChatGPT into the Bing search engine.
Stylized as “co:here” and co-founded by an author of one of the papers that launched the field of foundation models, Cohere offers a suite of large language models via API. By using Cohere’s software libraries and endpoints, developers can build applications that understand written content or generate written output without having to train or maintain their own LLMs.
ArXiv.org hosts and distributes scientific research papers from many disciplines, including mathematics, physics, and computer science. Members of the scientific community, including many members of the foundation model research community, use ArXiv.org as a way to share preprints of research papers before they are published in academic journals. Cornell University operates and maintains the site with funding from several organizations, including the Simons Foundation and the National Science Foundation.
Hugging Face develops and maintains open-source resources that allow programmers to easily access and build upon foundation models, including BERT, GPT, and RoBERTa. The company is best known for NLP tools, but also enables the use of computer vision, audio, and multimodal models. Hugging Face’s contributions to the NLP community have helped accelerate progress in the field and make it more accessible to developers and businesses.
Google has developed several large-scale models with important impacts on the field of foundation models, including T5 and BERT—the latter of which has become a standard tool for many NLP researchers. In February 2023, Google publicly debuted Bard, its large language model intended to compete with GPT-3.
Microsoft launched its Language Understanding Intelligent Service in 2016. The cloud-based NLP platform enables developers to create and deploy custom NLP models for use in applications. That same year, the company launched Tay, a doomed early public experiment in conversational understanding. More recently, Microsoft has partnered closely with OpenAI and experimented with integrating ChatGPT into the Bing search engine.
Snorkel’s work on foundation models
Snorkel and researchers associated with Snorkel actively pursue ways to make foundation models more usable and more widely understood. Snorkel’s work on these topics is ongoing, but here’s a sample.
Data-centric Foundation Model Development: Bridging the gap between foundation models and enterprise AI
In November 2022, Snorkel announced an update to the Snorkel Flow platform that incorporated three features built upon foundation models: foundation model fine-tuning, Warm Start, and Prompt Builder. Prompt Builder allows users to construct labeling functions using plain-text prompts delivered to foundation models. Warm Start uses foundation models to automatically create an initial collection of labeling functions at the push of a button, and foundation model fine-tuning lets users easily build customized versions of GPT-3, CLIP, and other foundation models.
In 2022, Snorkel researchers and our academic partners published seven papers on foundation models, including a 161-page deep dive into the promise and peril presented by these incredibly potent, but poorly understood, tools. Other papers devised strategies to coax greater performance from foundation models – sometimes while simultaneously decreasing their size and cost.
In this case study, we show how Snorkel customer Pixability used foundation models to greatly speed up the development of its content categorization model. Using Snorkel Flow’s Data-centric Foundation Model Development workflow, Pixability was able to build an NLP application in less time than it took a third-party data labeling service to label a single dataset. The workflow allowed Pixability to scale the number of classes its model could distinguish to more than 600 while raising model accuracy above 90%.
In January 2023, Snorkel held its one-day Foundation Model Virtual Summit, bringing together 12 presenters and over 600 attendees at 10 virtual sessions. The event drew registrants from many sectors, including the tech industry, healthcare, and financial services.
Reflections on Foundation Models
This blog post from Stanford Human-Centered Artificial Intelligence follows up on the founding of Stanford’s Center for Research on Foundation Models with a deep think on how foundation models and their training data should be handled going forward.
What are foundation models?
This brief explainer gives IBM’s take on the history of foundation models and their enterprise applicability, with a particular focus on their Watson portfolio.
Foundation Models: The future isn’t happening fast enough — Better tooling will make it happen faster
Venture capital firm Madrona offers its opinion on the promise and peril of foundation models. In their take, only a handful of applications built upon foundation models have seen success so far, and insufficient tooling presents a big stumbling block preventing more firms from building successful applications on top of FMs.
A ‘New World’ for Enterprise AI
Foundation Capital outlines their view that ChatGPT represents a “Netscape moment” for enterprise AI. The post also describes six layers for the modern AI stack.
Generative AI is here: How tools like ChatGPT could change your business
This post from McKinsey QuantumBlack delves into the enterprise value companies could get from generative AIs, including specific example use-cases across marketing, sales, operations, IT, legal and human resources functions.