While large language models (LLMs) have claimed the spotlight since the debut of ChatGPT, BERT language models have quietly handled most enterprise natural language tasks in production.
As foundation models, LLMs like GPT-4 and Gemini are trained on internet-scale text datasets and excel at a wide range of tasks. While the performance of these models is impressive, so too are their computational costs. These models are expensive to run, even for inference, and can severely tax resource-constrained enterprise workloads.
Thus, it’s important to remember that the latest and greatest in LLM tech is built upon years of prior research, and many of the previous generation of models, especially Google’s BERT, still provide great performance at a lower cost.
Additionally, while the data and code needed to train some of the latest generation of models are still closed-source, open source variants of BERT abound. Enterprise data science teams can adapt these BERT transformer models quickly and cleanly.
BERT origins and basics
Researchers released Google BERT in October 2018, not long after the seminal Attention is All You Need paper (which introduced the transformer building block for large language models). This makes BERT one of the original LLM architectures. It’s also one of the simplest.
BERT’s architecture heavily utilizes transformer layers to achieve excellent performance on a range of tasks. It is an encoder-only architecture, meaning that input text is only encoded into a finite-dimensional vector representation, and never decoded back into text.
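To make the idea of a fixed-size vector representation concrete, here is a minimal sketch of mean pooling, one common way to collapse an encoder's per-token vectors into a single vector. The token vectors and the hidden size of 3 are made-up illustration values (not real BERT output), and mean pooling is an assumption here; taking the vector of the special [CLS] token is another common choice.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the vectors of real (non-padding) tokens into one fixed-size vector."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("one mask entry is required per token vector")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Hypothetical encoder output: 4 token positions, hidden size 3.
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 4.0],
    [9.0, 9.0, 9.0],  # padding position, ignored via the mask
]
attention_mask = [1, 1, 1, 0]

sentence_vector = mean_pool(token_vectors, attention_mask)
print(sentence_vector)  # [2.0, 2.0, 2.0]
```

Whatever the input length, the output has the same number of dimensions, which is what lets downstream components treat it as an ordinary feature vector.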
BERT can handle a variety of modeling tasks, including:
- Question answering
- Natural language understanding
- Text classification
- Sentiment analysis
- Masked word prediction
Unlike larger natural language models, such as GPT-3, Google BERT performs poorly on machine translation. Translation typically requires the model to encode text from the source language and then decode it into the target language using a trained decoder module, which BERT lacks.
Still, the simplicity of BERT is a marked benefit. Popular LLMs such as PaLM 2 and GPT-4 require complex distributed systems of GPUs for inference and fine-tuning. In contrast, BERT training pipelines often fit on modern laptops, and data scientists can fine-tune a variety of BERT derivatives to adapt them to new tasks through transfer learning.
BERT’s NLP advantages
BERT offers several advantages relative to other LLMs. It has achieved widespread adoption in both industry and research, which has encouraged researchers and data scientists to publish a variety of pre-trained BERT models as well as comprehensive tutorials. BERT also excels at several tasks that are common in enterprises, namely text classification, data labeling, and ranking and recommendation.
Text classification and representation
What BERT does well, it does really well. A BERT model is often the go-to for text classification and representation. BERT excels at building vector representations for text, and those representations can then be used in a variety of downstream tasks.
BERT shines in semi-supervised data labeling. Data scientists who need data to train a complex model can use pre-trained BERT models to predict labels for unlabeled data.
For example, a pre-trained BERT model equipped with a classification layer can provide sentiment analysis labels. A data scientist can then use these labels to train a smaller classification model and deploy it in an enterprise pipeline. This lets enterprise data science teams build accurate models faster, without waiting for and relying on human-annotated data.
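The labeling workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference implementation: `toy_teacher` is a hypothetical keyword-based stand-in for a fine-tuned BERT sentiment classifier, and the 0.9 confidence threshold is an arbitrary choice.

```python
from typing import Callable, List, Tuple


def pseudo_label(
    texts: List[str],
    teacher: Callable[[str], Tuple[str, float]],
    min_confidence: float = 0.9,
) -> List[Tuple[str, str]]:
    """Keep only the (text, label) pairs the teacher is confident about."""
    labeled = []
    for text in texts:
        label, confidence = teacher(text)
        if confidence >= min_confidence:
            labeled.append((text, label))
    return labeled


def toy_teacher(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for a BERT classifier returning (label, confidence)."""
    lowered = text.lower()
    if "great" in lowered:
        return "positive", 0.97
    if "terrible" in lowered:
        return "negative", 0.95
    return "positive", 0.55  # low confidence: filtered out below


unlabeled = ["Great battery life", "Terrible support", "It arrived on Tuesday"]
training_set = pseudo_label(unlabeled, toy_teacher)
print(training_set)
# [('Great battery life', 'positive'), ('Terrible support', 'negative')]
```

Filtering on confidence trades coverage for label quality; the surviving pairs become training data for the smaller downstream model.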
Ranking and recommendation
Because BERT models produce high-quality text representations, those representations are a natural choice of input for ranking and recommendation services. By computing similarities between representations, data scientists can use BERT vectors to rank objects such as products or user reviews in e-commerce settings. Google, for example, uses BERT to rank search results, and developers have also built BERT-based systems to recommend products in Amazon’s marketplace.
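As a rough illustration of similarity-based ranking, the sketch below scores items by cosine similarity to a query vector. The three-dimensional vectors and product names are invented for the example; real BERT representations would have hundreds of dimensions and come from the model itself.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: List[float], items: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings standing in for BERT text representations.
product_vectors = {
    "coffee maker": [0.0, 0.2, 0.9],
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.3, 0.1],
}
query_vector = [1.0, 0.2, 0.0]

ranking = rank_by_similarity(query_vector, product_vectors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors the two footwear items rank above the coffee maker, because their vectors point in nearly the same direction as the query.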
BERT models are much smaller than the current generation of LLMs, allowing them to be trained on single GPUs and sometimes even laptops. Furthermore, using a machine learning technique called knowledge distillation, researchers created smaller versions of BERT, such as DistilBERT, which retain most of BERT’s performance with a fraction of the parameter count. Some of these models can even be run on embedded devices and phones.
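For readers unfamiliar with knowledge distillation, here is a minimal sketch of the generic soft-target loss at its core: the student is penalized for diverging from the teacher's temperature-softened output distribution. This is a simplified illustration, not DistilBERT's exact training objective, and the logits and temperature of 2.0 are made-up values.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: List[float], student_logits: List[float], temperature: float = 2.0
) -> float:
    """KL divergence from the softened teacher distribution to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


teacher = [4.0, 1.0, 0.5]         # hypothetical teacher logits for 3 classes
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # prefers a different class

print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss is zero when the student matches the teacher exactly and grows as the two distributions diverge. Practical setups often scale this term by the squared temperature and combine it with the ordinary task loss; both are omitted here for brevity.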
For reference, here are some popular large language models ranked by parameter count:
- GPT-4 (~1 trillion, an unconfirmed estimate; OpenAI has not published the figure)
- GPT-3 (175 billion)
- Llama (65 billion)
- T5 (11 billion)
- Alpaca (7 billion)
- GPT-2 (1.5 billion)
- BERT-Large (340 million)
- BERT-Base (110 million)
- DistilBERT (66 million)
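To put these parameter counts in hardware terms, the sketch below estimates how much memory the weights alone occupy. It assumes 4 bytes per parameter (32-bit floats) and ignores activations, optimizer state, and any quantization, so treat the results as rough lower bounds rather than real memory requirements. GPT-4 is left out because its size is not publicly confirmed.

```python
from typing import Dict

# Parameter counts taken from the list above.
PARAMETER_COUNTS: Dict[str, int] = {
    "GPT-3": 175_000_000_000,
    "Llama": 65_000_000_000,
    "T5": 11_000_000_000,
    "Alpaca": 7_000_000_000,
    "GPT-2": 1_500_000_000,
    "BERT-Large": 340_000_000,
    "BERT-Base": 110_000_000,
    "DistilBERT": 66_000_000,
}


def weights_size_gb(num_parameters: int, bytes_per_parameter: int = 4) -> float:
    """Size of the raw weights in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


for name, count in PARAMETER_COUNTS.items():
    print(f"{name:<11} {weights_size_gb(count):>8.2f} GB")
```

Under these assumptions BERT-Base needs about 0.44 GB for its weights, while GPT-3 needs about 700 GB, which is why the former fits on a laptop and the latter needs a multi-GPU cluster.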
BERT’s computational efficiency enables accelerated development and deployment. Data scientists can train or fine-tune DistilBERT and similarly compressed BERT models in hours rather than days or weeks. Data teams can often fine-tune distilled BERT variants using comparatively small amounts of in-house data and far exceed the performance of simpler models.
BERT language models: not built for generation
One caveat of BERT compared to other LLMs is that it is not designed to handle text generation.
While it’s not strictly impossible to generate text using BERT, it’s not straightforward due to its bidirectional architecture. Bidirectional in this context means the model is trained to predict masked-out words in a sentence using the context on both sides of each mask: the words before it and the words after it.
Generative AI LLMs typically perform language modeling only in the forward direction. Researchers and developers have built demonstrations using BERT for text generation, but doing so requires awkward software architectures and produces lower-quality output than the current generation of generative text models.
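The difference between bidirectional and forward-only context can be shown with a toy example. The sketch below lists which tokens a model may consult when predicting one position; the helper function and the whitespace-split sentence are illustrative assumptions, not part of any library.

```python
from typing import List


def visible_context(tokens: List[str], position: int, bidirectional: bool) -> List[str]:
    """Return the tokens a model may look at when predicting tokens[position]."""
    left = tokens[:position]
    right = tokens[position + 1:] if bidirectional else []
    return left + right


tokens = ["the", "cat", "sat", "on", "the", "mat"]
target = 2  # the position of "sat"

bert_style_view = visible_context(tokens, target, bidirectional=True)
gpt_style_view = visible_context(tokens, target, bidirectional=False)

print(bert_style_view)  # ['the', 'cat', 'on', 'the', 'mat']
print(gpt_style_view)   # ['the', 'cat']
```

A model trained with the first view expects words to the right of the target, but during left-to-right generation those words do not exist yet. That mismatch is one reason generating text with BERT is awkward.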
However, BERT can also be used as a “helper” when training true text generation models. For example, in this paper, researchers from Microsoft and Carnegie Mellon used BERT as the teacher in a student-teacher setup for training a sequence-to-sequence text generation model.
Hugging Face and similar libraries support using BERT for text generation, although architectures designed specifically for this task, such as GPT-2, are recommended instead.
BERT LLM: The chameleon model
BERT is truly chameleon-like in its capabilities, adapting to a wide variety of tasks and settings.
Data scientists typically adapt BERT through what are often called neural adaptation layers or, more colloquially, heads. A task-specific head allows a user to take a BERT base model and adapt it to a given task. Each task typically has its own type of adaptation layer.
- Linear layers (with optional softmax) are often used as the head in classification settings to output raw logit scores (or probabilities) for each class.
- Sequential layers such as LSTMs are useful as heads for tasks such as summarization and translation.
- Linear layers combined with an output softmax are useful for language modeling and question answering.
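As an illustration of the first kind of head, the sketch below applies a linear layer and a softmax to a pooled text vector. The weights, bias, and four-dimensional input are made-up values chosen so the arithmetic is easy to follow; in practice the weights are learned during fine-tuning and the input has the encoder's full hidden size.

```python
import math
from typing import List


def linear(features: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One logit per class: the dot product of a weight row with the features, plus a bias."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Turn raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical pooled encoder output (hidden size 4) and a 3-class head.
pooled = [0.5, -1.0, 2.0, 0.0]
weights = [
    [1.0, 0.0, 1.0, 0.0],   # class 0
    [0.0, 1.0, 0.0, 1.0],   # class 1
    [-1.0, 0.0, 0.5, 0.0],  # class 2
]
bias = [0.0, 0.0, 0.0]

logits = linear(pooled, weights, bias)
probabilities = softmax(logits)
predicted_class = max(range(len(probabilities)), key=probabilities.__getitem__)

print(logits)           # [2.5, -1.0, 0.5]
print(predicted_class)  # 0
```

The head is tiny compared with the encoder beneath it, which is part of why swapping heads to target a new task is cheap.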
The BERT documentation published by Hugging Face includes a full discussion of the variety of BERT modeling heads and their uses.
Building a classification model
Here we walk through a simplified version of using BERT’s NLP abilities to build a text classification model.
- Gather the training data, including examples of each of the text classes.
- Tokenize the data, using one of Hugging Face’s BERT-specific tokenizers.
- Download a pre-trained BERT model with a new, not-yet-trained classification head. This is the code you can use with Hugging Face:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
- Fine-tune the model on your classification data, perhaps using the Hugging Face Trainer class.
- Evaluate your model on a held-out validation set.
- Rinse and repeat as needed.
You can get a more detailed look at this process in the Hugging Face fine-tuning tutorial.
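The data-splitting and evaluation steps in the list above do not depend on any particular model, so they can be sketched in plain Python. Everything here is illustrative: the 50 synthetic examples, the 20% validation fraction, and `placeholder_predict`, a hypothetical stand-in for the fine-tuned model's prediction function.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, class label)


def train_validation_split(
    examples: Sequence[Example], validation_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Shuffle reproducibly, then hold out a fraction of examples for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def accuracy(predict: Callable[[str], int], examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


# Synthetic data with 5 classes, matching num_labels=5 in the snippet above.
examples = [(f"review {i}", i % 5) for i in range(50)]
train_set, validation_set = train_validation_split(examples)


def placeholder_predict(text: str) -> int:
    """Hypothetical model stand-in that always predicts class 0."""
    return 0


print(len(train_set), len(validation_set))  # 40 10
print(accuracy(placeholder_predict, validation_set))
```

Keeping the validation examples out of fine-tuning is what makes the accuracy figure a fair estimate of how the model will behave on new text.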
Google BERT usage in the real world
While the current generation of LLMs is gaining enterprise adoption, BERT has already achieved ubiquity in real-world contexts.
Google integrated BERT into Google Search, using it to surface highly accurate results for almost every English query. Similarly, a technical blog post from Wayfair shows how they use BERT to glean insights from unstructured text such as customer product reviews and feedback. BERT is also used to mine sentiment in financial documents, with one of the world’s largest technology investors, Prosus, using it to guide investment decisions. Additionally, variants of BERT have been fine-tuned for legal, scientific, and biomedical applications.
In fact, you’d be hard-pressed to find an enterprise domain to which the BERT language model hasn’t been applied.
Data scientists and researchers have built many data-and-task-specific BERT variants. Here are a few of the most popular ones:
- DistilBERT (compact, efficient, distilled version of BERT)
- SciBERT (trained on scientific texts)
- BioBERT (trained on biomedical text)
- BigBird (designed to model longer sequences)
- FlauBERT (for French language modeling)
- SqueezeBERT (an efficient form of BERT using convolutional layers)
- MobileBERT (designed specifically to run on phones and other mobile devices)
- HerBERT (for Polish)
- BERTweet (for understanding tweets)
All of these models, and many more, are available for free download on the Hugging Face Model Hub.
BERT: the workhorse LLM
Due to its low computational requirements, easy-to-understand architecture, and the wide availability of open source fine-tuned models, BERT is an excellent choice of large language model for enterprises. It can be used in a variety of niche domains and optimized for performance across a range of tasks, from text classification to question answering to language representation, and more!