If your LinkedIn feed is anything like mine, you saw an endless stream of posts on Tuesday extolling the capabilities (and potential) of Meta’s new Llama 3.1 family of models – and with good reason. We finally have an open model that can best OpenAI’s GPT-4o on several standard benchmarks. It’s monumental. So, now what?

If you have the hardware and/or budget to run a fine-tuned Llama 3.1 405B (let’s call it Larry), more power to you! Larry may be out of reach for most as a primary LLM: it requires ~800GB of GPU memory and, via an inference provider, costs ~18x more than the 8B model. However, that doesn’t mean the rest of us can’t take advantage of it.

Sensei Larry

In terms of LLM adoption, sophistication, and maturity, I see RAG as phase one and fine-tuning/alignment as phase two. First, you stand up a RAG pipeline, likely with an OOTB LLM. Next, you fine-tune the LLM to improve accuracy. In this context, I see distillation as phase three. I won’t go into the details, but the idea is to distill a large model into a small one by using the reasoning and responses of a foundation model to curate training data for the smaller model. It’s a teacher/student approach. You can learn more about our approach to LLM distillation here.

Larry is on par with GPT-4 and may soon become the preferred teacher for training small LLMs to excel at specific tasks. In fact, we could (and perhaps should) have Larry teach Llama 3.1 8B models which are specialized for different tasks. After all, a specialized 8B model doesn’t need to generalize as well as Larry. It simply needs to match Larry’s capabilities for a specific task.
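At its simplest, distillation-style data curation looks like this: have the teacher answer a set of prompts, then save the (prompt, response) pairs as fine-tuning data for the student. The sketch below stubs out the teacher with a canned lookup – in practice, `teacher_generate` would call a self-hosted Larry or an inference provider, and the record format here (chat-style `messages`) is just one common convention, not a requirement.

```python
import json

# Hypothetical stand-in for Llama 3.1 405B ("Larry"). In practice this would
# be a call to your inference provider or self-hosted endpoint.
def teacher_generate(prompt: str) -> str:
    canned = {
        "What is distillation?": "Distillation trains a small model on a large model's outputs.",
        "Why use a 405B teacher?": "A strong teacher produces higher-quality training labels.",
    }
    return canned.get(prompt, "I don't know.")

def build_distillation_set(prompts):
    """Curate (prompt, response) pairs from the teacher to fine-tune a student."""
    return [
        {"messages": [
            {"role": "user", "content": p},
            {"role": "assistant", "content": teacher_generate(p)},
        ]}
        for p in prompts
    ]

records = build_distillation_set(["What is distillation?", "Why use a 405B teacher?"])

# Write JSONL that a typical fine-tuning job can consume.
with open("distillation_train.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

For a task-specific 8B student, the prompt set would be scoped to that one task – which is exactly why the student doesn’t need to generalize as well as the teacher.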

We may not be able to hire Larry as a full-time sensei in our AI dojo, but we can bring him in as a guest sensei for a lesson here and a lesson there.

Judge Larry

Larry is not only a great source for distillation; it’s also a great candidate for LLM-as-a-judge. In last week’s webinar, Snorkel unveiled a new approach to evaluating LLMs for enterprise use cases – as in, evaluating specialized LLMs and/or RAG pipelines on domain-specific tasks. I won’t go into all the details (you can watch it here), but it’s more or less a three-step process.

  1. SME accepts or rejects sample responses → ground truth
  2. LLM accepts accurate and complete responses → acceptable response candidates
  3. Label functions (encoded SME knowledge) reject violations → accepted responses

It’s not practical to have an SME (or team of SMEs) manually accept or reject thousands of responses. However, when it comes to ground truth, SMEs are the gold standard.

As for whether thousands of responses are accurate and complete, an LLM is the best tool available. We can use one to create a baseline – specifically, to identify responses that may be acceptable.

Note the “may”: an accurate and complete response is not necessarily an acceptable one. For example, an LLM may accept a response that nonetheless violates a corporate policy by recommending a competitor or by being curt and unhelpful.
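Steps two and three can be sketched in a few lines: an LLM judge nominates candidates, then label functions encoding SME knowledge reject violations. Everything here is a stand-in – the judge is a trivial heuristic where a real pipeline would call Larry, and the policy checks (no competitor mentions, no curt answers) are hypothetical examples of encoded SME rules.

```python
# Hypothetical policy: never recommend a competitor. The name is illustrative.
COMPETITORS = {"AcmeCorp"}

def llm_judge(question: str, response: str) -> bool:
    """Step 2 stand-in: flag responses that look accurate and complete.
    A real pipeline would prompt an LLM (e.g., Larry) here."""
    return len(response.split()) >= 5

def lf_no_competitors(response: str) -> bool:
    """Label function: reject responses that mention a competitor."""
    return not any(c.lower() in response.lower() for c in COMPETITORS)

def lf_not_curt(response: str) -> bool:
    """Label function: reject one-liners that read as curt/unhelpful."""
    return len(response.split()) >= 5

LABEL_FUNCTIONS = [lf_no_competitors, lf_not_curt]

def accept(question: str, response: str) -> bool:
    """Step 2 then step 3: judge nominates, label functions veto."""
    if not llm_judge(question, response):
        return False  # not accurate/complete enough to be a candidate
    return all(lf(response) for lf in LABEL_FUNCTIONS)
```

So a fluent answer that recommends AcmeCorp clears the judge but is vetoed by a label function – exactly the accurate-but-unacceptable case described above.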

OpenAI’s GPT-4 has been the best judge available. However, with Larry, we now have an open model that’s just as capable (if not better), and can be self-hosted or accessed via the inference provider of your choice. If you need an LLM-as-a-judge, ask for Larry.

Baby Larry

Of course, the best option may be to do both. While using Larry to judge another LLM is less expensive than using Larry as the primary LLM, it’s not free. However, what if we had Larry teach a smaller Larry (Baby Larry) to judge responses? Then we could fine-tune Baby Larry to judge responses from a specific AI copilot or assistant. Why stop there? Let’s assume this copilot or assistant is powered by a fine-tuned and aligned Llama 3.1 8B model. We’ll call it Lil’ Larry.

What an amazing way to take advantage of a powerful family of open models.

  1. Fine-tune and align Lil’ Larry on domain-specific tasks
  2. Distill Larry into Baby Larry with a focus on judging LLM responses
  3. Fine-tune Baby Larry to judge Lil’ Larry’s responses specifically

The end result is the power of Larry at the cost of Lil’ Larry.
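At serving time, the three steps above collapse into a simple loop: Lil’ Larry answers, Baby Larry screens the answer, and only accepted responses reach the user. Both models are stubbed here – the function names, the retry count, and the escalation fallback are all assumptions for illustration, not a prescribed architecture.

```python
from typing import Optional

def lil_larry(prompt: str) -> str:
    """Stand-in for the fine-tuned 8B assistant model."""
    return f"Here is a detailed answer to: {prompt}"

def baby_larry_judge(prompt: str, response: str) -> bool:
    """Stand-in for the distilled, fine-tuned 8B judge: accept only
    responses that address the prompt and aren't curt."""
    return prompt in response and len(response.split()) >= 5

def answer(prompt: str, max_retries: int = 2) -> Optional[str]:
    """Generate with Lil' Larry; Baby Larry gates each attempt."""
    for _ in range(max_retries + 1):
        response = lil_larry(prompt)
        if baby_larry_judge(prompt, response):
            return response
    return None  # escalate to a human (or to Larry himself)
```

Larry never appears at inference time – his influence is baked into both 8B models through distillation and fine-tuning, which is where the “power of Larry at the cost of Lil’ Larry” comes from.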

Larry for president, 2028!

Conclusion

Meta’s new Llama 3.1 family of models is a monumental achievement. And with Larry (405B), we have an open model on par with OpenAI’s GPT-4. However, it’s one thing to get excited about something new. It’s another to act on it.

We can’t wait to see how innovators will take advantage of this opportunity.

And yes, I just finished watching the first part of the final season of Cobra Kai.