In The News

Why The Future Of Generative AI Lies In A Company’s Own Data

Published: October 17, 2023

Summary

While large language models (LLMs) have become widely accessible, building a truly valuable generative AI tool requires more than off-the-shelf parts. Because competitors can use the same models, proprietary data is crucial for creating a sustainable competitive advantage.

To leverage proprietary data effectively, businesses can employ three strategies:

Retrieval augmentation: Enrich prompts with relevant information from internal resources.
Fine-tuning: Customize the LLM’s output for specific tasks using carefully curated prompts and responses.
Self-supervised pre-training: Build a custom LLM from scratch using proprietary data.
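
As an illustration of the first strategy, the following is a minimal sketch of retrieval augmentation in Python. Everything in it is an assumption made for illustration: the sample documents, the function names, and the simple word-overlap scoring, which stands in for the embedding-based search a production system would more likely use. It shows only the core idea of selecting relevant internal text and placing it in the prompt.

```python
import re
from collections import Counter

# Hypothetical internal documents; a real system would read these
# from a company's own knowledge base.
DOCUMENTS = [
    "Refund policy: customers may return hardware within 30 days of delivery.",
    "Support hours: the help desk is staffed on weekdays from 9am to 5pm.",
    "Shipping: standard orders leave the warehouse within two business days.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Count how many query tokens also appear in the document."""
    query_counts = Counter(tokenize(query))
    document_tokens = set(tokenize(document))
    return sum(n for token, n in query_counts.items() if token in document_tokens)


def retrieve(query, documents, k=1):
    """Return the k documents with the highest overlap score."""
    ranked = sorted(documents, key=lambda doc: score(query, doc), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Enrich the user's question with the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Use the context below to answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    print(build_prompt("What is the refund policy for hardware?", DOCUMENTS))
```

The resulting prompt string would then be sent to whichever LLM the business uses; that call is left out because it depends on the provider.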

Each of these strategies often requires significant data labeling effort. By carefully curating and preparing that data, organizations can put their proprietary information to full use and build a durable AI moat.
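
To make the curation step concrete, here is a small hedged sketch of preparing prompt and response pairs for fine-tuning. The field names `prompt` and `response`, the length threshold, and the JSON Lines output are assumptions chosen for illustration, not a format required by any particular fine-tuning service.

```python
import json


def curate(examples, min_response_chars=20):
    """Drop empty, too-short, and duplicate prompt/response pairs."""
    seen_prompts = set()
    kept = []
    for example in examples:
        prompt = example.get("prompt", "").strip()
        response = example.get("response", "").strip()
        # Skip pairs with no prompt or a response too short to be useful.
        if not prompt or len(response) < min_response_chars:
            continue
        # Skip prompts already seen, ignoring case.
        key = prompt.lower()
        if key in seen_prompts:
            continue
        seen_prompts.add(key)
        kept.append({"prompt": prompt, "response": response})
    return kept


def to_jsonl(examples):
    """Serialize the curated pairs as one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


if __name__ == "__main__":
    # Hypothetical raw examples gathered from internal sources.
    raw = [
        {"prompt": "Summarize the refund policy.",
         "response": "Hardware can be returned within 30 days of delivery."},
        {"prompt": "summarize the refund policy.",
         "response": "Duplicate prompt with a different answer, dropped."},
        {"prompt": "What are support hours?", "response": "9-5"},
        {"prompt": "", "response": "A response with no prompt is dropped too."},
    ]
    print(to_jsonl(curate(raw)))
```

In practice the filtering rules would be set by domain experts; the point is only that curation is an explicit, repeatable step rather than a one-off cleanup.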

