Former U.S. Chief Data Scientist on past and future of data science
Former U.S. Chief Data Scientist DJ Patil recently sat down with Alex Ratner, Co-Founder and CEO of Snorkel AI. The two held a wide-ranging conversation covering topics including what it was like to be the first U.S. Chief Data Scientist, and where the title “data scientist” came from. A transcript of their conversation follows, edited for brevity and clarity.
Alex Ratner: I want to start with a basic question. You were the first Chief Data Scientist of the United States. That sounds awesome. Tell us a little bit about how that all started, and what you think was the most exciting accomplishment from serving in the position.
DJ Patil: I think the first thing that we need to acknowledge is that the U.S. has had an incredible use of data throughout history.
“The U.S. has had an incredible use of data throughout history.”
DJ Patil
George Washington was really focused on how to think about mapping. He was a cartographer, in addition to all the other things that he did. You had the census, which was hard-written into the Constitution to make sure that the United States was collecting population data. In the 1930s, President Franklin Delano Roosevelt made tremendous efforts to change how we think about our economy in terms of data.
That was really the beginning of the data ecosystem, and there has been a US Chief Statistician since 1939. While that has been a very important role, it’s gradually become burdened with many other duties. We also have a National Economic Council and the Council of Economic Advisors, who do tons of work with America’s data.
What President Obama saw happen during his campaigns was the growing use of data in novel ways, predominantly aimed at engagement in the community in order to get people out to vote. When Obama entered office, one of the first positions he created was the US CTO, or Chief Technology Officer for the United States, with responsibility for how we think about technology and all the technological and data tools that are now becoming so important so quickly. For the issues that we’re talking about today, like for example with AI, the CTO is the point person for those things.
The first two Chief Technology Officers, first Aneesh Chopra and then Todd Park, focused on opening up data. They wanted to find ways to release federal data so people could use it. There’s so much data that’s out there that can be utilized, from weather data and census data to healthcare data and economic forecasts. As the third CTO, Megan Smith, was taking office, the administration realized that somebody had to continuously ensure that the office of the presidency had access to the best data as well as to the best available ways to use it. In the time since then, the head of the Office of Science and Technology Policy, the CTO, and the US Chief Data Scientist have all evolved to suit this need. They have different but interlocking roles, much like the National Security Council has many different experts for different kinds of issues.
“When Obama entered office, one of the first positions he created was the US CTO, or Chief Technology Officer for the United States.”
DJ Patil
That purview turned into a mission statement, which is to responsibly unleash the power of data to benefit all Americans. And, it is super timely given all the conversations around AI right now, because the first part of that statement is responsibly unleash the power of data, to benefit all Americans—everyone, collectively. These are very deliberately chosen words.
But, if you take this job, where do you start? The basic rule we established was that, in order to be under the CTO’s purview, a given issue has to meet one of three criteria: impact more than $1 trillion of US spending, impact at least 50 percent of the US population, or help a population that has no other recourse.
What are the problems that meet those criteria? Healthcare, certainly. It represents 20 percent of U.S. GDP and includes costs, access to care, and more. For example, the CTO researches ways to enable more precision medicine, the nuanced ethical issues surrounding tailored genomic treatments, and the “moonshot” of finding a cure for cancer.
Another problem that meets the criteria is criminal justice reform. The CTO has a voice in, for instance, how we think about body cameras as well as the privacy concerns around the data from those cameras. What happens if someone attempts to alter images and footage? When victims are in the footage, how do we control who has access to the data?
Alex Ratner: That’s fascinating. I hadn’t thought about what it must have been like to show up and have all of the problems that you could potentially approach and have the burden of choosing which ones to focus on.
DJP: When you’re at the White House, the straightforward problems have all been solved by somebody else already. The problems that make it near the Oval Office are usually pretty complex and nuanced.
“When you’re at the White House, the straightforward problems have all been solved by somebody else already.”
DJ Patil
AR: Just focusing on the terms data science and data scientist—what do you think is powerful about that term? Does it accentuate what you think needs to be done differently under a data-science mandate, versus, say, a domain expert or a statistician or an economist?
DJP: There have been many origin stories of data scientists, and I would argue that many of the people who’ve been doing data science for a long time have been forgotten. Historically, groups ranging from the Mayan people of the Yucatán Peninsula to World War II code breakers were all doing data science. The more modern version perhaps began at places like Facebook and LinkedIn a decade or so ago, where we were using data as a front-facing product for the consumer for essentially the first time.
Some of these features may seem obvious now: the “people you may know” feature, “who viewed my profile,” job recommendations, and so on. But the technology to support these things didn’t really exist at the time. We were just starting to see open-source databases. The notion of data democratization was new.
At LinkedIn, when we looked at potential titles for our teams, “research scientists” was the predominant one that was put forward by Yahoo. They were doing some of that earliest work, but often the research scientist was doing something more academic and they never got pulled into the product team’s world.
The insight we had at LinkedIn was that data can’t be just a back-office team. It can’t be beholden to product. It is a front office team. It owns its own profit and loss. It is responsible for engagement. It has designers, it has engineers, it has everything. And it lives and dies based on its ability to execute on the company’s mission.
That was very different from how almost every other place approached the question. As Facebook set up its team and as I was doing the same at LinkedIn, we shared a lot between us, as well as what was happening with the Yahoo and Google teams. None of us had great names for ourselves. But Monica Rogati had this great idea when we were playing around with the list of job candidates. She asked, “why don’t we test every job title on LinkedIn, and let’s see what people apply to.” Everyone applied to the “data scientist” one.
What we realized, I think, is that the term data scientist won out because it is ambiguous. That prevents you from being put in a box. If you’re the “research scientist” you might say “this problem is too practical, go away.” The “business intelligence” person is mostly building dashboards and looking up SQL queries. But “data scientist” is like “nerd,” in that it can give someone purview over a lot of different but interlinked problems.
“I think that the term data scientist won out because it is ambiguous. That prevents you from being put in a box… ‘Data scientist’ is like ‘nerd,’ in that it can give someone purview over a lot of different but interlinked problems.”
DJ Patil
The important point is that data science is a team sport. It’s never just one individual, from a skillset perspective. Our team was able to be the “glue,” if you will, between the economists, the statisticians, the scientists at NIH, the people at NASA, etc. This is why I think the data science major has taken off at universities. It is the one horizontal that is interdisciplinary, that supports the whole aspect of a university. You have license to go and pull multiple departments together, and that is the same phenomenon we are seeing with the rapid development of AI.
I believe our collective focus should be on what I call “MIPs”— these massive, multi-interdisciplinary problems. That work is what’s going to change the game for our country and the broader landscape.
Alex Ratner: My experience is much narrower, but I love the interdisciplinary aspect and the ambiguity and the flexibility it affords.
The interdisciplinary nature of data science is something we’ve encountered a lot at Snorkel. Going back to the first DARPA project that we worked on at Stanford, the early open-source research Snorkel project was DARPA Simplex. It was wonderful because it paired a data science or ML/AI team with a subject matter expert, an SME team. I believe that pairing was one of the most fundamental determinants of success.
If you want to be a good data scientist, you need depth in the core data science tenets and you need depth in at least one subject matter area on your team. It is critical for what data scientists have achieved and for what they might potentially achieve moving forward.
Snorkel builds a data-centric platform for data scientists, and the most successful teams are anchored on a real, front-facing business or front-facing objective, rather than to a specific role. They have the freedom to do whatever they want.
DJP: They can be problem-focused, rather than people-focused.
AR: Exactly. If I need to do some research, I’ll do it because it is necessary to solve the problem at hand. If I need to do some statistics, I’ll do that too. It’s just problem-oriented.
DJP: One issue with trying to restrict “data scientists” to other terms is that we’re getting hung up on our own egos. The world doesn’t care. The world cares if we solve a problem. If you look at where the money is going in the United States, in China, in Europe, it’s not going to a university chemistry department, or to math or physics. Everything requires a multidisciplinary team.
If we are going to address climate change, for instance, we can’t rely on only “energy” people. We have to bring broad teams together. If you’re working in AI, you should not be thinking about it as if you “only” do AI. The speed of change is too great. You should be thinking broadly, across a bunch of disciplines and categories, and you should have deep knowledge in at least two or three subjects.
AR: I couldn’t agree more. The most exciting things that I’ve seen both in our team’s work and in that of parallel teams have been anchored on a problem, or a set of problems, first. People are often scared of doing this because they think it’s going to be too applied, rather than a fancy new method. But that’s where all the new meaningful things get invented. It’s the same thing with startups. My view is that you only make real progress when you anchor on a problem rather than on a role or a discipline.
“My view is that you only make real progress when you anchor on a problem rather than on a role or a discipline.”
Alex Ratner, Snorkel AI
DJP: I’ll share a quick story here. Once, when we were in the Chief Data Scientist’s office, there were a lot of questions swirling about the risks of AGI [Artificial General Intelligence]. The cliche was, “is Skynet going to take over the world?” President Obama asked us for our assessment.
When we came back with our assessment, the president said, “I think you guys are missing the point.”
We felt we knew what we were talking about, of course, but he said, “you need to go back and look harder at this, because this change is going to be much more massive than I think you realize.”
We walked out asking, why is the president of the United States—the person who’s spending all his time thinking about how to make sure the world works—calling us out on this? We took a harder look at the question and ourselves and realized we might be missing the bigger picture. And that led to the first-ever report that we wrote as the National AI Strategy.
How did the president see this when the rest of us, who are technical experts, missed it? I think it was because we were too deep in our particular subjects to see the wider set of issues.
AR: From a United States standpoint, what do you think has been most surprising about what’s happened in the field of AI/ML in the last year or two and even in the last couple of months? What is currently most underappreciated about what’s ahead for us?
DJP: Well, here’s what we got wrong in that first AI strategy report. We thought AI would disrupt mostly low-paying jobs (truck and delivery drivers, retail and service workers, certain kinds of manufacturing), the jobs that have historically been both dangerous and poorly compensated.
“Here’s what we got wrong in that first AI strategy report: We thought AI would disrupt mostly low-paying jobs… So far those are not the jobs that are being disrupted. The potential we see with large language models, though, has raised questions about how we might think about some higher-paying professional jobs.”
DJ Patil
But so far those are not the jobs that are being disrupted. The potential we see with large language models, though, has raised questions about how we might think about some higher-paying professional jobs. The legal profession, for one example, seems ripe for intervention from AI. In radiology, despite what many assumed about the role AI would play in radiological imagery, we’re not even close to that disruption, because it is hard to change an established professional culture and ecosystem.
The other part we missed at first is that we as a country are not investing enough. This is a “Sputnik” moment. We later collectively wrote a report for the Council on Foreign Relations about this. This is a moment that requires an ethos of “go.” We need to be treating this as a national effort to go all in on what AI/ML can and should be for us.
That need for rapid investment, in turn, raises fundamental questions around ethics and support. Right now, we know that other nation-states are going “all in.” We know that this could be one of the most transformative times in history, and we need to make sure that, even as we do this full speed, we work to make the systems and other technologies we’re building operate with the value sets that we believe are critical for the future and for the legacy we leave for future generations. We need to ensure that this technology works for us rather than against us. That is an extraordinarily difficult problem.
And, for those saying the answer is simply creating a new regulatory agency, it is useful to remember that it took 12 years from the first detonation of a nuclear warhead to the creation of the IAEA [International Atomic Energy Agency]. We do not have 12 years to figure this one out.
AR: It’s interesting to think about what we are currently getting wrong. Throughout the relatively brief history of AI, we’ve made a number of significant misestimations of what AI would be good or bad at.
For example, we thought that the types of tasks that are generally more difficult for humans, like logistics planning, complex math, and playing chess, were going to be the grand challenge for AI. We thought the things that were easy for us, like simply seeing and perceiving the world around us, were going to be, essentially, a summer project. But now computer vision is an entire field that we’re still trying to get right. It makes you think about what we are still not seeing.
“We as a country are not investing enough. This is a ‘Sputnik’ moment.”
DJ Patil
DJP: We surely have a lot of blind spots right now.
One of the biggest lessons I took from serving in government was that you need a very holistic team doing this work. So often, when I was working on an issue at the White House, somebody on the team would say, “have you thought about it this way?” And I’d say, “I don’t even know what that means. Can you help me understand that?”
If you have a diverse team with diverse perspectives and expertise, you realize that problems are much richer than you might have believed. We need that kind of diversity of perspective, given the complexity of the problems and of the solutions that are required to actually have a material impact on society.
AR: There’s a theory about why some of the greatest innovators have actually made their biggest achievements in a field other than the one they started in, and why cross-functional groups often achieve such great things. When people go from one field to another, they ask much more basic questions that, when you’re just in one narrowly conceived field, you’ve long ago skipped over. But then, when someone asks them, you come to realize you need to dig more.
You said that we as a country are under-investing in AI and that we need to move faster. How would you see that investment happening? Is it one big central effort? Is it corporate control or is it open? Is it distributed into lots of specialist AI and data science efforts, or is it a single mega-model or mega-institution? How do you think we should proceed at moving faster?
DJP: I try to think of all this in an analogy about calories. Your body has only so many calories. If you were to run, you could only go so far based on the calories that are in your system. But, if you have a relay team, you get more distance, because you have more net calories to burn.
If you think of the net number of “calories” that are going into working on AI right now, it’s unprecedented! Even before the dot-com bust there was nothing like this in terms of the number of people dedicated to working on one thing. What we haven’t done effectively, however, is bring all those “calories” and the AI/ML community together to work on the really key problems.
“If you think of the net number of ‘calories’ that are going into working on AI right now, it’s unprecedented!”
DJ Patil
We should work to agree on the five or six biggest problems on which we could all get aligned. One that I’ve suggested, in the multi-interdisciplinary problem area that I am focused on, is the question of how AI intersects with national security. A second is the ways AI and data science intersect with the life sciences. How might AI impact things like medical treatments, drug clinical trial design, cancer treatments, and pandemic response strategies? Then you could include climate change. You could include food scarcity. These are the big problems that AI could engage.
AR: It really highlights the importance of having a problem-centric formulation. Problem-centric funding, problem-centric anchoring, etc.
DJP: What would it take to put our biggest institutions together to attack these problems? If humanity depended on it, wouldn’t it be a good idea to bring, for instance, Merck and Google together to address some existential challenge? I’m just picking on those two at random, but that is what we saw happen during the COVID-19 pandemic, for example. You saw Apple and Google working together on apps to figure out how to create exposure notifications. You saw people working to get folks to the right testing facilities. The whole COVID-tracking project was all volunteer data scientists.
Data scientists, AI people, whatever we want to call ourselves, I see us as a new kind of “first responder.” We have the ability to help the people who are on the ground to be much more effective.
AR: One other hot-button question has been: if we set up the funding, the structure, the orientation in this interdisciplinary, problem-centric way, do you think that the right organizing framework is a smaller number of centralized players that are regulated from the top down, or— given the way a lot of data science has flourished in the recent past—is the better approach through open source, with many players leveraging unfettered access to data and resources?
“Federal policy can never be a scalpel.”
DJ Patil
This is at the core of the debate about responsibility. Do we get that through openness or do we get it through being more closed, in a sense?
DJP: First, we need to recognize that federal policy can never be a scalpel. It’s much more of a hammer. When you have something that’s very nascent, you want a scalpel, not a hammer. The better policy style, then, is to create guardrails as the new thing evolves.
Take, for example, the Genetic Information Non-Discrimination Act, or GINA. If you read that today you’d ask, when did people write this and why hasn’t it been updated? HIPAA [Health Insurance Portability and Accountability Act], for another example, is a confusing law or set of laws. It is there to protect your healthcare data. But these laws don’t get regularly or appropriately updated over time.
Once you set down a law, you shouldn’t expect it to get updated for 30-plus years. That has real material ramifications for society.
“Once you set down a law, you shouldn’t expect it to get updated for 30-plus years. That has real material ramifications for society.”
DJ Patil
Our “North Star,” so to speak, should be: How do we ensure AI technology works for us rather than against us? Some of that needs to be self-policing and regulations—because some people will abuse the technology. In other words, there’s a reason that when we take Tylenol, we trust that it isn’t counterfeit. When someone unfortunately needs chemotherapy, we never question whether the medication they received was legitimate. And that’s because our process is good.
Maybe certain categories of AI systems need to be held to higher standards than most other things. That’s okay.
Taking an even bigger step back, how do we create the world we all hope to live in? To do that we need to have really good people talking a lot and sharing what works and how they think. We need more people thinking, espousing, and workshopping ideas and asking ourselves, how do we solve these problems as a “we,” rather than as a few people with a narrow political or ideological perspective? It needs to be community-driven, and that community cannot be only tech people. It must include those who will be impacted. They have to have a seat at the table.
We must remember the data points have names. We love to talk about “parameters.” We forget about the names and the people those parameters represent. When we get out of our nice, cozy data labs or offices, we get out into the world and see real people.
I’ve spent a lot of time in healthcare. We have a healthcare company we’ve built, Devoted Health, where our mission is to take care of people who are very sick. A lot of people say, “hey, give them a wearable device, give them something fancy.”
Do you know what they need? They need air conditioners because they live in environments where the humidity inside is causing mold. They need exterminators, they need food, they need help with loneliness and social isolation. If we go back to core needs—food, clothing, water, shelter— and move up from there, that’s a good approach.
“We must remember the data points have names. We love to talk about ‘parameters.’ We forget about the names and the people those parameters represent.”
DJ Patil
What I would encourage in anybody who wants to work on these problems is to not be shy about getting out of your comfort zone and into the environment to get that context. Then you can take all your skills and run at it hard with all this technology, because then you will have a much better appreciation for the human dimensions of those problems.
Alex Ratner: One last question. You were instrumental in President Obama’s executive order making U.S. data more open. A big focus for the data science community is how the engineering of the data, and data access, is make-or-break for AI, even more so than some of the fancy downstream things that we like to teach in data science about tweaking models and algorithms. That’s the point of this data-centric idea.
From that experience and from others as well, what are some of the lessons you’ve learned about not just opening up data but also making it usable for data science and machine-learning techniques?
DJP: First, why open the data? It’s not the federal government’s data, it’s your data. It is our data. If it’s hidden behind a wall, that breaks the promise of that data. Of course it should be opened in a responsible way and we need to make sure it’s got all the right controls and guardrails. But fundamentally, we should put that data out there so people can build on it. Personally, my career got started using open-source data.
We should be figuring out how to open up more data. How do we collect data that is going to be helpful and that gives us clarity on issues? By opening up data and being able to share, for example, “this is how much an MRI costs here, versus here,” then somebody might be able to use that data to help people who need MRIs to get to the best place. I can pull up a dozen ways to figure out where the cheapest flight is, but I still can’t figure out where to get the right healthcare.
“It’s not the federal government’s data, it’s your data. It is our data.”
DJ Patil
That’s where entrepreneurship happens. Government can play its part by opening up data or providing the seed capital that creates a spark. But then it’s our job to come in and make this all work to the benefit of everyone. That is the social contract of innovation that we should have as a society.
AR: I couldn’t agree more. Thank you so much for taking the time today. It was awesome to get to chat with you.