AI-human teams and the future of work
The rewards will go to people with both deep expertise and the social awareness to recognize their own limitations
Hi folks!
It’s been a minute. Today I want to talk about some exciting new research on AI and teamwork – some by me, and some by an all-star Cybernetic team of coauthors.
Some people think that the AGI apocalypse is coming for office jobs sooner rather than later. I’m skeptical, but if it does happen, it will mean that AI agents have figured out much more than how to write TPS reports. They’ll also need to navigate endless meetings, including some that should have been an email.
I have spent much of my academic career studying teamwork, and I’m very excited about the potential of human/AI collaboration. In short, I think AI’s going to be a great teammate because it can smooth out our rough edges, allowing each of us to specialize in what we do best.
Measuring human leadership skill with AI agents
This morning Ben Weidmann, Yixian Xu and I released a new paper called “Measuring Human Leadership Skill with AI Agents”. This is the newest paper in our overall agenda at the Harvard Skills Lab to create scientifically rigorous, easily usable measures of soft skills like teamwork, leadership, and decision-making.
In a large pre-registered lab experiment, human leaders work with AI agents to solve problems. We then compare the leaders’ performance with AI agents to “ground truth” estimates of their leadership skill in human teams. The ground truth estimates are based on a method that Ben and I developed in prior peer-reviewed work, which involves repeated random assignment of leaders to different groups of teammates and careful controls for baseline skills in the task.
Groups with one leader and three teammates (either humans or AI agents) worked on a series of puzzles called Hidden Profile problems.1 Each member of the team is given some – but not all – of the information required to solve the problem, plus lots of irrelevant context. The challenge is to surface important information through dialogue. Team members can only talk to the leader, but the leader can talk to anyone. The leader is also responsible for submitting answers and managing the time. Good leaders ask questions and manage the conversation, surfacing useful information and discarding the rest.
Below is a visual overview of the experiment. Each participant completes some individual assessments, then is randomly assigned to complete either the human arm or the AI arm first (to control for learning and order effects). They complete 6 group puzzles in each condition.
Here is an overview of how group problem-solving works and a screenshot of the answer submission screen:
Some leaders are better than others
Since we observe leaders solving puzzles with many different teammates, we can figure out how much leaders actually matter. If performance bounces around equally for everyone, maybe puzzle difficulty is much more important than individual leadership skill.
That is not what we find. Leaders matter – a lot. A good leader (1 SD better than average, roughly the top 15 percent) solves 53% of hidden profile problems, while a bad leader (roughly the bottom 15 percent) solves only 10%. More than half of the variance in group performance can be explained by the identity of the leader alone. Interestingly, leaders matter just as much with AI teammates as they do with humans (i.e. we cannot reject that the magnitude of leader effects is the same in the two conditions).
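To see how repeated random assignment can separate leader skill from noise, here is a toy simulation – purely illustrative numbers and a simplified estimator, not the paper's data or method. Each simulated leader has a latent skill; averaging scores across many randomly drawn groups recovers the share of variance attributable to the leader:

```python
import random
from statistics import mean, pvariance

random.seed(0)

# Toy model (not the paper's data): each leader has a latent skill,
# and each puzzle outcome is that skill plus teammate/puzzle noise.
n_leaders, n_puzzles = 200, 6
leaders = [random.gauss(0, 1) for _ in range(n_leaders)]
scores = [[s + random.gauss(0, 1) for _ in range(n_puzzles)] for s in leaders]

# Estimated leader effect = mean score across that leader's random groups.
leader_means = [mean(row) for row in scores]

# Share of total variance attributable to leader identity, correcting
# the between-leader variance for sampling noise in the means.
within = mean(pvariance(row) for row in scores)
between = pvariance(leader_means) - within / n_puzzles
share = between / (between + within)
print(round(share, 2))  # around 0.5 given this 50/50 skill-vs-noise setup
```

With real data the estimator is more careful (controls for baseline skill, puzzle difficulty, and so on), but the intuition is the same: if leaders didn't matter, the between-leader variance would shrink toward zero.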
After accounting for measurement error, the correlation between leader effects with human teammates and leader effects with AI teammates is 0.81, with a 95% confidence interval of 0.72 to 0.88. That’s super high! For context, I asked GPT-4o to give me some examples of real-world phenomena that are correlated around 0.8 at the individual level. Here’s what it came up with:
Exercise frequency and cardiovascular fitness (VO2 max)
Years of education and individual income
Smartphone ownership and social media usage
Study time and academic achievement
Height and blocks per game in the NBA
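The “accounting for measurement error” step follows a standard psychometric idea: divide the observed correlation by the geometric mean of the two measures’ reliabilities (Spearman’s disattenuation formula). The numbers below are made up for illustration, not taken from the paper:

```python
from math import sqrt

def disattenuate(r_obs: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction: the observed correlation divided by the
    geometric mean of the two measures' reliabilities."""
    return r_obs / sqrt(rel_x * rel_y)

# Illustrative only: an observed correlation of 0.65 with
# reliabilities of 0.8 on each measure implies a true correlation of
print(round(disattenuate(0.65, 0.8, 0.8), 2))  # 0.81
```

The correction matters because each leader's effect is estimated from a finite number of noisy puzzles, which attenuates the raw correlation between the two conditions.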
What predicts being a good leader?
The figure below presents a scatterplot of the correlations between leadership skill and various traits. IQ, decision-making skills, and emotional perceptiveness are all decently strong predictors of leadership skill. Age, gender, and education are not. Traits that fall close to the 45-degree line are equally predictive with people and with AI agents. Because performance with other people and with AI agents is so highly correlated, the same traits predict leadership skill in both samples.2
All the results above were pre-registered, meaning we submitted a document declaring how we would conduct the study and which tests we would use before we actually ran the experiment. We also conducted some exploratory analyses, mostly trying to understand what was happening under the hood. Interestingly, pure volume of communication does not predict leadership performance. However, good leaders asked more questions, engaged in more conversational “turn-taking”, and were more likely to use plural pronouns like “we” and “us”. This echoes existing research on good leadership, which gives us some validation.
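As a rough illustration of the kinds of transcript features involved – question rate and plural-pronoun use – here is a minimal sketch. The feature definitions and example messages are mine, not the paper’s:

```python
import re

PLURAL_PRONOUNS = {"we", "us", "our", "ours"}

def leader_style_features(messages):
    """Crude conversational features of the kind the exploratory
    analysis examines: question rate and plural-pronoun rate."""
    n_questions = sum(m.strip().endswith("?") for m in messages)
    words = [w for m in messages for w in re.findall(r"[a-z']+", m.lower())]
    n_plural = sum(w in PLURAL_PRONOUNS for w in words)
    return {
        "question_rate": n_questions / len(messages),
        "plural_pronoun_rate": n_plural / max(len(words), 1),
    }

feats = leader_style_features([
    "What does your report say about scale color?",
    "Great, so we can rule out the Yellowfish.",
    "Let's pool what we know before we answer.",
])
print(feats)  # question_rate 1/3, plural_pronoun_rate 3/24
```

A real analysis would use proper tokenization and model turn-taking explicitly, but this conveys the flavor: the signal is in how leaders talk, not how much.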
Why go to the trouble of measuring leadership skills with AI agents?
Most existing tests of leadership skills are pen-and-paper questionnaires that ask people to reflect on their own capabilities. We prefer performance-based metrics – can you be a good leader or teammate when we put you in a real group? It seems logical that if you want to measure someone’s teamwork skills, you need to put them on an actual team, not just give them a test.
Our method repeatedly randomly assigns people to teams and measures whether their performance is reliable across many different teammates. It works really well and is scientifically validated (we published the methods paper in Econometrica, and Nobel laureate Guido Imbens was the editor). But it’s a pain in the neck to get done. You have to bring a bunch of people into a lab and hold their attention long enough to get them to work with lots of different randomly assigned groups.
(We did this work in a lab because – surprise! – most companies weren’t too excited about letting us randomly assign their employees to different work teams.)
It’s much easier to use AI agents as teammates, because you can take the test from the comfort of your laptop. We have built a short performance-based test of leadership that anyone can take by themselves, using the power of ChatGPT.
AI-human collaboration and comparative advantage
These days, office work is very social – according to O*NET data, about 80% of U.S. employment is in jobs where teamwork is a “very” or “extremely” important part of the job. AI agents will have to collaborate with people if they are going to be economically useful.
We find that collaboration with AI agents is a lot like human collaboration. Therefore, existing studies of teamwork can help us learn about the benefits of integrating AI into existing workflows.
It’s first worth asking – why does teamwork exist at all? The economic benefits of teamwork derive from the theory of comparative advantage, which David Ricardo developed in 1817 to explain how two countries could mutually benefit from free trade. If England is better at making cloth and Portugal is better at making wine, they should specialize and trade, allowing each country’s citizens to have more wine and more cloth than would otherwise be available (a lesson our country’s current leadership seems to have forgotten).
The same analogy holds for teamwork. Work requires many different tasks (writing, making presentations, talking to clients, analyzing budgets, etc), and teams can get more done if individual workers specialize in their expertise and “trade tasks” with their teammates.
Ok, so if teamwork is so great, why isn’t the economy just one big team? Because coordination is costly. If you and I are going to work on a project together, we need to meet, figure out what we want to get done, and divvy up tasks. Meetings are the price we pay to realize the benefits of having specialized skills. Sometimes the price is too high, which explains why meetings can occasionally feel like a huge waste of time even though we all agree they are necessary.
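Ricardo’s logic is easy to check with made-up numbers. Suppose a human is more productive at analysis and an AI teammate at drafting; specializing beats splitting time evenly at both tasks (all figures below are hypothetical):

```python
# Toy Ricardo-style example: output per hour at two tasks.
output = {
    "human": {"analysis": 4.0, "drafting": 2.0},
    "ai":    {"analysis": 3.0, "drafting": 6.0},
}

# Each works one hour, splitting time evenly with no specialization...
solo = {task: 0.5 * output["human"][task] + 0.5 * output["ai"][task]
        for task in ("analysis", "drafting")}

# ...versus each specializing in their comparative advantage.
specialized = {"analysis": output["human"]["analysis"],
               "drafting": output["ai"]["drafting"]}

print(solo)         # {'analysis': 3.5, 'drafting': 4.0}
print(specialized)  # {'analysis': 4.0, 'drafting': 6.0}
```

Specialization produces more of both outputs – but only if the coordination cost of dividing the work is smaller than the gains from trade.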
Social skills - lowering coordination costs to realize the benefits of teamwork
In a 2017 paper called “The Growing Importance of Social Skills in the Labor Market”, I made the case that social skills are economically valuable because they lower the cost of coordinating with others. People who have good social skills do a better job of “fitting in” to different teams and understanding where they add value. Being a good team player requires you to understand your own skills, but also the strengths and weaknesses of your teammates. If I am writing a paper with one of my graduate students, we’ll arrive at a division of labor that is quite different than if I write a paper with my advisor (Larry Katz) or one of my peers. Good social skills require you to understand others, and to understand yourself.
We measure social skills with a test of emotion perception called the Reading the Mind in the Eyes Test (RMET).3 RMET has a long, distinguished history in psychology. The RMET predicts who will be a good team player in our Econometrica paper and in both the human and AI samples in the study discussed today. In a related paper, we’ve shown that people who are well calibrated – meaning they know how good they are at a task – benefit much more from an AI-based decision aid.
The bottom line is that skills like self-awareness and social perceptiveness are important predictors of who can work well with AI. Once AI agents become sophisticated enough to join workplace teams (it’s a matter of when, not if), I believe they will be adopted very rapidly, because the coordination costs of working with AI are probably much lower than working with humans. The AI agents will be available 24/7, they speak every language, and they never get tired or complain.
What about gains from task trade? A consistent finding from early studies of generative AI is that it levels the playing field by helping novices much more than experts. Think of AI as a teammate that is above average at many things but truly expert in nothing (at least not yet). That suggests that the people who benefit most will be those who have deep expertise in a few high-value tasks and who can work well with AI to cover their remaining weaknesses.
Early evidence on AI “Cybernetic” teammates
I was excited to read this recent paper by an all-star team of coauthors called “The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise” because its findings are very consistent with my armchair theorizing about the impact of AI on teamwork.
They randomly assigned workers at Procter & Gamble, a global consumer goods company, to product development conditions: some individuals worked alone, some in two-person teams, and some received AI assistance. Participants were given current business problems identified by senior leaders (e.g. “how do we motivate new customers to try product X?”) along with lots of market information and other data, and their solutions were blind-graded by experts. Here is what they found:
AI noticeably increased the quality of solutions, and individuals working with AI outperformed the human-only teams. When the authors split the data between participants with commercial and technical backgrounds, they found that individuals working alone tended to produce “unbalanced” solutions that favor their own expertise. AI assistance made that distinction disappear: both types of workers produced more balanced solutions (Figure 6 in the paper).
Finally, although there was no difference between individuals and teams when both had AI, the team + AI arm was significantly more likely to come up with solutions that scored in the top 10% (Figure 9).
Overall, this clever experiment shows that AI can be an effective teammate by providing above-average performance in the tasks where workers have little or no expertise. To me, this suggests that deep expertise combined with the social skills and self-awareness to know your own weaknesses is a winning combination in the AI-fueled labor market of the future.
Hidden profile problems are commonly used in other studies and online experiments. One example problem asks the team to classify a rare fish and diagnose its illness. Some clues are public, and some are revealed only to one team member. Some clues rule out specific options (e.g. Report #7 indicates the fish does not have yellow scales, so it’s not the Yellowfish), while others are just distractors (e.g. Report #5 indicates that the Greenfish migrate to access seasonal food resources). We created new problems that weren’t in the AI’s training data, and added features like creating a distinct role for the leader and introducing probabilistic answers (e.g. it’s not the Yellowfish so it’s 50% likely to be either Greenfish or Redfish).
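The probabilistic-answer mechanic in the footnote can be sketched in a few lines – assuming (my assumption, not necessarily the paper’s scoring rule) that probability is spread uniformly over the species a clue hasn’t ruled out:

```python
def update_candidates(candidates, ruled_out):
    """After a clue rules some species out, spread probability
    evenly over the remaining candidates (uniform-posterior rule)."""
    remaining = [c for c in candidates if c not in ruled_out]
    return {c: 1 / len(remaining) for c in remaining}

# Report #7: no yellow scales, so it's not the Yellowfish.
posterior = update_candidates(["Yellowfish", "Greenfish", "Redfish"],
                              ruled_out={"Yellowfish"})
print(posterior)  # {'Greenfish': 0.5, 'Redfish': 0.5}
```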
Eagle-eyed readers will notice that emotional perceptiveness is a little bit more predictive in the human sample than in the AI agent sample. That makes sense, but the difference itself isn’t statistically significant, so we don’t emphasize it too much.
Ben and Yixian recently developed a modern version of the RMET that uses generative AI tools to create faces that can express a wider range of emotions and intensities and can be made representative of different races and ethnicities. We use that test – called PAGE (Perceiving AI-Generated Emotions) – in this most recent paper.