Using generative AI to learn is like Odysseus untying himself from the mast
Are we solving a technological problem, or an agency problem?
I’ve been thinking a lot lately about how AI will affect classroom learning. The answer isn’t obvious, because it depends on how the technology is used. Technology has been in the classroom for a long time (remember One Laptop per Child?).
But unlike simple laptop or internet access, AI enables personalized learning at scale. With traditional web search the inputs are personalized, but the outputs are not. You can type anything you want into a Google search bar, and it will give you a ranked list of webpages containing the information you are seeking. The list depends on your exact query, but the webpages you click through to see look the same to you as they do to everyone else.
With generative AI, both the inputs and the outputs are fully personalized. Every time you ask ChatGPT a question, you get a unique response that reflects your conversation history and what the chatbot knows about you. The personalization is what makes AI feel like magic. And yet personalization also creates temptation. Generative AI tools are so flexible that you can ask them anything, and they’ll never tell you to stop messing around and get back to work.
A vivid illustration of our divided self comes from a famous behavioral economics paper called “Tying Odysseus to the Mast: Evidence from a Commitment Savings Product in the Philippines”. The authors found that customers flocked to, and greatly benefited from, a bank product that prevented them from accessing their own savings in the future, just as Odysseus had himself tied to the mast of his ship so that he would not be tempted by the alluring song of the Sirens.
The Sirens are often portrayed as sexual temptresses in art and popular culture. But Homer never describes the Sirens’ bodies or gives any sense of their physical allure. Here is a translated excerpt from the 12th book of the Odyssey (emphasis mine) – “For never yet has any man rowed past this isle in his black ship until he has heard the sweet voice from our lips. Nay, he has joy of it, and goes his way a wiser man. For we know all the toils that in wide Troy the Argives and Trojans endured through the will of the gods, and we know all things that come to pass upon the fruitful earth.”
The Sirens offer Odysseus the promise of unlimited knowledge and wisdom without effort. He survives not by resisting his curiosity, but by restricting its scope and constraining his own ability to act on it. The Sirens possess all the knowledge that Odysseus seeks, but he realizes he must earn it. There are no shortcuts. This is the perfect metaphor for learning in the age of superintelligence.
Personalization is incredibly effective for learning
Nearly all the great thinkers and aristocrats of antiquity were tutored intensively from a very young age. Many of the most successful charter and other non-traditional schools invest in high-dosage small-group tutoring for students who fall behind. Tutoring works. The problem is that it’s very expensive to maintain a 1:1 student-teacher ratio. So teachers of larger classes target instruction to the median level of academic readiness, which leaves some students lost and others bored.
That’s why tracking works so well, despite its negative connotations. In a large randomized experiment in Kenya, 60 schools assigned students to classes at random, while the other 60 split them by a baseline test score (the top half in one class, the bottom half in the other). Test scores were higher in the “tracked” schools for all students, including the low achievers, and the benefits persisted after the experiment concluded. As the researchers showed, this was because the narrower range of academic readiness in each classroom let teachers get closer to meeting each student’s specific learning level and pace. Tracking is an intermediate solution on the pathway to 1:1 tutoring, i.e. full personalization.
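For the code-minded, here is a minimal sketch of that assignment rule; the roster and field names are invented for illustration, not taken from the Kenya study.

```python
# A minimal sketch of the "tracking" rule described above: split a roster at the
# median baseline score so each section spans a narrower range of readiness.
# The data and field names are hypothetical.

def track_by_baseline(students):
    """Split students into two sections by baseline test score."""
    ranked = sorted(students, key=lambda s: s["baseline_score"])
    midpoint = len(ranked) // 2
    lower_section = ranked[:midpoint]   # bottom half of the score distribution
    upper_section = ranked[midpoint:]   # top half of the score distribution
    return lower_section, upper_section

roster = [
    {"name": "A", "baseline_score": 41},
    {"name": "B", "baseline_score": 73},
    {"name": "C", "baseline_score": 55},
    {"name": "D", "baseline_score": 88},
]
lower, upper = track_by_baseline(roster)
print([s["name"] for s in lower])  # ['A', 'C']
print([s["name"] for s in upper])  # ['B', 'D']
```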
A study of a pre-AI personalization technology called computer-assisted instruction (CAI) found huge positive impacts on learning.1 The researchers randomly offered middle school students in India access to an educational program called Mindspark, an adaptive learning platform that dynamically assesses students’ mastery of material and tailors subsequent content to remediate the gaps in their understanding. They estimate that 90 days of Mindspark increased math and Hindi test scores by 0.59 and 0.36 standard deviations respectively. These are some of the largest effect sizes I’ve ever seen, and the authors argue it is because of personalization. They gave students a baseline test and found that the typical classroom spans 5-6 grade levels in terms of prior academic preparation. Teachers can’t possibly meet everyone’s needs under those circumstances, and so personalization makes a huge difference.
Just this year the authors conducted a large-scale replication of Mindspark, scaling up the original study more than 10x over an 18-month period. The effect sizes were a bit less than half the size of the first experiment (~0.2 standard deviations) but still massive. Students who spent a full 18 months in the program gained an average of 1.7 grade levels in Math and 2.1 grade levels in Hindi. They estimate that Mindspark increased learning productivity (i.e. achievement gains per unit of learning time) by 50-66 percent. Personalization offers the promise of a massively better and more cost-effective educational experience in schools all around the world.
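To make the units concrete, here is a small sketch of how a standardized effect size and a learning-productivity comparison are computed; the numbers below are made up for illustration, not taken from the Mindspark data.

```python
# Illustrative arithmetic for the two quantities used above.
# All numbers are hypothetical, not the actual Mindspark results.

def effect_size(treatment_mean, control_mean, control_sd):
    """Standardized effect size: score gain relative to the control-group spread."""
    return (treatment_mean - control_mean) / control_sd

def productivity_gain_pct(treatment_gain, treatment_hours, control_gain, control_hours):
    """Percent increase in achievement gain per hour of learning time."""
    treatment_rate = treatment_gain / treatment_hours
    control_rate = control_gain / control_hours
    return 100 * (treatment_rate / control_rate - 1)

# A 6-point gain against a 10-point control standard deviation is a 0.6 SD effect.
print(effect_size(treatment_mean=56, control_mean=50, control_sd=10))  # 0.6

# 1.5 grade levels gained in the same study time as a 1.0 grade-level gain
# is a 50 percent increase in learning productivity.
print(productivity_gain_pct(treatment_gain=1.5, treatment_hours=100,
                            control_gain=1.0, control_hours=100))      # 50.0
```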
So why is personalized AI such a mixed bag?
Because personalization is so important for learning, and AI is so good at personalization, you might expect AI to be a big boon to classroom learning. The best evidence I’ve seen suggests a more complicated story. AI-powered personalization can lead to distraction and cognitive offloading.
A recent paper in the Proceedings of the National Academy of Sciences, descriptively titled “Generative AI without guardrails can harm learning”, illustrates the issue well. The authors conducted a large, randomized experiment in math classes in a Turkish high school in Fall 2023, when GPT-4 was a frontier model. Students first sat through a standard lecture introducing a topic. Next, they participated in an assisted practice session with varying forms of assistance. Finally, they took an exam with no assistance or resources available.
The randomization occurred exclusively in part two, the practice sessions. The control group worked on problems with only their course books and notes. The first treatment group (called GPT Base) was given “out-of-the-box” access to GPT-4 via a laptop. The second treatment group (GPT Tutor) made two important modifications to GPT Base. First, the model was instructed not to give students the answer directly, similar to programs like Khanmigo and ChatGPT study mode that act as patient, Socratic tutors. Second, the teachers fed GPT Tutor specific information about the practice problems, including correct solutions, common mistakes, and recommendations for feedback.
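To make those two modifications concrete, here is a rough sketch of how a guardrailed tutor might be wired up using the OpenAI chat API; the prompt wording, problem metadata, and field names are my own invention, not the study’s actual implementation.

```python
# A rough sketch of a "GPT Tutor"-style setup: a system prompt that forbids
# direct answers, plus teacher-supplied problem metadata. Everything here is
# illustrative; it is not the implementation used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

problem_info = {
    "statement": "Solve for x: 3x + 7 = 22",
    "solution": "x = 5",
    "common_mistakes": "Dividing by 3 before subtracting 7 from both sides.",
    "feedback_tips": "Ask the student which operation undoes the +7 first.",
}

system_prompt = (
    "You are a patient, Socratic math tutor. Never state the final answer. "
    "Guide the student with questions and hints, one step at a time.\n"
    f"Problem: {problem_info['statement']}\n"
    f"Correct solution (for your reference only): {problem_info['solution']}\n"
    f"Common mistakes: {problem_info['common_mistakes']}\n"
    f"Feedback tips: {problem_info['feedback_tips']}"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "I got x = 29/3. Is that right?"},
    ],
)
print(response.choices[0].message.content)
```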
Students engaged with both AI tutors very intensively, sending multiple messages per problem, and both increased student performance in the practice sessions. However, students in the GPT Tutor treatment arm did not do any better on the unassisted exam, and the GPT Base group performed significantly worse than the control group. AI without guardrails harmed learning.
They asked students in each group how they thought they did. Students using GPT Base had similar perceptions of their own performance as students in the control group, even though they did worse. Students using GPT Tutor thought they did much better, even though they didn’t. Students thought AI was helping them learn, but instead it was a crutch, because it was doing the thinking for them.2
AI as an exoskeleton
The right way to think about the impact of AI on learning comes from the highly descriptive title of another very nice paper – “GenAI as an Exoskeleton”.3 The authors study Boston Consulting Group (BCG)’s effort to train nearly a thousand management consultants to write code in Python and perform advanced data science tasks. Half of them were given access to ChatGPT, and the other half were not. The workers using ChatGPT did way better, nearly meeting the performance standards of actual BCG data scientists. But when asked a set of data science questions in a post-experiment survey where ChatGPT was unavailable, the workers in the treatment group performed no better than the others. Taking the AI away eliminated workers’ temporarily boosted data science skills, like Iron Man taking off his suit.
This isn’t unique to AI. A study from more than a decade ago found that advancements in autopilot technology had dulled Boeing pilots’ cognitive and decision-making skills much more than their manual “stick and rudder” skills. The researchers put the pilots in a flight simulator, turned the autopilot off, and studied how they responded. The pilots who had stayed alert while the autopilot was still on were mostly fine, but the ones who had offloaded the work and were daydreaming about something else performed very poorly. The autopilot had become their exoskeleton.
So what is a good way to use AI for learning? One hopeful sign comes from a very recent paper showing that AI can help with long-run learning if you first do the work yourself. The study evaluates a tutoring program that activates AI assistance only after a user first submits a solution. I don’t know if that’s the right approach, but it seems like a good start.
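Here is a minimal sketch of that “attempt-first” gate, with names I have made up for illustration: the AI’s feedback only unlocks once the student has committed to an answer of their own.

```python
# A minimal sketch of "attempt-first" AI assistance: feedback is withheld until
# the student submits their own solution. Names and structure are hypothetical.

class AttemptFirstTutor:
    def __init__(self, ai_feedback_fn):
        self.ai_feedback_fn = ai_feedback_fn   # any callable that critiques an attempt
        self.attempts = {}                     # problem_id -> student's submitted work

    def submit_attempt(self, problem_id, student_work):
        """Record the student's own solution; this is what unlocks AI help."""
        self.attempts[problem_id] = student_work

    def get_ai_feedback(self, problem_id):
        """Return AI feedback only if the student has already tried the problem."""
        if problem_id not in self.attempts:
            return "Submit your own solution first, then I can comment on it."
        return self.ai_feedback_fn(self.attempts[problem_id])

tutor = AttemptFirstTutor(ai_feedback_fn=lambda work: f"Let's look at your reasoning: {work}")
print(tutor.get_ai_feedback("p1"))  # nudges the student to try first
tutor.submit_attempt("p1", "x = 5 because 3*5 + 7 = 22")
print(tutor.get_ai_feedback("p1"))  # now feedback is available
```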
Learning is hard work. And there is now lots of evidence that people will offload it if given the chance, even if it isn’t in their long-run interest. After nearly two decades of teaching, I’ve realized that my classroom is more than just a place where knowledge is transmitted. It’s also a community where we tie ourselves to the mast together to overcome the suffering of learning hard things.
Shortly after ChatGPT was released, Sam Altman famously tweeted “wait better: ChatGPT is like an e-bike for the mind”. I presume he was trying to one-up Steve Jobs’ famous quote that computers should be like a bicycle for the mind. The implication is that e-bikes are better than regular bikes, because they can help you travel even farther and faster. But if the goal is exercise rather than a destination, you may not want the pedal assist.
One of the coauthors (Alejandro Ganimian) is my former student. Go Ale!
This study happened way back in 2023, the dark ages of AI. GPT Base sometimes got problems wrong and misled students, which could explain the negative impact on performance. A hopeful implication is that today’s frontier AI models would do much better. However, I don’t find that persuasive, because the GPT Tutor was fed the correct answers in advance as well as a highly customized diet of tips and tricks for students. It was about as helpful as it could possibly be, yet it still had no impact on performance after it was taken away.
The actual paper title is “GenAI as an Exoskeleton: Experimental Evidence on Knowledge Workers Using GenAI on New Skills”, but I think the shorter title goes much harder.

