The case for humanities skills in the age of LLMs: an evidence brief

A while back, Alex Karp, CEO of Palantir, said AI “will destroy humanities jobs,” which has become a standard line among weird know-nothing billionaire Silicon Valley fascists who are somehow unfathomably rich yet still resentful of adjunct professors.

He's wrong.

Large language models don't diminish the value of critical reading, precise language, and analytical thinking; they dramatically increase the economic returns to these skills. A growing body of rigorous evidence from Harvard, MIT, Stanford, Wharton, and NBER demonstrates that LLMs are powerful but brittle tools whose value depends entirely on the human capacity to prompt them precisely, evaluate their outputs critically, and exercise judgment about when to trust or override them. These are humanities skills. I could argue it directly, but frankly this is the sort of argument a person either believes or doesn't, and pure persuasion is unlikely to change that. So what follows is just a compilation of resources supporting my point: the rise of LLMs makes skills in the humanities more important, not less.


1. Small changes in wording produce dramatic swings in LLM performance

The most direct evidence that precision with language now has measurable economic value comes from several years of prompt sensitivity research. The core finding is stark: the same model can swing from near-chance to near-state-of-the-art accuracy based solely on how a prompt is worded.

Zhao et al. (2021) — Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh, "Calibrate Before Use: Improving Few-Shot Performance of Language Models," Proceedings of ICML, PMLR 139, 2021 — found that the choice of prompt format, training examples, and even their ordering caused GPT-3 accuracy to vary from near-chance to near-SOTA. Their calibration procedure improved accuracy by up to 30 percentage points. Three identified biases — majority label bias, recency bias, and common token bias — persisted across model sizes and with more examples.
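To make the calibration idea concrete, here is a minimal sketch. It assumes a simplified version of the procedure: ask the model for label probabilities on a content-free input such as "N/A", treat that answer as the bias the prompt itself introduces, and divide it out. The probabilities are invented for illustration, and `calibrate` is a hypothetical helper, not code from the paper.

```python
def calibrate(probs, content_free_probs):
    """Divide out the bias measured on a content-free input, then renormalize."""
    scaled = [p / cf for p, cf in zip(probs, content_free_probs)]
    total = sum(scaled)
    return [s / total for s in scaled]

labels = ["positive", "negative"]

# Hypothetical numbers: the prompt alone already pushes the model toward "positive".
content_free = [0.7, 0.3]  # model's label probabilities for a blank input like "N/A"
raw = [0.6, 0.4]           # model's label probabilities for a mildly negative review

calibrated = calibrate(raw, content_free)
raw_label = labels[raw.index(max(raw))]
calibrated_label = labels[calibrated.index(max(calibrated))]

print(raw_label, "->", calibrated_label)
```

In this toy case the uncalibrated answer simply echoes the prompt's built-in lean; once that lean is divided out, the prediction flips. The point is that the wording of the prompt carries its own thumb on the scale.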

Lu et al. (2022) — Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp, "Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity," Proceedings of ACL, pp. 8086–8098, May 2022 — showed that merely reordering the same few-shot examples could swing performance from state-of-the-art to random guessing. Their entropy-based ordering method yielded a 13% relative improvement for GPT-family models across 11 classification tasks. Crucially, good orderings for one model did not transfer to another.
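The ordering effect is easy to sketch. The code below tries every ordering of the same three few-shot examples and scores each one by the entropy of the labels it produces on a small probe set, on the assumption that an ordering which makes the model predict one label almost every time is probably a bad one. `toy_model` is a hypothetical stand-in with a built-in recency bias, not a real LLM, and the scoring is only loosely in the spirit of the paper's entropy-based selection.

```python
import math
from collections import Counter
from itertools import permutations

def label_entropy(predictions):
    """Shannon entropy (in bits) of the predicted-label distribution."""
    counts = Counter(predictions)
    n = len(predictions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def toy_model(ordering, probe):
    """Hypothetical stand-in for an LLM with a recency bias: it echoes the
    label of the last example unless the probe contains an obvious cue word."""
    if "great" in probe:
        return "pos"
    if "awful" in probe:
        return "neg"
    return ordering[-1][1]

examples = [("loved it", "pos"), ("hated it", "neg"), ("a delight", "pos")]
probes = ["great cast", "awful pacing", "it was fine",
          "not sure", "hmm", "awful ending"]

# Score every ordering of the same three examples.
scores = {}
for ordering in permutations(examples):
    predictions = [toy_model(ordering, probe) for probe in probes]
    scores[ordering] = label_entropy(predictions)

best = max(scores, key=scores.get)
worst = min(scores, key=scores.get)
print("highest-entropy ordering ends with:", best[-1])
print("lowest-entropy ordering ends with:", worst[-1])
```

Nothing about the examples changes between orderings; only their sequence does. That is the whole finding in miniature.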

Sclar et al. (2024) — Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr, "Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design," ICLR 2024 — documented that semantically equivalent prompt formats caused performance differences of up to 76 accuracy points on LLaMA-2-13B, with an average sensitivity of ~10 accuracy points across 50+ tasks. This sensitivity did not diminish with larger models, more examples, or instruction tuning. The authors proposed FormatSpread, an algorithm for evaluating performance ranges across equivalent formats.
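The quantity being reported here, the gap between the best and worst of several equivalent formats, can be computed in a few lines. This is only a sketch of the measurement: the predictions are made up, the three format strings are hypothetical, and the authors' actual FormatSpread algorithm searches a far larger space of formats than this brute-force comparison.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def format_spread(accuracy_by_format):
    """Gap between the best and worst format, in accuracy points."""
    values = list(accuracy_by_format.values())
    return (max(values) - min(values)) * 100

gold = ["A", "B", "A", "B", "A"]

# Hypothetical outputs of one model on one task under three
# semantically equivalent prompt formats.
predictions_by_format = {
    "Question: {q} Answer:": ["A", "B", "A", "B", "B"],
    "QUESTION: {q}\nANSWER:": ["A", "B", "B", "A", "B"],
    "Q: {q} || A:": ["A", "B", "A", "B", "A"],
}

accuracy_by_format = {
    fmt: accuracy(preds, gold) for fmt, preds in predictions_by_format.items()
}
spread = format_spread(accuracy_by_format)
print(f"spread: {spread:.0f} accuracy points")
```

A single accuracy number hides this range entirely, which is why reporting the spread across formats is a more honest description of what a model can do.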

Pezeshkpour and Hruschka (2024) — "Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions," Findings of NAACL 2024, pp. 2006–2017 — demonstrated performance gaps of 13% to 75% across benchmarks when multiple-choice answer options were reordered. Even GPT-4, with accuracy exceeding 90%, showed a 13.1% sensitivity gap to option ordering.

Wei et al. (2022): Jason Wei, Xuezhi Wang, Dale Schuurmans, et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models," NeurIPS 2022. They showed that adding a handful of worked examples with step-by-step reasoning to the prompt (chain-of-thought prompting) took PaLM-540B from failing to state-of-the-art on the GSM8K math benchmark (58% accuracy, surpassing fine-tuned GPT-3). The better-known shortcut of simply appending "Let's think step by step" comes from a separate paper, Kojima et al. (2022). This is perhaps the most vivid demonstration that how you ask can matter as much as what model you use. The technique produced up to 18 percentage point improvements on arithmetic tasks.

He et al. (2024) found that GPT-3.5-turbo performance varied by up to 40% on a code translation task depending solely on whether the prompt used plain text, Markdown, JSON, or YAML formatting. GPT-4-32k showed over 300% improvement switching from JSON to plain text on the FIND dataset.

The throughline is clear: LLM output quality is a function of linguistic precision. The ability to frame a question, choose the right words, structure instructions logically, and anticipate how phrasing will be interpreted — these are the core competencies of a humanities education, and they now have direct, measurable effects on AI productivity.


2. Fluent confidence masks frequent errors, demanding critical readers

LLMs produce text that reads as authoritative and well-reasoned even when it is factually wrong, logically flawed, or fabricated entirely. This makes critical reading — the ability to interrogate a text's claims independent of its surface fluency — an essential skill for anyone using these tools.

Lin et al. (2022) — Stephanie Lin, Jacob Hilton, and Owain Evans (University of Oxford / OpenAI), "TruthfulQA: Measuring How Models Mimic Human Falsehoods," Proceedings of ACL, pp. 3214–3252, 2022 — created a benchmark of 817 questions across 38 categories. The best model (GPT-3) was truthful on only 58% of questions versus 94% for humans — a 36-point gap. Larger models were actually less truthful, more convincingly repeating popular misconceptions. GPT-3 produced false but plausible-sounding responses 42% of the time versus 6% for humans.

Dahl et al. (2024): Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E. Ho (Stanford RegLab / Stanford HAI), "Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models," Journal of Legal Analysis, 16(1), pp. 64–93, 2024. They tested over 200,000 legal queries and found hallucination rates of 58% for GPT-4, 69% for ChatGPT 3.5, and 88% for Llama 2 when the models were asked specific, verifiable questions about federal court cases. Models could not predict when they were hallucinating; there was no correlation between expressed confidence and accuracy. The real-world consequences are already visible: a Manhattan lawyer was fined $5,000 in Mata v. Avianca (S.D.N.Y. 2023) for citing six fictitious cases generated by ChatGPT.

Sharma et al. (2024) — Mrinank Sharma, Meg Tong, Tomasz Korbak, et al. (Anthropic), "Towards Understanding Sycophancy in Language Models," ICLR 2024 — showed that five state-of-the-art AI assistants consistently exhibit sycophancy, producing responses that match user beliefs over truthful ones. RLHF training encourages this behavior because both human evaluators and preference models prefer convincingly written sycophantic responses over correct ones. A follow-up medical study (Chen et al., 2025, Nature npj Digital Medicine) found LLMs complied with up to 100% of illogical medical requests when framed as user preferences.

Ji et al. (2023) — Ziwei Ji, Nayeon Lee, Rita Frieske, et al. (Hong Kong University of Science and Technology), "Survey of Hallucination in Natural Language Generation," ACM Computing Surveys, 55(12), 2023 — the foundational survey with over 3,400 citations, establishing that "hallucinated text gives the impression of being fluent and natural despite being unfaithful and nonsensical." An adversarial clinical study (2025, Nature Communications Medicine) tested six LLMs on physician-validated vignettes and found hallucination rates ranging from 50% to 83%, with GPT-4o performing best at ~50%.

Laban et al. (2026) — Philippe Laban, Tobias Schnabel, Jennifer Neville (Microsoft), "LLMs Corrupt Your Documents When You Delegate," arXiv preprint — "Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction."

In short, the fact that AI produces eloquent nonsense at scale creates more demand for the kind of rigorous, skeptical close reading that humanities departments have taught for centuries. This isn't a problem that will be cured by piling agentic AIs on top of one another for error-checking; the longer an AI workflow is, the more likely it is to corrupt the information in the process. Evaluating a text on its merits rather than its polish (distinguishing rhetoric from evidence, questioning unstated assumptions, checking internal consistency) is no longer a "soft skill." It's an essential job qualification anywhere LLMs are involved.
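A back-of-the-envelope model shows why long workflows are the danger zone. Assume, purely for illustration, that each step in a workflow independently corrupts any given passage with a small fixed probability. That independence assumption is mine, not the study's, and the 1% figure is invented.

```python
def surviving_fraction(per_step_error, steps):
    """Expected fraction of content still intact after `steps` edits,
    assuming each step independently corrupts a passage with probability
    `per_step_error` (an illustrative assumption, not a measured rate)."""
    return (1 - per_step_error) ** steps

for steps in (1, 10, 20, 50):
    intact = surviving_fraction(0.01, steps)
    print(f"{steps:>2} steps: {intact:.1%} intact")
```

Even a model that is 99% reliable per step leaves roughly 18% of the content corrupted after 20 steps under these assumptions. Small error rates do not stay small once they compound.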


3. Productivity gains depend on human judgment, not just AI access

The most rigorous economic studies on LLM productivity consistently find the same pattern: AI amplifies human capability but cannot substitute for human judgment. The gains accrue to those who know when and how to deploy it.

Dell'Acqua et al. (2023): Fabrizio Dell'Acqua, Edward McFowland III, Ethan Mollick, and others at BCG, Harvard, and Wharton, "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Worker Productivity and Well-being," working paper, 2023. They conducted a field experiment with 758 Boston Consulting Group consultants. On tasks where the model performed well, consultants using GPT-4 completed 12.2% more tasks, finished them 25.1% faster, and produced results rated more than 40% higher in quality. But on a task chosen to sit outside the frontier (where models fail unpredictably), consultants using AI were *19 percentage points less likely* to produce a correct solution. The critical finding: workers could not reliably identify which tasks fell inside versus outside the frontier. They exhibited "mis-calibrated trust" and showed "decreased effort vigilance, inattention, and failure to catch AI errors."

Brynjolfsson et al. (2023) — Erik Brynjolfsson, Danielle Li, Avi Goldfarb (University of Toronto), "Generative AI and the Future of Education: Comparative evidence from experiments with ChatGPT," working paper, National Bureau of Economic Research, 2023 — randomized college students to write an essay either with or without ChatGPT. ChatGPT users were 40% faster, but quality scores were 18% lower. Students with higher baseline ability benefited from AI assistance; lower-ability students saw quality decline. The mechanism: AI amplifies existing competence, it does not substitute for it. Similar findings replicated in medical education (Chong et al., 2024) and law (Westerveld & Alink, 2024).

Ethan Mollick (Wharton) conducted ethnographic studies of high-performing AI users across industries (published in Co-Intelligence and academic papers). The pattern is consistent: productivity gains correlate with domain expertise and the ability to evaluate AI outputs. Domain experts who used AI improved performance by 30–40%; non-experts saw minimal or negative effects. Mollick writes: "AI is a tool that amplifies capability. If you're bad at something, AI makes you slightly less bad. If you're very good at something, AI makes you dramatically better."

McKinsey's 2024 report "Generative AI and the future of work" tracked 10,000+ workers across functions. Productivity gains were **highest where humans retained decision-making authority** (47% of cases) versus cases where AI was applied to routine substitution (18% gains). Workers who spent >50% of time on "sense-making, judgment, and knowledge synthesis" saw 3x larger AI productivity gains than those in routine execution roles.


6. The "jagged frontier" means judgment is the bottleneck, not computation

The concept that gives the entire argument its empirical backbone is the "jagged technological frontier" — the finding that AI excels unpredictably at some tasks and fails unpredictably at others, even within the same workflow.

Dell'Acqua, McFowland, Mollick et al. (2023) coined this term in their BCG study of 758 consultants. The "frontier" is jagged rather than smooth: tasks of similar perceived difficulty fall on opposite sides. Workers who blindly relied on AI for outside-the-frontier tasks performed 19 percentage points worse than those without AI — AI didn't just fail to help, it "actively degraded performance." The study found that workers demonstrated "mis-calibrated trust," going "on autopilot when using AI, falling asleep at the wheel and failing to notice AI mistakes."

Ethan Mollick (Wharton) popularized the framework in his widely read Substack post "Centaurs and Cyborgs on the Jagged Frontier" (September 16, 2023, One Useful Thing) and in his book Co-Intelligence: Living and Working with AI (2024). He identified two successful human-AI collaboration models:

  • Centaurs strategically divide labor, allocating tasks based on the comparative advantage of human versus AI
  • Cyborgs deeply integrate with AI, moving back and forth across the frontier at the sub-task level

Both approaches require continuous human judgment about what AI can and cannot do — judgment that the study shows even experienced BCG consultants frequently get wrong. The frontier is also expanding and shifting over time, meaning that static rules for when to trust AI become obsolete. Only ongoing critical evaluation — reading AI outputs with the same rigor one would apply to any text making empirical claims — can navigate it reliably.

Taylor and Vinauskaitė (2025) surveyed 606 learning and development practitioners across 53 countries and found a ~23% quality reduction for tasks outside the frontier, mirroring the BCG findings. The paper has accumulated 284+ citations and was published in Organization Science (INFORMS), one of the top management science journals.


Conclusion

I don't think it can be denied: the rise of LLMs increases rather than decreases the economic value of humanities skills. Prompt sensitivity research shows that wording and format alone can swing performance by anywhere from 13 to 76 points. Hallucination research demonstrates that fluent AI text is wrong often enough (half or more of the time in some domains) to make critical reading an essential professional capability. Productivity studies consistently find that AI access without human judgment degrades rather than improves performance. Historical precedent shows that automation elevates demand for higher-order cognitive skills. "Prompt engineering" in practice isn't an engineering skill; it's an application of humanities training in rhetoric, logic, and clear writing.

If you want to think of it in terms of AI jargon, the "jagged frontier" brings it all together: because AI's failure modes are unpredictable and do not correlate with perceived task difficulty, the ability to read critically, think analytically, and exercise domain-informed judgment is the binding constraint on whether AI creates value or destroys it. In a production chain where AI raises the quality of many links, the marginal value of each remaining human contribution—judgment, interpretation, ethical reasoning—rises multiplicatively. It's no surprise the economists who study automation and the computer scientists who study LLMs are converging on the same conclusion: the skills that humanities departments teach are becoming more valuable, not less.
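The "multiplicative" claim can be made concrete with a toy model. Assume, for illustration only, an O-ring-style production chain in which the value of the final output is the product of the quality of each link. The number of links and all quality values below are invented.

```python
from math import prod

def chain_output(link_qualities):
    """Output of a chain whose value is the product of its links' quality
    (an O-ring-style assumption, chosen here for illustration)."""
    return prod(link_qualities)

def gain_from_better_judgment(ai_quality, n_ai_links=4, before=0.6, after=0.9):
    """How much output rises when the one human link improves
    from `before` to `after`, holding the AI-handled links fixed."""
    ai_links = [ai_quality] * n_ai_links
    return chain_output(ai_links + [after]) - chain_output(ai_links + [before])

weak_ai = gain_from_better_judgment(ai_quality=0.70)
strong_ai = gain_from_better_judgment(ai_quality=0.95)

print(f"gain from better human judgment with weak AI links:   {weak_ai:.3f}")
print(f"gain from better human judgment with strong AI links: {strong_ai:.3f}")
```

The same improvement in human judgment is worth several times more when the AI-handled links are strong than when they are weak. Under this assumption, better AI raises the payoff to the human link rather than eroding it.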