Monday, September 29, 2025

Ethan Mollick on the Boom in Agentic AI - Why It's Here, Why It Matters

An outstanding Substack article by academic Ethan Mollick on the revolution in agentic AI.

https://www.oneusefulthing.org/p/real-ai-agents-and-real-work

The article contains two citations, which I gave to ChatGPT 5 along with Mollick's article.

See Patwardhan on task models. See the multi-hour tasks here (click to expand each box).
See Sinha on AI returns.



### Rest of this blog post by ChatGPT 5.

Ethan Mollick’s recent essay argues that AI has reached a tipping point: it can now complete real, economically valuable tasks, not just academic benchmarks. Drawing on OpenAI’s new GDPval benchmark, he notes that frontier models like GPT-5 and Claude Opus 4.1 produce work that rivals experienced professionals in about half of the tested cases, with most shortcomings tied to formatting and instruction-following, areas that are improving rapidly.

Mollick highlights how autonomous agents can now execute long, multi-step workflows such as replicating a published economics study, potentially transforming whole domains of work. The GDPval study itself evaluates 1,320 realistic tasks across 44 occupations and shows that hybrid workflows—where experts use AI for first drafts but retain oversight—can cut costs by roughly 60% and time by 40%. 

A second supporting paper, “Illusion of Diminishing Returns,” explains why agentic ability has surged: small gains in per-step accuracy compound into exponential improvements in the length of tasks a model can complete. It also identifies “self-conditioning,” in which standard LLMs compound their own mistakes, and shows that newer “thinking” models largely overcome it, enabling far longer autonomous task execution. 

Together these works suggest AI’s immediate impact will be task-level transformation and productivity gains, not wholesale job replacement.


###

Here’s a structured synthesis of the three key pieces you asked about — Ethan Mollick’s Substack essay (Sept 29 2025) and the two supporting papers: GDPval (Patwardhan et al., OpenAI) and Illusion of Diminishing Returns in AI (Sinha et al.).


1. Mollick’s Framing: “Real AI Agents and Real Work”

Mollick’s post argues that AI has crossed a threshold: it can now perform real, economically relevant tasks, not just benchmark puzzles.

  • Task-not-Job distinction.
    OpenAI’s new benchmark (GDPval) asked experts to design tasks that take 4–7 hours for a skilled human (finance, law, retail, etc.).  See tasks.

  • Current frontier models perform nearly at human level, but the “gap” is mostly due to formatting, instruction-following, and presentation — things that are improving fast.
    ⇒ AI can do many tasks, but not all the heterogeneous human interactions that constitute a full job.

  • Replication example.
    Mollick describes giving Claude 4.5 a published economics paper plus its replication data. The model autonomously read the paper, converted the Stata code to Python, and successfully reproduced the findings, something that normally takes a human expert many hours.
    ⇒ Shows that an entire workflow step (checking published research) may shift from a human-bottleneck to a scalable AI-driven process.

  • Rise of agents.
    Progress has come not merely from larger models but from agentic ability: the capacity to execute long multi-step workflows using tools (code, search, file handling) with less human babysitting. Small improvements in model reliability compound to large gains in how long a chain of steps an agent can carry out.

  • Two paths for organizations.

    1. Cost-cutting substitution: replace humans with AI for existing tasks.

    2. Transformative complementarity: redeploy human effort to higher-value activities while AI handles well-specified sub-tasks.
      Mollick warns against a future of “infinite PowerPoints” — using agents to crank out more of the same low-value content — and urges deliberate design of workflows.

  • Hybrid workflow benefit.
    The OpenAI paper shows that if experts delegate to AI for a first draft, iterate a couple of times, and then finish manually if still unsatisfied, they can cut cost by ~60% and time by ~40%.
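
That delegate, review, iterate, then finish-manually pattern is easy to picture as a loop. A minimal Python sketch follows; the helper functions are illustrative stubs standing in for a model call and human judgment, not anything from the OpenAI paper.

```python
# Minimal sketch of the hybrid "try-the-model-then-fix-if-needed" workflow.
# draft_with_model(), expert_review(), and finish_manually() are placeholder
# stubs (a real setup would call a frontier model and ask a human reviewer).

MAX_ROUNDS = 3  # iterate a couple of times before falling back to manual work

def draft_with_model(task_brief: str, feedback: str) -> str:
    # Placeholder: call a model with the brief plus any reviewer feedback.
    if feedback:
        return f"Draft for {task_brief!r}, revised to address: {feedback}"
    return f"First draft for {task_brief!r}"

def expert_review(draft: str) -> tuple[bool, str]:
    # Placeholder: a human expert either accepts the draft or returns feedback.
    if "revised" in draft:
        return True, ""
    return False, "tighten the executive summary"

def finish_manually(task_brief: str, last_draft: str) -> str:
    # Placeholder: the expert takes over and finishes the deliverable by hand.
    return last_draft + " [finished manually by the expert]"

def hybrid_workflow(task_brief: str) -> str:
    draft, feedback = "", ""
    for _ in range(MAX_ROUNDS):
        draft = draft_with_model(task_brief, feedback)
        accepted, feedback = expert_review(draft)
        if accepted:
            return draft                       # delegation succeeded
    return finish_manually(task_brief, draft)  # fall back to unaided expert work

if __name__ == "__main__":
    print(hybrid_workflow("Q3 revenue summary slide deck"))
```

The savings come from the first branch: most tasks end at an accepted AI draft, and the expert's time shifts to review rather than production.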


2. GDPval Benchmark (Patwardhan et al., OpenAI)

The paper that Mollick cites provides the empirical backbone.

  • Scope & realism.
    – Covers 44 occupations in the top 9 GDP-contributing sectors (e.g., finance, law, healthcare, media).
    – Tasks are drawn from real industry work-product (e.g., reports, slide decks, CAD files, spreadsheets, marketing copy), not synthetic exam-style prompts.
    – Average expert completion time: 7 h per task.
    – Each task reviewed by multiple professionals for realism and value.

  • Evaluation.
    Blinded human experts compared model vs human deliverables.
    – The best current models (Claude Opus 4.1, GPT-5) already match or beat human experts in ≈ 48% of tasks.
    – Claude tended to win on aesthetics / formatting, GPT-5 on accuracy / instruction-following.
    – Instruction-following and formatting, not conceptual reasoning, are still the biggest weaknesses.

  • Efficiency upside.
    – A hybrid “try-the-model-then-fix-if-needed” workflow yields major time- and cost-savings over unaided human experts.
    – Careful prompting and scaffolding (e.g., forcing multimodal self-inspection of outputs) further boosts reliability; a sketch of that self-inspection pattern follows this list.

  • Limits.
    – Currently restricted to digital knowledge-work tasks; excludes physical work, deep interpersonal interaction, and tasks requiring proprietary context.
    – Interactive, ill-specified tasks still degrade model performance.
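
The self-inspection scaffolding mentioned in the efficiency bullet above can be sketched as a short loop: generate the deliverable, render it so layout problems are visible, and have the model critique the rendered result before finalizing. The function names and logic below are illustrative assumptions, not GDPval's actual harness.

```python
# Hypothetical sketch of a self-inspection scaffold: generate a deliverable,
# render it, then have the model critique the rendered result before
# finalizing. generate(), render_to_image(), and critique() are stand-ins
# for real model and rendering calls.

MAX_PASSES = 3

def generate(task: str, notes: str) -> str:
    # Placeholder for a model call producing, e.g., a slide deck or report.
    if notes:
        return f"<deliverable for {task!r}, revised to address: {notes}>"
    return f"<first-pass deliverable for {task!r}>"

def render_to_image(deliverable: str) -> bytes:
    # Placeholder for rendering the file (e.g., to PNG/PDF) so that layout
    # and formatting problems become visible to a multimodal reviewer pass.
    return deliverable.encode()

def critique(rendered: bytes) -> str:
    # Placeholder for a multimodal model pass that inspects the rendered
    # output for formatting and instruction-following issues; "" = no issues.
    return "" if b"revised" in rendered else "fix slide margins and title casing"

def self_inspecting_generate(task: str) -> str:
    deliverable, notes = "", ""
    for _ in range(MAX_PASSES):
        deliverable = generate(task, notes)
        notes = critique(render_to_image(deliverable))
        if not notes:          # the reviewer pass found nothing left to fix
            return deliverable
    return deliverable         # return the last revision either way

if __name__ == "__main__":
    print(self_inspecting_generate("quarterly marketing one-pager"))
```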


3. “Illusion of Diminishing Returns” (Sinha et al.)

This preprint gives the theoretical & experimental explanation for the recent leap in “agentic” capability.

  • Key insight:
    – What matters economically is often the length of the task an agent can execute without failure (horizon length), not just its single-step test-score accuracy.
    – Even small improvements in per-step accuracy can yield exponential gains in horizon length once above ≈ 70% accuracy.
    – Thus, curves that look like “diminishing returns” on short-form benchmarks can mask continuing steep gains in long-task execution (a small numerical illustration follows this list).

  • Execution vs Planning.
    – Many failures on long tasks are not due to lack of reasoning but to execution drift — the model knows the plan but makes slips as it proceeds.
    – Standard LLMs self-condition: when their own past mistakes appear in context, they become more likely to make new mistakes.
    – This explains why performance degrades with longer tasks even if each individual step is simple.

  • Thinking models fix it.
    – Recent “thinking” models (e.g., DeepSeek-R1, GPT-5-Horizon) use extra sequential test-time compute and reinforcement-learning alignment.
    – They can execute hundreds–thousands of steps in one go without succumbing to self-conditioning (GPT-5 Horizon > 1000 steps; next best, Claude-4 Sonnet ≈ 400).
    – Hence the strong real-world jump in agentic capability seen by Mollick.

  • Economic implication.
    – If the length of task that can be reliably automated drives value, then continued scaling of compute and reasoning depth may still be economically justified despite slowing gains on traditional benchmarks.
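
The compounding claim in the “Key insight” bullet is just arithmetic on an independence assumption: if every step succeeds with probability p and a single error sinks the task, an n-step task succeeds with probability p^n, so the longest task completable at a given success rate grows like log(rate)/log(p). Here is a small Python illustration; the independence assumption and the 50% threshold are simplifications for exposition, not the paper's exact definitions.

```python
import math

# Horizon length under a simple independence assumption: each step succeeds
# with probability p, one error fails the task, so an n-step task succeeds
# with probability p**n.  The longest task completable at success rate
# `threshold` is then n = ln(threshold) / ln(p).  This is an illustrative
# simplification of the paper's horizon-length framing.

def horizon_length(p: float, threshold: float = 0.5) -> float:
    return math.log(threshold) / math.log(p)

for p in (0.90, 0.99, 0.999):
    print(f"per-step accuracy {p:5.3f} -> ~{horizon_length(p):6.0f} steps at 50% task success")

# Roughly: 0.900 -> 7 steps, 0.990 -> 69 steps, 0.999 -> 693 steps.
# Each extra "nine" of step accuracy multiplies the usable task length by
# about 10x, which is why flat-looking short benchmarks can coexist with
# steep gains in long-task execution.
```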


4. Putting It Together

| Theme | Mollick | GDPval | Illusion of Diminishing Returns |
| --- | --- | --- | --- |
| Problem focus | Practical ability of AI to do economically valuable tasks | Rigorous benchmark of such tasks | Theoretical reason why long-task ability is improving so fast |
| Key take-away | AI now competitive on many discrete tasks → potential to transform workflows | Frontier models ≈ human expert quality in half of tasks; hybrid use brings large efficiency gains | Small step-accuracy gains → huge horizon-length gains; “thinking” agents break past the long-task barrier |
| Main bottleneck | Not reasoning but formatting, instruction-following, workflow integration | Task realism, hybrid human oversight, better prompts | Execution drift & self-conditioning on prior errors |
| Outlook | Shift from narrow benchmarks → real-task productivity | More sectors & interactive tasks to be benchmarked | Continued scaling & agentic advances still yield high returns |

Discussion / Interpretation

  • Shift of unit of analysis: Benchmarks → Tasks → Workflows.
    The three pieces collectively mark a turn away from asking “can a model answer exam questions?” to asking “can it do the work end-to-end — and for how long without failing?”

  • Hybrid human-AI teams as the near-term norm.
    Both Mollick and GDPval emphasize that oversight & review by skilled humans remain crucial for quality and for handling un-AI-able parts of jobs. The upside is a large productivity gain, not yet full displacement.

  • Implications for policy & management.
    – We may see large value in task-level automation well before wholesale occupation-level replacement.
    – Organizations that redesign workflows to exploit long-horizon agents — rather than just swapping them in to cut costs — are likely to reap transformative benefits.
    – Benchmarking and governance need to track agentic execution ability and hybrid-workflow safety.

  • Research directions.
    – Building richer, more interactive GDPval-like suites for tasks requiring tacit knowledge, interpersonal nuance, or safety-critical judgment.
    – Improving agent robustness against self-conditioning and other compounding-error modes.
    – Exploring economic and labor-market modeling that accounts for jagged, domain-specific progress in agentic execution.


Bottom line:
Mollick’s essay captures the real-world inflection point; GDPval quantifies it across occupations; and the Sinha et al. paper explains why agentic execution is leaping forward and why scaling still pays off. Together they portray a moment where AI shifts from “benchmark curiosity” to a genuine co-worker on valuable multi-hour tasks — with profound implications for how work, science, and productivity will be organized in the next few years.