Karpathy’s Autoresearch: AI Agents Automate Scientific Discovery & Boost Efficiency

The Automation of Scientific Discovery: Andrej Karpathy’s ‘Autoresearch’ Project

Over the weekend, Andrej Karpathy, a prominent figure in artificial intelligence, formerly with Tesla and OpenAI, and known for coining the term “vibe coding,” released autoresearch, a new open-source project designed to automate the process of scientific experimentation. This isn’t a polished product or a large-scale corporate undertaking; rather, it’s a concise 630-line script, available on GitHub under the permissive MIT License, with the ambitious goal of enabling AI agents to conduct research autonomously, even as developers sleep. The project has quickly gained traction, sparking discussion about the potential to accelerate progress across numerous fields.

How Autoresearch Functions: An Autonomous Optimization Loop

At its core, autoresearch operates as an autonomous optimization loop. An AI agent is provided with a training script and a defined compute budget—typically around 5 minutes on a GPU. The agent then reads its own source code, formulates a hypothesis for improvement (such as adjusting a learning rate or modifying the architecture depth of a neural network), implements the change, runs the experiment, and evaluates the results. If the validation loss—measured in bits per byte (val_bpb)—decreases, the change is retained; otherwise, it’s reverted, and the agent attempts a different modification. Karpathy demonstrated the system’s capabilities with an overnight run where the agent completed 126 experiments, reducing the loss from 0.9979 to 0.9697.
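The keep-or-revert loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy's script: run_experiment and propose_change are hypothetical stand-ins (a toy loss formula instead of a real 5-minute training run, and a random hyperparameter tweak instead of an AI agent's hypothesis), and the starting configuration is invented for the example.

```python
import random

def run_experiment(config):
    """Hypothetical stand-in for a short training run; returns val_bpb (lower is better)."""
    # Toy loss surface with its minimum at lr=0.02, depth=12. Not a real model.
    return 0.95 + 50.0 * (config["lr"] - 0.02) ** 2 + 0.001 * abs(config["depth"] - 12)

def propose_change(config, rng):
    """Hypothetical stand-in for the agent's hypothesis: perturb one hyperparameter."""
    candidate = dict(config)
    if rng.random() < 0.5:
        candidate["lr"] = max(1e-4, candidate["lr"] * rng.choice([0.5, 0.8, 1.25, 2.0]))
    else:
        candidate["depth"] = max(1, candidate["depth"] + rng.choice([-2, -1, 1, 2]))
    return candidate

def keep_or_revert_loop(config, n_experiments, seed=0):
    """Greedy loop: keep a change only if it lowers the validation loss."""
    rng = random.Random(seed)
    best_loss = run_experiment(config)
    history = [best_loss]
    for _ in range(n_experiments):
        candidate = propose_change(config, rng)
        loss = run_experiment(candidate)
        if loss < best_loss:
            # Improvement: retain the change.
            config, best_loss = candidate, loss
        # Otherwise the candidate is discarded, which is the "revert" step.
        history.append(best_loss)
    return config, best_loss, history

start = {"lr": 0.1, "depth": 6}  # invented starting point for the toy example
final_config, final_loss, history = keep_or_revert_loop(start, n_experiments=126)
print(final_config, round(final_loss, 4))
```

Because a change is kept only when the loss strictly decreases, the best loss recorded in history can never go up from one experiment to the next, which is the property that makes an unattended overnight run safe to leave alone.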

Beyond Initial Results: Scaling and Transferable Improvements

Recent updates from Karpathy reveal even more promising results. After two days of tuning a “depth=12” model, the agent autonomously processed approximately 700 changes. Crucially, the agent identified around 20 improvements that seamlessly transferred to larger models. These combined improvements resulted in an 11% efficiency gain in the “Time to GPT-2” metric—a benchmark for training large language models—reducing the time from 2.02 hours to 1.80 hours. Karpathy noted that the agent identified oversights in attention scaling and regularization that had eluded him after two decades of experience, highlighting the potential for automated systems to surpass human intuition in certain areas.
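As a quick check of the arithmetic, the two reported times can be compared directly. The snippet below assumes the 11% figure refers to the reduction in wall-clock time; the equivalent speedup factor is shown alongside for comparison, and the variable names are illustrative only.

```python
baseline_hours = 2.02  # reported "Time to GPT-2" before the agent's changes
improved_hours = 1.80  # reported time after the changes

# Fraction of training time saved relative to the baseline.
time_saved = (baseline_hours - improved_hours) / baseline_hours

# Equivalent speedup factor (how many times faster the run finishes).
speedup = baseline_hours / improved_hours

print(f"time saved: {time_saved:.1%}")  # time saved: 10.9%
print(f"speedup: {speedup:.3f}x")       # speedup: 1.122x
```

The time saved comes to about 10.9%, which rounds to the reported 11%.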

The ‘Karpathy Loop’ Spreads: Emergent Strategies in a Distributed Network

The release of autoresearch quickly resonated within the AI and machine learning community, prompting rapid experimentation and adaptation. Varun Mathur, CEO of AI tool aggregator Hyperspace AI, took the single-agent loop and distributed it across a peer-to-peer network. This created a system where each node running the Hyperspace agent functioned as an independent researcher. On the night of March 8–9, 35 autonomous agents on the network collectively ran 333 experiments unsupervised, revealing several emergent strategies. As Mathur observed on X, hardware diversity became a key factor: H100 GPUs favored aggressive learning rates, while CPU-only agents on laptops focused on initialization strategies and normalization choices due to their limited computational resources. The network also demonstrated a “gossip-based discovery” mechanism, where successful improvements, such as the use of Kaiming initialization, spread rapidly through the system, with 23 other agents incorporating the discovery within hours. Remarkably, the agents independently rediscovered established machine learning techniques, such as RMSNorm and tied embeddings, that had taken human researchers years to develop.
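To give a feel for how a single discovery might spread through such a network, here is a toy gossip simulation. It is only a sketch under stated assumptions and does not reflect Hyperspace's actual protocol: each round, every agent polls one randomly chosen peer and adopts that peer's result if it is better. The function name gossip_spread, the loss values, and the pull-based rule are all hypothetical.

```python
import random

def gossip_spread(n_agents=35, seed=0, max_rounds=50):
    """Toy pull-based gossip: returns how many agents hold the discovery after each round."""
    rng = random.Random(seed)
    baseline_loss, improved_loss = 1.00, 0.98  # invented values for illustration
    best = [baseline_loss] * n_agents
    best[0] = improved_loss  # agent 0 has made the discovery
    adopters_per_round = [1]
    for _ in range(max_rounds):
        snapshot = list(best)  # every agent sees the state from the start of the round
        for i in range(n_agents):
            peer = rng.randrange(n_agents)
            if snapshot[peer] < best[i]:
                # The peer's result is better, so adopt it.
                best[i] = snapshot[peer]
        adopters = sum(1 for loss in best if loss == improved_loss)
        adopters_per_round.append(adopters)
        if adopters == n_agents:
            break
    return adopters_per_round

adopters_per_round = gossip_spread()
print(adopters_per_round)
```

In a real network each agent would presumably re-run the change on its own hardware before keeping it, so adoption would be slower and less uniform than in this idealized model.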

From Machine Learning to Marketing: Expanding the Scope of Automated Experimentation

The implications of autoresearch extend far beyond the realm of machine learning. Eric Siu, founder of ad agency Single Grain, applied the concept to marketing, envisioning a future where marketing teams can run tens of thousands of experiments annually. Siu argued on X that current marketing teams typically conduct around 30 experiments per year, while the next generation of systems could easily run 36,500 or more. His framework involves replacing the training script with a marketing asset—a landing page, ad creative, or email—and allowing the agent to modify variables, measure results (such as positive reply rates), and retain or discard changes. This approach, Siu believes, will create a “proprietary map” of what resonates with a specific audience, providing a competitive advantage based on accumulated experimental data.
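The scale of the jump Siu describes is easy to work out. The calculation below assumes the 36,500 figure is spread evenly over a 365-day year; that daily breakdown is an inference from the number itself, not something stated above.

```python
current_per_year = 30        # experiments a typical team runs today, per Siu
projected_per_year = 36_500  # figure cited for next-generation systems

# Assumption: experiments are spread evenly across a 365-day year.
projected_per_day = projected_per_year / 365
scale_up = projected_per_year / current_per_year

print(f"{projected_per_day:.0f} experiments per day")  # 100 experiments per day
print(f"about {scale_up:.0f}x more experiments")       # about 1217x more experiments
```

Under that assumption, the projection amounts to 100 experiments per day, roughly 1,200 times the current pace.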

Community Discussion and Potential Pitfalls

The rapid adoption of autoresearch has also prompted thoughtful discussion within the GitHub community. Researcher alexisthual raised concerns about the potential for “spoiling” the validation set: the risk that running many experiments against the same validation data leads to overfitting to its specific characteristics, rather than achieving genuine generalization. Karpathy responded by emphasizing that the system is focused on optimizing performance per compute, and that the observed gains are real and substantial. Another user, samionb, questioned the significance of the initial loss reduction, to which Karpathy reiterated the importance of incremental improvements in performance. Witcheer, Head of Growth at Yari Finance, documented their own overnight run on a Mac Mini M4, noting that, amid the failed experiments, the successful ones revealed that simpler models often perform better, an insight reached without human intervention.

The Future of Research: From Experimenter to Experimental Designer

The release of autoresearch signals a potential shift in the nature of research across various domains. As tools like DarkMatter, Optimization Arena, and NanoClaw emerge to support this type of automated experimentation, the primary role of humans may evolve from “experimenter” to “experimental designer”—focusing on defining the constraints and objectives of the search, rather than manually conducting the experiments themselves. The bottleneck to progress, Karpathy suggests, is no longer the speed of human coding, but our ability to formulate effective research questions. Andrej Karpathy has, once again, altered the landscape, moving us toward a future where machines learn while we sleep.