RL Reigns Supreme

RL: The Dawn of a New Era
Reinforcement learning (RL)—where agents learn optimal behaviors by interacting with their environments—has powered some of AI’s most iconic breakthroughs and promises to unlock the next wave of advanced capabilities. From DeepMind’s AlphaGo mastering the game of Go to AlphaDev uncovering faster sorting algorithms, RL has repeatedly proven its ability to drive discovery and performance in highly structured domains.

Reinforcement learning from human feedback (RLHF) emerged to align large language models with human preferences; more recently, reinforcement learning from AI feedback (RLAIF) has swapped human annotators for AI-based evaluators. At the heart of DeepSeek’s advanced reasoning lies Group Relative Policy Optimization (GRPO), which sidesteps the need for an accurate learned value function by scoring each sampled response against the others in its group, keeping training stable and efficient even in sparse or unstructured reward settings. As Rich Sutton famously observed in “The Bitter Lesson,” general-purpose methods empowered by scale and experience keep proving themselves.
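As a rough, illustrative sketch of that group-relative idea (not DeepSeek’s exact implementation), each completion’s reward is normalized against the mean and spread of rewards within its own sampling group, so no separate critic network is needed:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for one group of completions sampled from the
    same prompt: A_i = (r_i - mean(r)) / std(r), replacing a learned critic."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four completions for one prompt, graded 1.0 (pass) or 0.0 (fail).
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Positive advantages reinforce the passing completions; negative ones push
# probability mass away from the failures.
```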
Today’s RL frontier increasingly leans on diverse environments with programmatically verifiable rewards. Models reason through tasks during test-time computation, with human-designed autograders supplying precise signals to reinforce successful reasoning paths.
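As a toy example of such an autograder (the task and grading rule here are hypothetical), a math-reasoning rollout can be scored by extracting the model’s final numeric answer and comparing it to a reference:

```python
import re

def grade_math_answer(model_output: str, reference_answer: str) -> float:
    """Hypothetical autograder: take the last number in the model's reasoning
    trace as its final answer and return a binary reward."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0  # no answer produced, no reward
    return 1.0 if numbers[-1] == reference_answer else 0.0

# Successful reasoning paths earn reward 1.0 and get reinforced.
reward = grade_math_answer("... so the total cost is 42 dollars.", "42")
```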
Yet many real-world challenges inherently lack clear ground truths, involving subjective or conflicting expert opinions—making it hard to craft dense, reliable rewards. A particularly promising domain is computer use—tasks like coding, editing text, or operating complex software—where actions are discrete and outcomes can often be programmatically validated despite ambiguities around what makes an edit or function truly “good.”
With open-source frameworks like Gymnasium and RLlib democratizing infrastructure, RL experimentation is no longer confined to research labs. But adoption in enterprise environments remains limited—a point that sparked lively debate during our recent “Building the Next Cursor” Dinner. This post explores why momentum is finally building: the technical hurdles being overcome, new opportunities in RL infrastructure and applications, and strategic guidance for startups poised to lead this frontier.
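For readers new to these frameworks, Gymnasium’s standard API reduces RL interaction to a reset/step loop; the minimal example below runs a random policy on the built-in CartPole task:

```python
import gymnasium as gym

# Minimal interaction loop using the standard Gymnasium API.
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
env.close()
```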
Why Generalized RL Falls Short in Practice
Despite impressive advances, today’s RL systems struggle to generalize beyond narrow training contexts:
- Limited environment diversity and scale: Real-world robustness demands exposure to vastly more diverse and verifiable environments—at far greater scale—to foster more general skills and complete longer-range tasks.
- Reward hacking undermines learning integrity: Agents frequently exploit loopholes in ambiguous reward signals. Researchers have focused on high-value, verifiable domains—such as software coding and formal reasoning—where correctness can be programmatically validated, though even these are susceptible to exploitation. Approaches like Reinforcement Learning from Verifiable Rewards (RLVR) anchor learning to objective outcomes, reducing reward hacking. Recent work, like Tulu 3, demonstrates how integrating automated validation pipelines improves agent reliability by discouraging unintended behaviors.
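One way to picture why verifiable rewards blunt reward hacking (a toy sketch, not how Tulu 3 or any specific RLVR pipeline is implemented): grade the agent’s code on freshly sampled inputs against an objective property, so memorizing or hard-coding outputs for a fixed test set stops paying off.

```python
import random

def verifiable_sort_reward(candidate_sort, num_cases: int = 20) -> float:
    """Property-based verifier: score a sorting function on randomly sampled
    inputs each episode, so the reward cannot be gamed by memorizing cases."""
    passed = 0
    for _ in range(num_cases):
        xs = [random.randint(-100, 100) for _ in range(random.randint(0, 12))]
        try:
            if candidate_sort(list(xs)) == sorted(xs):
                passed += 1
        except Exception:
            pass  # crashes simply earn no credit on this case
    return passed / num_cases
```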
Looking ahead, future general models like o5 may one-shot many tasks thanks to richer RL environments. Yet humans possess open-ended capabilities, and enterprise-critical tasks demand tailored solutions. Believing a single model can reliably handle every scenario—even unseen or adversarially designed ones—remains unrealistic. A custom RL agent, however, can learn through ongoing interaction and feedback, adapting to nuances that would trip up general models.
This sets the stage for the emergence of the following categories: 1) robust, realistic simulation environments to push model capabilities, 2) Reinforcement Learning-as-a-Service (RLaaS) to empower enterprises to harness the full potential of custom RL at scale, and 3) vertical-specific applications of RL that drive new discoveries in domains well outside the focus of the big labs.
The Need for Robust Environments to Train the Next Generation of Foundation Models
The potential of autonomous agents managing intricate software suites—like Salesforce, Office, Atlassian, Adobe, or Blender—represents a massive opportunity. But to realize this, agents must train in environments indistinguishable from the software they’ll operate on. Mechanize recently coined Replication Training, a paradigm where agents recreate existing software workflows with high fidelity. Similarly, Habitat is developing environments offering “hundreds of diverse, programmatically verifiable problems just out of reach of current models.”
Companies are building application replicas that capture every UI interaction and state transition, ensuring precise grading of agent performance. To remain useful, these replicas must keep pace with frequent SaaS updates—often weekly—and offer advanced debugging tools. Accuracy is paramount, as perfectly matching reference behavior becomes the definitive evaluation metric.
Human-data startups are increasingly using expert knowledge to define the rewards LLMs train on and to build specific RL environments and evaluations for agents to learn from. Mercor signaled in a recent public post that it is doubling down on evals for the RL era, and Datacurve is building RL environments for repo-wide code evaluation and verification tasks.
Initially, humans need to provide models with a baseline understanding and enable them to start receiving useful reward signals. Then, RL on the custom environments will be able to capture the full complexity of real-world tasks, allowing compute to be directly converted into performance improvements. Critically, these hyper-realistic simulation environments demand:
- Large-scale, high-complexity simulations: Modeling modern, dynamic software interfaces accurately at scale.
- Deterministic outcomes despite rapid updates: Ensuring stable evaluation even as SaaS platforms evolve.
- Robust, comprehensive tests: Capturing long-horizon task performance with confidence.
- Compliance with security and licensing constraints: Restricting search API access to prevent competitive misuse, as evidenced by Slack's recent move.
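A minimal sketch of what such a replica can look like as a custom Gymnasium environment (the `InvoiceAppReplica` below is entirely hypothetical): state transitions are deterministic, and reward is granted only when the agent reaches the reference application’s final state.

```python
import gymnasium as gym
from gymnasium import spaces

class InvoiceAppReplica(gym.Env):
    """Hypothetical replica of a tiny invoicing UI: three discrete actions,
    deterministic transitions, reward only for matching the reference outcome."""

    def __init__(self):
        self.observation_space = spaces.Discrete(4)  # 0=start, 1=form open, 2=filled, 3=submitted
        self.action_space = spaces.Discrete(3)       # 0=open form, 1=fill fields, 2=submit
        self._state = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self._state = 0
        return self._state, {}

    def step(self, action):
        # Deterministic transition: each action only succeeds from the right state.
        if action == self._state and self._state < 3:
            self._state += 1
        terminated = self._state == 3
        reward = 1.0 if terminated else 0.0  # graded against the reference behavior
        return self._state, reward, terminated, False, {}
```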
The Need for RL-as-a-Service (RLaaS) in Enterprises
For enterprises, RLaaS represents the application of RL directly to revenue-critical workflows and proprietary data—far beyond the experiments often seen in research labs. Think high-traffic customer websites, complex ERP order flows, or vast corpuses of internal documents—where RL can unlock transformative efficiency and performance.
Enterprise RL demands billion-parameter networks alongside tens of thousands of parallel actors simulating agent behavior across diverse, realistic environments. Since few enterprises have the expertise or infrastructure to build and maintain such systems, RLaaS vendors provide managed frameworks with automated data mocking, environment orchestration, and powerful debugging tools—letting teams iterate rapidly and safely without touching sensitive live systems.
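To make the scale requirement concrete, here is a framework-agnostic sketch (with a dummy stand-in for the actual policy and environment) of fanning rollouts out across worker processes and aggregating their returns; production RLaaS stacks do this across clusters with orchestration, retries, and data mocking layered on top.

```python
from concurrent.futures import ProcessPoolExecutor

def run_rollout(seed: int) -> float:
    """Stand-in for one actor: run a policy in its own environment copy and
    return the episode return. Replace the body with a real env + policy."""
    import random
    random.seed(seed)
    return sum(random.random() for _ in range(10))  # dummy episode return

if __name__ == "__main__":
    # Fan out many independent rollouts; scale max_workers to the cluster size.
    with ProcessPoolExecutor(max_workers=8) as pool:
        returns = list(pool.map(run_rollout, range(1000)))
    print(f"mean return over {len(returns)} rollouts: {sum(returns) / len(returns):.3f}")
```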
A typical enterprise-grade RL loop includes:
- Custom Reward Modeling: Enterprises define KPIs—conversion rates, compliance metrics, churn reduction—which are translated into precise, sophisticated reward functions tailored to each task (a sketch follows this list).
- Autograders: Automated pipelines rigorously score rollouts against deterministic tests or regression suites.
- Customizing Model Checkpoints: RLaaS providers increasingly adapt open-source models by customizing specific layers to align with enterprise-specific objectives—a process related to model merging, which Mira Murati reportedly plans to leverage at Thinking Machines.
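As a hedged sketch of the reward-modeling and autograding steps above, with invented KPI names and weights, an enterprise reward function might blend business metrics with hard compliance checks produced by an automated grading pipeline:

```python
from dataclasses import dataclass

@dataclass
class RolloutOutcome:
    converted: bool          # did the simulated customer convert?
    handle_time_s: float     # how long the agent took
    policy_violations: int   # count flagged by an automated compliance checker

def enterprise_reward(o: RolloutOutcome) -> float:
    """Translate illustrative KPIs into a scalar reward.
    Weights are hypothetical and would be tuned with domain experts."""
    reward = 1.0 if o.converted else 0.0
    reward -= 0.001 * o.handle_time_s    # mild pressure toward efficiency
    reward -= 5.0 * o.policy_violations  # compliance failures dominate
    return reward

# An autograder would build RolloutOutcome objects by replaying each rollout
# against deterministic tests, then feed the resulting rewards into training.
```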
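And as a rough illustration of customizing specific layers via merging (a generic linear-interpolation sketch assuming PyTorch-style state dicts, not any particular provider’s method):

```python
import torch

def merge_selected_layers(base_sd, tuned_sd, prefixes=("transformer.h.30.",), alpha=0.5):
    """Linearly interpolate only the layers whose names match `prefixes`
    (illustrative), keeping the rest of the base checkpoint untouched."""
    merged = {}
    for name, w in base_sd.items():
        if name.startswith(prefixes):
            merged[name] = (1 - alpha) * w + alpha * tuned_sd[name]
        else:
            merged[name] = w.clone()
    return merged
```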
Engagements often start with an upfront fee covering specialized Reinforcement Fine-Tuning (RFT) services—dedicated compute and expert reward engineering—followed by recurring subscriptions for continuous updates and monitoring to keep agents aligned with shifting business objectives. Leading RLaaS providers differentiate themselves with proprietary reward-engineering frameworks, translating nuanced KPIs into robust, verifiable reward signals, supported by human-in-the-loop feedback. Critically, enterprises owning proprietary simulation environments accumulate compounding advantages, tailoring every element—telemetry, user personas, metrics—to their unique needs and locking in strategic differentiation.
The real-world impact of this approach is exemplified by a case from Veris AI: an agent trained with RL to automate the complex, hours-long process of supplier negotiations. By training on realistic simulations of Slack and email conversations—complete with sensitive data—the agent learns optimal tone, questions to ask, and search strategies, dramatically outperforming prompt chaining or one-shot LLM attempts.
RL’s Enterprise Moment
We’re at a pivotal juncture: RL is poised to transition from research labs to enterprise-critical infrastructure. There are quite a few companies that are 1) bringing RL to the enterprise (e.g. Applied Compute, Thinking Machines, Veris, RunRL, OpenPipe, Conway, Osmosis) and 2) building robust RL environments for the AI labs (e.g. Mechanize, Matrices, Fleet, Habitat, Plato, Deeptune, Theta, Halluminate, Hud).
We’re convinced RL will unlock superhuman performance across industries and workflows. If you’re building the infrastructure or applications pushing this frontier, we’d love to connect—reach out at ivory@chemistry.vc.