A freelance journalist loses her content marketing career to ChatGPT. Weeks later, she is recruited by a company called Mercor to train the very models that made her redundant. She turns on her webcam, says hello to an AI interviewer named Melvin, signs a contract, installs surveillance software on her laptop, and starts writing rubrics that teach a chatbot how to do her old job better.
She is not alone. Across the data labelling industry, screenwriters are recording themselves pretending to ask chatbots for fitness plans. Graphic designers are tagging Instagram Reels for Meta. Award-winning documentary makers are quietly producing training data to pay rent. A linguist with a master's degree spent a year writing prompts designed to stump language models, only to watch the models catch up. His expertise has been extracted. He has been without work for five months.
This is not a dystopian thought experiment. It is the largest harvesting of human expertise ever attempted, and it is happening right now, at industrial scale, with billions of dollars behind it.
The Human Assembly Line
A joint investigation by The Verge and New York Magazine recently pulled back the curtain on the sprawling human supply chain that powers today's frontier AI models. The picture it paints is striking in its complexity and its precarity.
The work itself is fragmented into dozens of micro-specialisms. Some workers craft "rubrics," the detailed checklists that define what a good chatbot response looks like. Others grade chatbot answers against those rubrics. Others write "golden outputs," the ideal responses a model should aspire to produce. Still others produce "reasoning traces," step-by-step explanations of how they arrived at the golden output, written in the voice of a chatbot thinking to itself. These traces become the roadmap the model follows when it encounters similar tasks in the real world.
Then there are the "stumpers," prompts specifically designed to make models fail. Workers spend hours hunting for the gaps in a model's knowledge, the counterintuitive blind spots where it can solve advanced physics but cannot give accurate transit directions. Finding these weak spots takes genuine creativity and deep domain knowledge, which is precisely why labs are willing to pay for it.
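To make these artifacts concrete, here is a minimal sketch of how a single training record might be structured. The schema and field names are illustrative assumptions, not any platform's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    """One checklist item a grader scores a response against (hypothetical schema)."""
    description: str   # e.g. "Cites at least one primary source"
    weight: float      # relative importance in the final grade

@dataclass
class TrainingRecord:
    """Illustrative shape of a single human-produced training example."""
    prompt: str                        # the user request being modelled
    rubric: list[RubricCriterion]      # what a good answer must do
    golden_output: str                 # the expert-written ideal response
    reasoning_trace: str               # step-by-step "chatbot thinking" text
    is_stumper: bool = False           # True if the prompt was chosen to make models fail
    grades: dict[str, float] = field(default_factory=dict)  # criterion -> grader's score
```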
Mercor, founded in 2023 by three then-19-year-olds from the Bay Area, has become the poster child for this new economy. The company claims it now pays out roughly $1.5 million a day to its evaluators, with more than 30,000 experts on its platform. As of October 2025, its valuation sits at $10 billion, making its founders the youngest self-made billionaires in history. OpenAI and Anthropic are among its clients. So are six of the "Magnificent Seven" tech giants.
Scale AI claims over 700,000 annotators. Surge AI crossed $1 billion in revenue and advertises Supreme Court litigators, McKinsey principals, and platinum recording artists on its roster. Leading AI companies like OpenAI, Google, Meta, and Anthropic are each spending on the order of $1 billion per year on human-provided training data. The global AI training dataset market was valued at roughly $3.2 billion in 2025 and is projected to reach $16.3 billion by 2033.
The scale is staggering. But the human cost is what should concern us.
Digging Your Own Grave
The working conditions described by data workers paint a grim picture of the modern gig economy pushed to its logical extreme.
Workers report being added to Slack channels where new tasks arrive in unpredictable bursts. When a manager posts that work is incoming, hundreds of people drop everything and race to claim tasks before the queue empties. One worker described it as living in a fishbowl, waiting for human masters to drop in food, where only the fastest swimmers get to eat.
Surveillance software called Insightful tracks every keystroke and monitors productivity to the second. Time deemed "unproductive" can be deducted from pay. If a few minutes pass without typing, the system pings to ask whether the worker is still working. Workers report turning off the tracker to read instructions, then working off the clock to avoid going over target times. Some have been terminated without explanation. On Surge AI's platform, workers simply log in one day to find an empty dashboard, a phenomenon so common they call it "the dash of death."
Pay deteriorates over time. Nearly every worker interviewed reported that demands increased, time requirements shrank, and compensation decreased as projects continued. Those who could not keep up were "offboarded" and replaced by new recruits. One project cut its hourly rate from $45 to $16, dropping below minimum wage in several US states. Workers kept going because they needed the money.
The existential dimension makes it worse. A screenwriter put it plainly: "I'm being handed a shovel and told to dig my own grave." Every rubric written, every golden output produced, every reasoning trace documented is one more step toward making the worker's own expertise redundant. The linguist who spent a year stumping models watched the window close in real time: whatever obscure theory or Indigenous language he asked about, the model now found the correct papers. His know-how had been extracted. The project ended.
Each skill set has a shelf life, and it is getting shorter.
The $10 Billion Question: How Long Do Humans Stay in the Loop?
This is where the story gets interesting for anyone watching the AI infrastructure space. The current human data pipeline is expensive, slow, and riddled with quality problems. The question is not whether AI labs want to automate it. They are already trying.
The most significant move has already been made by Anthropic itself. Constitutional AI, the alignment technique underpinning Claude, uses a model to judge which of two sampled responses is better, then trains a preference model on the resulting dataset of AI preferences. In other words, instead of paying thousands of humans to rate chatbot responses, Anthropic trains models to rate each other against a set of written principles. The technique is called Reinforcement Learning from AI Feedback, or RLAIF, and it has become a standard method in the post-training and RLHF literature.
The economics are compelling. A single piece of human preference data costs on the order of $1 or higher, while AI feedback with a frontier model costs less than $0.01. That is a cost reduction of two orders of magnitude. For labs spending billions annually on human data, the incentive to shift is enormous.
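A minimal sketch of the RLAIF loop described above; the `judge` function here is a stand-in for a call to any frontier model, and everything is illustrative rather than Anthropic's actual implementation.

```python
import random

CONSTITUTION = [
    "Choose the response that is more helpful and honest.",
    "Choose the response less likely to cause harm.",
]

def judge(prompt: str) -> str:
    """Stand-in for a frontier-model API call; answers 'A' or 'B' at random
    so the sketch runs standalone. Swap in a real client to use it."""
    return random.choice(["A", "B"])

def ai_preference(task: str, response_a: str, response_b: str) -> dict:
    """Label one preference pair with AI feedback instead of a human rater."""
    # Constitutional AI samples a principle from the constitution per comparison.
    principle = random.choice(CONSTITUTION)
    verdict = judge(
        f"Task: {task}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        f"Principle: {principle}\n"
        "Which response better satisfies the principle? Answer A or B."
    )
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    return {"prompt": task, "chosen": chosen, "rejected": rejected}

# Each labelled pair costs a fraction of a cent of inference rather than roughly
# a dollar of human time; the pairs then train a preference model exactly as
# human-labelled pairs would.
```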
OpenAI has experimented with using GPT-4 to evaluate responses from other models, reducing the load on human evaluators. The automated grader for its GDPval benchmark agrees with human expert evaluators 66 percent of the time, approaching the 71 percent inter-rater agreement observed between the human experts themselves. Automated red-teaming frameworks are proliferating. Microsoft Research introduced AgentInstruct, a multi-agent workflow that automates the generation of synthetic training data, and demonstrated substantial improvements across benchmarks.
Gartner projects that more than 60 percent of training data will be synthetic by 2027. That is not a distant horizon. It is 18 months away.
The Model Collapse Firewall
So why are labs still spending billions on human data? One word: collapse.
Model collapse is the phenomenon where AI trained on AI-generated data gradually degrades, losing the diversity and accuracy of its outputs over successive training generations. Researchers discovered that indiscriminately learning from data produced by other models causes a degenerative process whereby models forget the true underlying data distribution. A landmark study published in Nature demonstrated the effect across multiple model types.
The implications are severe. Even the smallest fraction of synthetic data, as little as one part in a thousand, can still lead to model collapse, according to research presented at ICLR 2025 as a Spotlight paper. By April 2025, 74.2 percent of newly created webpages contained some AI-generated text, meaning the internet itself is becoming increasingly contaminated as a training source.
This is the firewall keeping humans in the loop. The ground truth of verified human expertise is what data companies are selling. When AI trains on AI, quality degrades. When humans cheat by using AI to produce their data (which surveillance software is designed to prevent), the same degradation occurs.
But here is the critical nuance: model collapse is an engineering problem, not a law of physics. When you accumulate synthetic data alongside the original real data, models stay stable across sizes and modalities. The solution is not to avoid synthetic data entirely but to manage the ratio carefully and maintain access to verified human-generated anchors.
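A toy demonstration of that accumulate-versus-replace distinction, repeatedly fitting a Gaussian to its own samples. The numbers are illustrative; collapse dynamics in large models are far messier.

```python
import numpy as np

rng = np.random.default_rng(0)

def final_std(generations: int, accumulate: bool, n: int = 100) -> float:
    """Fit a Gaussian, sample from the fit, refit on the samples; repeat."""
    real = rng.normal(0.0, 1.0, size=n)   # the original "human" data, std = 1.0
    pool = real
    for _ in range(generations):
        synthetic = rng.normal(pool.mean(), pool.std(), size=n)
        # replace: the next generation sees only synthetic data;
        # accumulate: synthetic data is added alongside everything so far
        pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
    return pool.std()

for accumulate in (False, True):
    avg = np.mean([final_std(50, accumulate) for _ in range(200)])
    label = "accumulate" if accumulate else "replace"
    print(f"{label:>10}: mean fitted std after 50 generations = {avg:.2f}")
# On average the replace-only pipeline's fitted std decays below 1.0 as tail
# diversity is lost, while the accumulating pipeline stays anchored near 1.0.
```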
Model collapse keeps humans in the training loop today. But it is a bottleneck being actively engineered around, not a permanent moat.
When Agents Enter the Pipeline
The next logical step is already visible. Agentic AI systems, models that can browse the web, execute code, verify facts, and chain together multi-step reasoning, are beginning to enter the training pipeline itself.
Consider the tasks currently done by humans in the data supply chain. Rubric writing requires domain expertise and the ability to define quality criteria. Stumper generation requires creativity and knowledge of a model's blind spots. Golden output production requires the ability to perform a task at expert level. Reasoning trace creation requires the ability to articulate a chain of thought.
Each of these is a task that agentic systems are becoming increasingly capable of performing. An agent that can search the web, cross-reference academic papers, and verify factual claims could handle a meaningful portion of stumper generation. An agent that can execute code and test outputs could write and validate golden outputs in technical domains. An agent with access to rubric templates and domain-specific evaluation criteria could draft rubrics for human review rather than human creation.
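As a sketch of that last pattern, a rubric-drafting step where the agent authors and the human merely approves might look like this; the `agent` callable and the review interface are assumptions for illustration, not any vendor's pipeline.

```python
from dataclasses import dataclass
from collections.abc import Callable

@dataclass
class DraftRubric:
    domain: str
    criteria: list[str]
    approved: bool = False   # a human signs off; they no longer write the criteria

def draft_rubric(agent: Callable[[str], list[str]], domain: str, task: str) -> DraftRubric:
    """Have an agent propose evaluation criteria for a task.
    `agent` is any callable wrapping a tool-using model (an assumption here)."""
    criteria = agent(
        f"Propose five independently checkable criteria for judging a model's "
        f"answer to this {domain} task: {task}"
    )
    return DraftRubric(domain=domain, criteria=criteria)

def human_review(rubric: DraftRubric, approve: bool) -> DraftRubric:
    """The shrinking human role: approve or reject a draft rather than author it."""
    rubric.approved = approve
    return rubric

# Usage with a stub agent so the sketch runs standalone:
stub = lambda prompt: ["Cites a primary source", "States the regulatory basis",
                       "Flags uncertainty", "No fabricated figures", "Gives an actionable next step"]
rubric = human_review(draft_rubric(stub, "compliance", "Review this KYC exception"), approve=True)
```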
We are already seeing this pattern emerge in adjacent domains. Human-in-the-loop oversight is already failing in production, with automated systems malfunctioning before humans even realise something has gone wrong. The push toward AI-native governance and evaluation is not theoretical. It is happening because the volume of AI outputs has already exceeded human capacity to review them.
The "world-building" exercises described in the Verge investigation, where teams of lawyers and consultants create fictional corporate environments to generate training data, are particularly ripe for automation. An agentic system with access to document templates, financial modelling tools, and domain-specific knowledge bases could generate thousands of synthetic corporate environments at a fraction of the cost of paying human teams for 16-hour days of fantasy document production.
The Roles That Survive (For Now)
Not all human roles in the training pipeline face the same timeline. The roles most resistant to automation share a common characteristic: they require judgment that models cannot self-verify.
A cardiologist evaluating whether a model's medical advice could harm a patient. A trial lawyer stress-testing legal reasoning against real courtroom dynamics. A compliance officer assessing whether a model's financial guidance meets regulatory requirements. These roles survive because the cost of getting it wrong is high and the feedback signal is ambiguous. There is no equivalent of "does the code compile" for medical advice.
Cultural and ethical judgment is another area where human involvement remains essential. Decisions about tone, appropriateness, and contextual sensitivity require lived experience that current models approximate but do not possess. The worker who was asked to evaluate chatbot conversations between Malaysian and Vietnamese users practising English was performing a fundamentally human assessment of cross-cultural communication.
Novel edge cases also remain stubbornly human. The stumper generation task exists precisely because models have predictable failure modes that require human creativity to expose. But even this role has a finite lifespan. As models improve at self-evaluation, the bar for "human-only" edge cases keeps rising.
The realistic picture is a progressively shrinking core of human involvement. Human judgement remains the anchor for training, but it is too expensive, slow, and capacity-constrained to scale linearly. The future is hybrid: small amounts of high-quality human data anchoring large volumes of synthetic data, with agentic systems handling the middle layers of the pipeline.
The Labour Reckoning
Three class-action lawsuits have been filed against Mercor in the past six months, accusing the company of misclassifying its AI trainers as independent contractors rather than employees. Similar suits were previously filed against Surge AI and Scale AI. The law firm Clarkson, which is pursuing multiple cases, draws direct comparisons to the early days of Uber and Lyft.
But in some ways, data workers are in a worse position than gig economy drivers. Drivers have to be physically present in a city to work, which gives them leverage to organise and push for regulation. If data workers in California pushed for better conditions, companies could simply recruit from jurisdictions with lower wage expectations. That is how the Mercor project described above could cut pay to $16 per hour, below minimum wage in several states, and still keep its workers in line.
Daron Acemoglu, professor of economics at MIT, draws a historical parallel that should give everyone pause. Before the industrial revolution, weavers were the labour aristocracy: self-employed artisans in control of their own time. Then came weaving machines, and to survive, they were forced into factory jobs with longer hours, less pay, and close management supervision. The problem was not simply that technology took their jobs. It enabled a new organisation of work that concentrated all power with the owners of capital.
The same dynamic is playing out in knowledge work. The strict confidentiality agreements that data workers sign prevent them from establishing seniority, building reputations, or leveraging their experience. They cannot prove what they have done, cannot demand what they are worth, and cannot organise collectively. The only power they have is to keep going, to get back in line.
If the humans training AI are already being treated as disposable, what happens when agents can do their jobs too?
What This Means for Payments and Commerce
For our readers in the payments and commerce space, this is not an abstract labour story. It is a preview of what is coming for financial services.
The same data pipeline that extracts expertise from screenwriters and linguists is already targeting financial analysts, compliance teams, fraud investigators, and payments specialists. Mercor's CEO has been explicit about this, stating at TechCrunch Disrupt that AI labs now hire former senior employees for their industry knowledge rather than signing expensive data contracts with the firms themselves. The example he used was Goldman Sachs.
If you work in fintech, in card network operations, in transaction monitoring, in merchant services, your domain expertise is exactly the type of data that AI labs are willing to pay billions to acquire. The question is whether that acquisition happens on your terms or theirs.
The companies building the rails for agentic commerce, the same systems we have been tracking across our coverage of Mastercard, Visa, and the broader payments ecosystem, will need training data from precisely the professionals who currently operate those systems. The rubrics that define "good" fraud detection, the golden outputs that demonstrate proper transaction categorisation, the reasoning traces that explain chargeback decisions: all of this is the raw material of the next generation of AI models.
Acemoglu's warning about the need for collective data ownership and unionlike organisations is not just about protecting screenwriters. It applies equally to every knowledge worker whose expertise is being harvested to train systems that will eventually automate their roles.
The Clock Is Ticking
The current phase of mass human data production is real, it is enormous, and it is creating genuine hardship for the people caught in it. But it is also almost certainly transitional.
The trajectory is clear. Synthetic data volumes are growing exponentially. RLAIF techniques are reducing dependence on human feedback. Agentic systems are entering the evaluation pipeline. Model collapse remains the primary constraint, and it is being actively engineered around.
The optimistic read is that this transition creates a window for policy intervention: proper worker classification, portable reputation systems, collective bargaining frameworks for data contributors, and regulatory guardrails that ensure the humans powering AI development are treated with dignity while their roles still exist.
The pessimistic read is that the window is already closing. One worker interviewed for the Verge investigation ultimately decided the precarity was not worth it and applied to work at a local coffee shop.
The AI industry talks constantly about building systems that benefit humanity. The test of that claim is not in the capabilities of the models. It is in how the industry treats the humans who make those models possible.
If the humans training AI are already being replaced by AI, what does that tell us about every other knowledge role in the pipeline, including yours?