The Internet's Most Powerful Archiving Tool Is in Peril
Major news publishers including USA Today Co. and The New York Times are blocking the Internet Archive's Wayback Machine from archiving their content, citing fears about AI data scraping. This threatens one of the internet's most important tools for preserving the public record.
According to Wired's coverage of the escalating conflict, the Wayback Machine, which has preserved over one trillion web pages across more than 30 years of operation, is now facing systematic blocking by the very news organizations that depend on it most. The story highlights a deep contradiction at the heart of modern media: publishers are using archived web data to do accountability journalism while simultaneously preventing that same archiving from applying to their own work. Wayback Machine director Mark Graham is speaking out, and a coalition of more than 100 journalists has now signed an open letter demanding publishers reverse course.
Why This Matters
The Wayback Machine is not a niche research tool. It is foundational infrastructure for journalism, legal proceedings, and government accountability, and its degradation would create a hole in the public record that no corporate product will fill. Twenty-three major news sites are currently blocking the ia_archiverbot crawler, according to analysis by AI-detection startup Originality AI. That number sounds manageable until you realize USA Today Co. alone controls more than 200 media outlets, meaning a single corporate blocking decision wipes out archival access to hundreds of publications at once. Courts cite Wayback Machine archives as evidence, researchers depend on it to study misinformation, and journalists use it daily to verify claims. This is not an abstract digital rights debate. This is a fight over who controls the historical record.
Daily briefing from 50+ sources. Free, 5-minute read.
The Full Story
The contradiction that kicked off this latest wave of coverage is almost too on-the-nose to believe. In April 2026, USA Today published an investigative report tracking how US Immigration and Customs Enforcement had altered its detention data disclosures under the Trump administration. The reporters built their story by pulling historical ICE statistics from the Wayback Machine, comparing cached versions of government pages to document changes over time. The story was solid public interest journalism. It was also only possible because the Internet Archive exists.
Here is the problem: USA Today Co., the Gannett successor that owns both USA Today and more than 200 other media outlets, actively blocks the Wayback Machine from archiving its own content. Mark Graham, the Wayback Machine's director, called this "a little ironic," which is probably the most diplomatic framing possible. USA Today Co. spokesperson Lark-Marie Anton framed the blocking as part of a broader anti-scraping policy rather than a targeted move against the Archive specifically, but the practical effect is the same either way.
Gannett's successor is not the only publisher pulling back. The New York Times has also moved to restrict archiving, and Reddit, which is not a news organization but hosts enormous amounts of public discourse, has blocked the ia_archiverbot crawler too. The Guardian took a different approach: it does not block the crawler outright, but it excludes its content from the Internet Archive's API and filters articles out of the Wayback Machine interface, making it functionally inaccessible to most users even when technically archived. Robert Hahn, the Guardian's director of business affairs and licensing, said the outlet is in talks with the Archive over concerns that AI companies might harvest archived content sets.
That AI concern is the common thread running through all of these blocking decisions. Publishers are terrified that their content will end up in AI training datasets without compensation or permission. They have watched major lawsuits against OpenAI and other AI firms move slowly through the courts, and they have decided that limiting all potential vectors for data harvesting, including nonprofit archiving services, is preferable to waiting for legal clarity. The logic is understandable from a narrow corporate perspective and deeply damaging from a public interest one.
The response from working journalists has been swift. This week, the Electronic Frontier Foundation and Fight for the Future organized a coalition letter delivered to the Internet Archive expressing support for the Wayback Machine's preservation mission. More than 100 journalists signed it. The signatory list runs from Rachel Maddow, a television anchor with a national audience, to independent journalists including Kat Tenbarge of Spitfire News and Taylor Lorenz of User Mag. Laura Flynn, a supervising podcast producer at The Intercept and a signatory, described the Archive as an "essential tool" she has relied on throughout her career for fact-checking and surfacing audio clips. The letter itself makes a pointed argument: with local newspapers shutting down and no public library system equipped to preserve digital-only reporting, the Internet Archive is the only institution doing this work at scale.
Key Details
- Originality AI analysis found 23 major news sites currently blocking the ia_archiverbot crawler as of April 2026.
- USA Today Co. owns more than 200 media outlets and blocks the Wayback Machine across all of them.
- The Internet Archive has preserved over one trillion web pages across more than 30 years.
- More than 100 working journalists signed the coalition letter organized by the EFF and Fight for the Future.
- Signatories include Rachel Maddow, Kat Tenbarge of Spitfire News, and Taylor Lorenz of User Mag.
- The Guardian does not block the crawler but excludes content from the Archive API and Wayback Machine interface.
- Reddit implemented Wayback Machine restrictions alongside broader data licensing policies targeting AI companies.
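For readers curious about the mechanics, crawler blocks like these are typically declared in a site's robots.txt file, which well-behaved crawlers consult before fetching pages. A minimal sketch using Python's standard-library robots.txt parser; the rule set below is hypothetical and illustrative, not any publisher's actual policy:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt resembling a publisher-wide crawler block.
# "ia_archiverbot" is the crawler name cited in Originality AI's analysis;
# these rules are illustrative, not copied from any real site.
ROBOTS_TXT = """\
User-agent: ia_archiverbot
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(ROBOTS_TXT, "ia_archiverbot", "https://example.com/story"))  # False: archiving blocked
print(is_allowed(ROBOTS_TXT, "GoogleBot", "https://example.com/story"))       # True: everyone else allowed
```

A single `Disallow: /` line aimed at one user agent is all it takes to cut a crawler off from an entire domain, which is why one corporate decision can remove hundreds of outlets from the archive at once.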
What's Next
The Internet Archive is currently in direct talks with at least the Guardian over how to address publishers' AI concerns without halting preservation entirely, and those talks will likely become a template for how other negotiations unfold. Watch for whether the coalition letter prompts any public response from USA Today Co. or the New York Times, since named pressure from high-profile signatories like Maddow creates reputational stakes that a generic legal complaint does not. If no major publisher reverses course by mid-2026, expect the EFF to escalate toward formal legal or legislative advocacy, given that courts already treat Wayback Machine archives as credible evidence.
How This Compares
This situation sits inside a much larger war over data ownership that has been building since at least 2023. The New York Times filed its lawsuit against OpenAI in December 2023, arguing that training data was harvested without consent, and that lawsuit reframed how every major publisher thinks about any open access to their content. The Wayback Machine is collateral damage from that legal conflict, even though the Archive is a nonprofit that does not train AI models.
Compare this to Reddit's approach. Reddit introduced paid API access in 2023, explicitly to prevent AI companies from using its data for free, and then extended that logic to blocking the Wayback Machine. Reddit's situation is instructive because it shows how anti-AI-scraping policy tends to expand beyond its original target. What starts as a move against OpenAI ends up restricting academic researchers, journalists, and archivists who pose no commercial threat.
The broader pattern here mirrors what happened when publishers started implementing aggressive paywalls in the early 2010s. The stated goal was financial sustainability, but the practical effect was fragmenting the public record. Blocking the Wayback Machine is the archival equivalent of that mistake, and the AI news community has been watching this tension between open information access and corporate data control intensify for years. The difference now is that AI has given publishers a new justification that sounds technically sophisticated, making it harder to argue against without seeming naive about how data actually moves through the internet.
FAQ
Q: What does the Wayback Machine actually do? A: The Wayback Machine is a free tool run by the Internet Archive that takes snapshots of web pages and stores them permanently. If a news article gets deleted, a government page gets altered, or a company quietly changes its website, you can often find the original version using the Wayback Machine. It has archived over one trillion pages since launching more than 30 years ago.
Q: Why are news publishers blocking it now? A: Publishers are primarily worried that AI companies will scrape archived versions of their articles to build or improve large language models without paying for the content. Since the Archive is openly accessible, publishers see it as a potential backdoor that bypasses their paywalls and terms of service. Blocking the Archive's crawler is their way of closing that door, even though the Archive itself does not train AI systems.
Q: How does this affect regular people who use the Wayback Machine? A: If you try to access an archived version of a New York Times or USA Today article published after these blocks took effect, you will find that no new snapshots exist. For older articles already in the Archive, access depends on which restrictions each publisher has imposed. The Guardian's approach, filtering content out of the interface without fully blocking the crawler, is particularly confusing because articles may technically exist in the Archive but not appear in normal searches.
The fight over the Wayback Machine is ultimately a fight over whether the historical internet belongs to the public or to the corporations that published it, and the outcome will shape how researchers, lawyers, and journalists do their work for decades. If publishers succeed in fragmenting the archive, future accountability reporting like the USA Today ICE investigation becomes harder, or impossible, to replicate.