
AI agents are swarming. Does your website need an llms.txt file?

Many are skeptical of a recently proposed standard that would provide AI tools with additional guidance, but tastes can change quickly.

In a 2007 interview with Oprah promoting his novel The Road, the great American author Cormac McCarthy explained his sparse use of punctuation, which he called “weird little marks”: “If you write properly, you shouldn’t have to punctuate.”

It was an unusual, though not unheard of, creative choice, particularly for a writer whose favorite novel was Moby-Dick, a work containing roughly four thousand semicolons. McCarthy’s obstinacy contributed to his legend and further distinguished his prose, but some readers, partial to em dashes and quotation marks, seemed to take his deviation personally, and their responses could grow beyond reasonable proportion. They were, after all, just “weird little marks.” Similarly, a prolonged public debate over the necessity of the Oxford comma, the comma placed before the conjunction preceding the final item in a list, shows no sign of disappearing, despite the best efforts of the rock band Vampire Weekend, in 2008, to illuminate the issue’s absurdity.

But such disputes may be driven by an understanding, shared by experts and connoisseurs in their respective fields, that details sometimes aren’t merely details. A misused or omitted comma can completely change the meaning of a sentence, and if a writer’s task is to convey information, then such missteps can handily obliterate the value of their work.

This might at least partially explain why software developers in curiously large numbers harbor such strong opinions on issues that outsiders view as inconsequential, like whether it’s preferable to use tabs or spaces when organizing computer code, which casing scheme is most comely or which text editor is most efficient. One timely and particularly arcane example is whether websites should bother to host an llms.txt file, a proposed and not widely adopted standard designed to give artificial intelligence agents guidance when they retrieve information or train their large language models.


Ray Bell, the State of Maryland’s AI and machine learning product director, last week received kudos from his followers on LinkedIn after sharing that Maryland.gov now hosts an llms.txt file. “This is pretty damn thoughtful,” commented Michael Flowers, a former senior adviser for the General Services Administration. Maryland’s new plain text file contains notes on how the site’s content is structured and how it should be used — “Content may be summarized or referenced for general informational purposes,” but do not, it warned, “infer legal, policy, or eligibility determinations beyond published content.” It contains policies on accessibility, contact information and a note on how frequently the site will be updated (“regularly”).

According to data shared by ScanGov, a service that measures how well government websites comply with standards of accessibility, performance and security, Maryland’s recent addition makes it the only state government in the nation to host an llms.txt file on its main website.

No one seems to believe that llms.txt files might be deleterious, but many claim they are redundant. Robots.txt files have long been used to provide similar information to web crawlers, the bots that help search engines index web content. Robots.txt files, which bots usually obey but aren’t required to read, primarily disclose which pages not to bother indexing and set limits on how often sites should be scanned, to avoid needlessly straining servers. Sitemaps, meanwhile, are an established means of passing on to search engines how websites are structured. Responding to a post about the novelty of llms.txt last summer, one user of a search-engine-optimization subreddit captured the community’s prevailing attitude: “I think it’s a waste of time and resources but we will probably add it to my site bc leadership loves dumb shit.”
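For readers unfamiliar with the older standard, a robots.txt file is just a short set of plain-text directives. A typical one might look like the following sketch (the paths, bot name and crawl delay here are illustrative, not drawn from any real site):

```text
# Hypothetical robots.txt for example.gov
User-agent: *
Disallow: /admin/
Disallow: /search
Crawl-delay: 10

# Block one misbehaving crawler entirely
User-agent: BadBot
Disallow: /

Sitemap: https://example.gov/sitemap.xml
```

As the article notes, compliance is voluntary: well-behaved crawlers honor these rules, but nothing technically prevents a bot from ignoring them.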

Luke Fretwell, creator of the ScanGov tool and founder of the digital-government company ProudCity, is in the skeptics’ camp: “Maybe this will change over time, but llms.txt feels like a solution looking for a problem. We have robots.txt that can do all the things an LLMs text file can do.” But Fretwell is less interested in what files are named than whether governments are effective at serving their constituents. A willingness to meet current web standards reflects at least one facet of competency in public service. Fretwell provided scan data showing that there are still 10 states — including Florida, West Virginia and Wyoming — that don’t even have robots.txt files on their main websites.

Backers of the project to standardize the use of llms.txt include the software developer and entrepreneur Jeremy Howard, who drafted a proposal suggesting that the new file would “coexist with current web standards” and complement robots.txt “by providing context for allowed content.” His proposal points out that robots.txt is “generally used to let automated tools know what access to a site is considered acceptable,” while llms.txt would be used “on demand when a user explicitly requests information about a topic, such as when including a coding library’s documentation in a project, or when asking a chat bot with search functionality for information,” though, he suggested, the file might eventually be used during AI model training, too.
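Howard’s proposal describes a simple markdown structure for the file: a title, a summary in a blockquote and sections of annotated links pointing agents toward a site’s most useful pages. A minimal sketch for a hypothetical state website might look like this (all names and URLs below are invented for illustration):

```text
# Example.gov

> Official website of the Example state government. Content here is
> authoritative for state services, agency contacts and public notices.

## Services
- [Renew a driver's license](https://example.gov/dmv/renew): step-by-step renewal guide
- [File state taxes](https://example.gov/taxes): deadlines, forms and e-filing

## Optional
- [Press releases](https://example.gov/news): archive of announcements
```

The idea is that an AI agent fetching this one file gets a curated reading list, rather than having to crawl and rank the whole site itself.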


John Mueller, who works for Google as its “search advocate,” last summer noted on Bluesky that “no AI system currently uses llms.txt,” but later added that it couldn’t hurt to create such a file, monitor it and see what happens. Gary Illyes, a Google Search analyst, was more prickly when pressed online by a developer for clarity on whether the file’s use was acceptable — “What do you mean if that would be ok (with me)? People can do whatever they like with their sites. I do.”

Last month, Google itself apparently added llms.txt files to several of its domains, including its developers website, but they were removed soon after being discovered. Asking Claude, Anthropic’s chatbot, if llms.txt files “do anything” returns a response aligned with attitudes online: “They can do something, but in practice their impact is still pretty limited right now. Most AI tools don’t actively look for them.” (ChatGPT’s response to the same question was: “not really.”)

A September blog post on Anthropic’s website describing how to “create effective tools for agents” indicates that the company might start taking the new file seriously, noting that “LLM-friendly documentation can commonly be found in flat llms.txt files on official documentation sites.” The post then points to an llms.txt file hosted on Claude’s technical documentation website. The link is broken.

The rift over a particular text file’s usefulness mirrors a broader debate among government leaders over the appropriate way to govern AI. Many state government officials are creating policies dedicated exclusively to managing AI use in their organizations, but some technology leaders have suggested that whatever distinguishes AI from other technologies, it isn’t enough to warrant such special treatment, and that the segregation may eventually prove foolish. Bill Smith, Alaska’s chief information officer, told this publication last October that “in six months or in two years, AI is everywhere. If we treat it as something unique, we are going to be forever chasing our tails.”

Yet AI’s growing presence is hard to ignore. Bell, Maryland’s AI product director, shared in an email that the state has received nearly 12 million requests from 15 AI crawlers over the last 30 days, and that since adding the llms.txt file last Tuesday, “we have already seen significant requests for” it and “we expect that demand to continue to grow.” He said that while robots.txt “tells web crawlers where they are allowed to go,” it “doesn’t help them understand what they find.” Maryland’s llms.txt, meanwhile, offers a “curated, machine-readable index” of “the most critical, authoritative information,” a “reading list of our most trusted content.”


Bell said Maryland’s eagerness for “AI-readiness” is a result of the state’s AI enablement strategy and roadmap, which calls on officials to “strengthen the state’s data foundations.” Setting up llms.txt, he said, was “a practical step” toward meeting the roadmap’s recommendation to establish authoritative data sources. “By clearly defining our authoritative sources in a format AI can easily digest,” he wrote, “we reduce the risk of hallucinations.”

The rise in AI use is of course not only affecting government websites, but all of them. A growing proportion of the internet’s users are getting their information not by painstakingly pecking URLs into their web browsers and searching for information manually, but by shouting commands at their phones, watches, cars, smart speakers and refrigerators. Research published by Pew last July shows that Google users who encounter AI summaries are roughly half as likely to click on links in their search results, and far more likely to immediately stop searching.

It’s as this paradigm has emerged that government’s “one door has given way to an any door philosophy,” said Crosby Burns, the chief digital and innovation officer of Marin County, California. If technology officials like Burns hold dogmatic views on technology — Android over iPhone, tabs over spaces, llms.txt over robots.txt — they’ve done an exceptional job of keeping them secret. Their collective compass is reliably set on improving outcomes, using whatever technology seems to work best. When it comes to dispensing official, government-stamped information, this means not caring how people get it, so long as it’s accurate.

Burns said he recently discovered that his county’s website, by adhering to a “security-first approach” to technology, common in government, had been blocking many benign bots from accessing much of its information and dispersing it across the many platforms and services people use. The information may technically have been available online, but for the increasingly large user base that defaults to consulting bots, it may as well have been missing. He didn’t weigh in on llms.txt, a luxury perhaps beyond reach given the urgency of his current challenges: “We were in the stop-the-bleeding phase. I view it as a code red when our constituents can’t access publicly available information.”

“I think it’s a classic example of no one has mal-intent here,” Burns said. “No one is actively working to block the public from accessing certain services, but just because of the nature of the system, the incentives have lined up such that you just can’t access certain information in now increasingly commonplace ways. Everyone wants a chatbot, but you need that base layer of information architecture that speaks the language of the constituent, otherwise you’re just going to be layering bad technology on bad information.”


It’s been a daunting task, and one not yet complete, to remediate each software platform complicit in cordoning off the county’s information, he said, but the job of more lasting importance has been “thinking about the top of the funnel”: a policy for vendors that maintains security, but that, for the sake of ensuring the public can access information, doesn’t put it first. In an age of constant cyberattacks, Burns’ suggestion is a bold one.

Editor’s note: Luke Fretwell is a former employee of Scoop News Group.
