Google open-sources robots.txt parser in push to make Robots Exclusion Protocol an official standard


Google wants to turn the decades-old Robots Exclusion Protocol (REP) into an official internet standard — and it’s making its own robots.txt parser open source as part of the push.

The REP, which was proposed as a standard by Dutch software engineer Martijn Koster back in 1994, has pretty much become the de facto standard websites use to tell automated crawlers which parts of a site should not be processed. Google’s Googlebot crawler, for example, scans the robots.txt file when indexing a website to check for special instructions on which sections it should ignore; if there is no such file in the root directory, it assumes that it’s fine to crawl (and index) the whole site. These files are not always used to give direct crawling instructions, though: they can also be stuffed with certain keywords to improve search engine optimization, among other use cases.
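As a purely illustrative example (the paths and rules below are invented), a simple robots.txt file might look like this:

    User-agent: *
    Disallow: /admin/

    User-agent: Googlebot
    Disallow: /drafts/
    Allow: /drafts/public/

The first group applies to every crawler; the second overrides it for Googlebot specifically. A crawler that finds no robots.txt file at all would treat the entire site as crawlable.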

It’s worth noting that not all crawlers respect robots.txt files: the Internet Archive, for example, elected to stop honoring robots.txt directives for its Wayback Machine archiving tool a couple of years ago, while other, more malicious crawlers simply choose to ignore the REP.

While the REP is often referred to as a “standard,” it has never in fact become a true internet standard, as defined by the Internet Engineering Task Force (IETF), the internet’s not-for-profit open standards organization. And that is what Google is now pushing to change. It said that the REP, as it stands, is open to interpretation and may not always cover what Google calls “today’s corner cases.”

Defining the undefined

It’s all about better defining existing “undefined scenarios”: for example, how should a crawler treat a server failure that renders a robots.txt file inaccessible when its contents are already known from a previous scan? And how should a crawler treat a rule that contains a typo?
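A hypothetical example of the latter: in the file below, “Disallow” is misspelled, and the original 1994 document doesn’t say whether a crawler should silently drop the rule or try to honor the apparent intent.

    User-agent: *
    Dissalow: /private/    # misspelled directive; behavior here is currently undefined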

“This is a challenging problem for website owners because the ambiguous de-facto standard made it difficult to write the rules correctly,” Google wrote in a blog post. “We wanted to help website owners and developers create amazing experiences on the internet instead of worrying about how to control crawlers.”

Google said that it has partnered with the REP’s original author, Martijn Koster, along with webmasters and other search engines, to submit a proposal to the IETF covering “how REP is used on the modern web.”

The company hasn’t published the draft in full, but it did give some indication of the areas it’s focusing on:

  • Any URI-based transfer protocol can use robots.txt; it’s no longer limited to HTTP, for example, and can be used for FTP or CoAP as well.

  • Developers must parse at least the first 500 kibibytes of a robots.txt file. Defining a maximum file size ensures that connections are not open for too long, alleviating unnecessary strain on servers.

  • A new maximum caching time of 24 hours, or the cache directive value if available, gives website owners the flexibility to update their robots.txt whenever they want, while ensuring that crawlers aren’t overloading websites with robots.txt requests. In the case of HTTP, for example, Cache-Control headers could be used to determine the caching time (see the sketch after this list).

  • The specification now provisions that when a previously accessible robots.txt file becomes inaccessible due to server failures, known disallowed pages are not crawled for a reasonably long period of time.
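To make those rules concrete, here is a rough Python sketch of how a crawler could fetch and cache a robots.txt file along the lines above. The 500 KiB cap, the 24-hour default cache lifetime, the Cache-Control override and the server-failure fallback come from Google’s summary; the function names, the in-memory cache and the use of the requests library are assumptions made for this illustration, not details of the proposal.

    import re
    import time
    import requests

    MAX_BYTES = 500 * 1024        # parse at most the first 500 KiB of the file
    DEFAULT_TTL = 24 * 60 * 60    # fall back to a 24-hour cache lifetime

    _cache = {}                   # origin -> (fetched_at, ttl_seconds, body)

    def _ttl_from_headers(headers):
        # Prefer an explicit Cache-Control max-age, as the draft suggests for HTTP.
        match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
        return int(match.group(1)) if match else DEFAULT_TTL

    def fetch_robots(origin):
        """Return the robots.txt body for an origin such as 'https://example.com'."""
        now = time.time()
        cached = _cache.get(origin)
        if cached and now - cached[0] < cached[1]:
            return cached[2]                  # cached copy is still fresh
        try:
            resp = requests.get(origin + "/robots.txt", timeout=10)
        except requests.RequestException:
            # Server unreachable: keep honoring previously known rules, if any.
            return cached[2] if cached else ""
        if resp.status_code >= 500:
            # Server failure: same fallback, so known disallowed pages stay off-limits.
            return cached[2] if cached else ""
        if resp.status_code >= 400:
            return ""                         # no robots.txt: the whole site may be crawled
        body = resp.content[:MAX_BYTES].decode("utf-8", errors="replace")
        _cache[origin] = (now, _ttl_from_headers(resp.headers), body)
        return body

A real crawler would then feed the returned body into a parser and matcher, which is where the open-sourced library discussed below comes in.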

It’s also worth noting here that crawlers can interpret instructions contained within robots.txt files differently, which can lead to confusion for website owners. And that is why Google has also put the C++ library that underpins Googlebot’s parsing and matching systems on GitHub for anyone to access. Google wants developers to build their own parsers that “better reflect Google’s robots.txt parsing and matching,” according to the GitHub release notes.
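For a sense of what that parsing and matching step looks like, the snippet below uses the simple REP implementation that ships in Python’s standard library rather than Google’s C++ code; the rules, user agent and URLs are placeholders, and the standard-library parser won’t necessarily agree with Googlebot in every corner case.

    from urllib import robotparser

    # Parse a small rule set and ask whether specific URLs may be fetched.
    parser = robotparser.RobotFileParser()
    parser.parse([
        "User-agent: *",
        "Disallow: /private/",
    ])

    print(parser.can_fetch("ExampleBot", "https://example.com/private/report.html"))  # False
    print(parser.can_fetch("ExampleBot", "https://example.com/index.html"))           # True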




