Mastering the Art of Scraping Robots.txt & Sitemaps

The robots file and XML sitemaps are among the most important tools for organizing the data available on the web. They tell search engines what to crawl, making page indexing more efficient.

Some days ago I read a LinkedIn post from Gergely Orosz, author of The Pragmatic Engineer, about banning the new GPTBot user agent from his blog, so that new content would not be ingested by OpenAI’s web crawlers and used to feed its AI models.

It’s a position I totally agree with, and I’ve already talked about it in this newsletter and on my LinkedIn too.

A Sitemap imagined by Midjourney

But this post also sparked some curiosity on my side: who invented the robots file, and when? Is it legal to ignore it when scraping? And what about XML sitemaps?

So I’ve done my research and discovered something I didn’t expect.

The birth of the robots file

According to Wikipedia, the standard was proposed by Martijn Koster in February 1994, while he was working for Nexor, on the www-talk mailing list, the main communication channel for WWW-related activities at the time. The idea came about after a badly configured crawler caused a denial of service on Koster’s website.

Since the first website was published on the 6th of August 1991, we can say that robots.txt (and crawlers too) is almost as old as the web itself.

It quickly became a de facto standard but, surprisingly, it was not formalized for decades: Google proposed it to the Internet Engineering Task Force in 2019, and only in September 2022 was it published as a Proposed Standard (RFC 9309).

What is a robots file?

Technically, it’s a plain text file (in fact, it’s often called the robots.txt file) placed at the root of a website’s host, for example https://example.com/robots.txt. Its primary purpose is to communicate with web crawlers (also known as “robots” or “spiders”), informing them which pages or sections of the site should be crawled and indexed by search engines and which ones should be excluded.

User-agent: *
Disallow: /private/
Disallow: /confidential.html

In this example, we’re telling all the crawlers to avoid the content inside the /private/ directory and the page confidential.html.
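To see how a crawler could read these rules, here is a minimal sketch using the robots.txt parser from Python’s standard library. The rules are the ones from the example above, parsed from a string so no network call is needed. The example.com URLs, the "my-scraper" user agent, and the helper names robots_url and is_allowed are assumptions made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Build the robots.txt location for the host that serves page_url."""
    parts = urlsplit(page_url)
    # robots.txt always lives at the root of the host, whatever the page path is.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Same rules as the example above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /confidential.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url: str, user_agent: str = "my-scraper") -> bool:
    """Return True when the parsed rules let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/blog/post.html",
        "https://example.com/private/data.html",
        "https://example.com/confidential.html",
    ):
        print(url, "->", "allowed" if is_allowed(url) else "disallowed")
```

Against a live site, the same parser can load the file itself with set_url(robots_url(page)) followed by read(), instead of parsing a string.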

The reason why websites want to hide some sections from crawlers is quite simple. Not all the content of a website is interesting for search engines: there are administration pages, forms, and temporary content that could even lead to SEO penalties if indexed.

So does every bot or scraper follow the rules in the robots file?

As mentioned before, the file contains suggested rules on how to crawl a website, but it has no way to force a scraper to follow them.

Some security researchers, for example, read the file looking for unindexed content that could give hints about the website’s structure.

Some years ago, in 2017, the Internet Archive team announced that they would rely less on robots.txt in their scraping operations because, for their purposes, changes in the robots file could compromise the integrity of the archived history.

While following the robots.txt rules is strongly recommended as a best practice for ethical web scraping (as stated, for example, by the Investment Data Standards Organization), the file is probably not legally enforceable by itself.

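That said, every popular search engine follows robots.txt rules, and even Scrapy, the popular web scraping framework, lets you choose in the project settings whether to follow them through the ROBOTSTXT_OBEY setting (the built-in default is not to follow them, BTW, although the settings.py generated by scrapy startproject turns it on).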

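ROBOTSTXT_OBEY = True  # follow robots.txt rules (Scrapy's built-in default is False)
ROBOTSTXT_PARSER = "scrapy.robotstxt.ProtegoRobotParser"  # parser backend used to read the file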

What’s the difference from XML sitemaps?

XML sitemaps were introduced by Google in 2005 and adopted soon after, in 2006, by its competitors Yahoo and Microsoft (whose search engine later became Bing).

While robots.txt is a file containing paths and rules, XML sitemaps are XML files containing the list of all the URLs available for being indexed by search engines.

It’s basically the list of all the URLs that the website owner wants to show to the web.

From a web scraping perspective, in some cases, it can be a good entry point: it’s made to be accessible to bots, and it contains all the visible URLs of a website. If we don’t care about scraping a website following its structure (breadcrumbs, pagination, product positioning on a page), it can be a good place to start our scraping project.
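As a rough sketch of using a sitemap as the entry point, the snippet below extracts the URLs from a sitemap document with Python’s standard library. It assumes the standard sitemaps.org namespace. The sample XML, the example.com URLs, and the helper names sitemap_urls and sitemaps_from_robots are made up for illustration. The second helper relies on the optional Sitemap line that many sites add to robots.txt, which is one common way (not the only one) to find where the sitemap lives; it needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_from_robots(robots_txt: str) -> list:
    """Return the sitemap URLs declared with 'Sitemap:' lines in robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemap.
    return parser.site_maps() or []


def sitemap_urls(xml_text: str) -> list:
    """Return every <loc> value found in a sitemap (or sitemap index) document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, standing in for a downloaded sitemap.xml.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2023-08-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/products/widget</loc>
  </url>
</urlset>
"""

if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

On large sites the first file is often a sitemap index, where each loc points to another sitemap rather than to a page, so a real scraper would call sitemap_urls again on each of those files.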

I hope this brief mid-summer post contained something you didn’t already know, as it did for me.

The biggest surprise for me was how the robots.txt file was created: something that a single man, involved in the early days of the web, put in place and that, thanks to its simplicity, has lasted almost 30 years.

See you next Thursday with another The Lab article, where we’ll see how to bypass some common anti-bot solutions.

Liked the article? Subscribe for free to The Web Scraping Club to receive a new one in your inbox twice a week.