Cloudflare Turnstile: What It Is & How It Affects Scraping

In September 2022, Cloudflare announced a new service called Turnstile. In the company’s vision, it is a “No Captcha” Captcha: a JavaScript challenge that distinguishes human-generated traffic from bots without requiring the user to actively interact with the website. No traffic lights, vans, or pedestrians to identify, only a script that runs in the background and does the dirty work.

This preserves the user experience on the website, but there’s also a deeper reason to prefer the Cloudflare alternative to Google’s reCAPTCHA.

As stated in the official announcement:

According to security researchers, one of the signals that Google uses to decide if you are malicious is whether you have a Google cookie in your browser, and if you have this cookie, Google will give you a higher score. Google says they don’t use this information for ad targeting, but at the end of the day, Google is an ad sales company. Meanwhile, at Cloudflare, we make money when customers choose us to protect their websites and make their services run better. It’s a simple, direct relationship that perfectly aligns our incentives.

Basically, users are not giving away their data for marketing purposes as they would with Google’s reCAPTCHA, but by using Turnstile they are (probably) contributing their data to the training of Cloudflare’s proprietary AI model. There’s no free lunch when it comes to listed companies.

Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.

How does Cloudflare’s Turnstile work?

Turnstile checks for bot behavior by choosing from a suite of browser challenges based on client behavior and telemetry. First, it runs a series of small non-interactive JavaScript challenges that gather more signals about the visitor and browser environment. Those challenges include proof-of-work, proof-of-space, probing for web APIs, and various other challenges for detecting browser quirks and human behavior. As a result, Cloudflare can fine-tune the difficulty of the challenge to the specific request and avoid ever showing a visual puzzle to a user.
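
To give an idea of what a proof-of-work challenge looks like, here is a minimal hashcash-style sketch. It is not Cloudflare’s implementation, which is not public: the function names, the use of SHA-256, and the way difficulty is expressed (number of leading zero bits) are all assumptions made for illustration. The point is the asymmetry: the client has to try many nonces, while the server checks the answer with a single hash, and raising the difficulty makes the client’s job more expensive.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        # bit_length() is the number of bits the byte actually uses.
        bits += 8 - byte.bit_length()
        break
    return bits


def solve(challenge: str, difficulty: int) -> int:
    """Client side (expensive): brute-force a nonce that meets the target."""
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side (cheap): one hash is enough to check the answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


if __name__ == "__main__":
    # Hypothetical challenge string; each extra bit of difficulty
    # roughly doubles the expected work for the client.
    nonce = solve("example-challenge", difficulty=12)
    print(nonce, verify("example-challenge", nonce, difficulty=12))
```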

Depending on how Turnstile is set up on the website, there are three main behaviors:

  • The suggested installation shows a box that the user must click only if the first tests give mixed results and Cloudflare is still in doubt.
  • There’s also a completely non-interactive installation, where you only see a box with the result of the tests.
  • Last but not least, there’s a completely invisible installation that takes a few seconds to complete all the tests and, if they fail, locks you out of the website.

Some real-world examples of Turnstile in action can be found on the website. When accessing it with a scraper, you’ll see a particular page (I think it’s called a waiting room) until Turnstile finishes all the checks on your scraper and, if needed, asks your scraper to click on the box to prove “it’s human”.

After all the checks are passed, your scraper is redirected to the home page. When browsing the website as a real user, instead, you should not see this page: the website is probably using the default Turnstile configuration.
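
From the scraper’s side, it helps to know whether a response is still the challenge page or the real content. The sketch below is a rough heuristic, not an official detection method. Cloudflare documents the `cf-mitigated: challenge` response header for challenge pages, while the status codes and body markers used as a fallback are assumptions based on what these pages commonly look like and may change over time. The function name is hypothetical.

```python
from typing import Mapping

# Assumed body markers, for illustration only.
BODY_MARKERS = (
    "just a moment",              # title commonly seen on challenge pages
    "challenges.cloudflare.com",  # host that serves the challenge scripts
)


def looks_like_challenge(status: int, headers: Mapping[str, str], body: str) -> bool:
    """Guess whether a response is a Cloudflare challenge page."""
    normalized = {name.lower(): value.lower() for name, value in headers.items()}

    # Strongest signal: the header Cloudflare sets on challenge responses.
    if normalized.get("cf-mitigated") == "challenge":
        return True

    # Fallback: a blocking status code combined with a known body marker.
    if status in (403, 503):
        text = body.lower()
        return any(marker in text for marker in BODY_MARKERS)

    return False
```

With an HTTP client such as requests, this could be called as `looks_like_challenge(r.status_code, r.headers, r.text)`. A `False` result only means no known marker was found, not that the scraper went undetected.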

How can I safely check that my scraper is able to bypass Cloudflare Turnstile?

Since applying Cloudflare Turnstile to a website is free (as said, visitors will probably send Cloudflare their fingerprint to train its AI model, there’s no free lunch), I’ve created a Cloudflare Turnstile Tester on The Web Scraping Club website.

By loading this page with your scraper, you can check whether it passes the challenge.

If you want an idea of how to create a scraper that bypasses Cloudflare, you can have a look at our article archive about it. As always, the scraper is only one of the aspects to consider, while the server environment and the IP quality and rotation make up the rest.

Cloudflare Turnstile Alternatives

As mentioned at the beginning of the article, Turnstile was created to replace Google’s reCAPTCHA, in an attempt to win some market share in the bot detection industry.

The Google solution most similar to Turnstile is reCAPTCHA v3, which performs checks in the background without requiring any interaction from the user.

The challenge returns a score to the site owner, who can take different actions on the website depending on it, like showing different pages or disabling comments.
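
As a rough sketch of how a site owner might act on that score: reCAPTCHA v3 scores range from 0.0 (very likely a bot) to 1.0 (very likely a human), but the thresholds, the action names, and the function below are hypothetical and would need tuning on real traffic.

```python
def choose_action(score: float, allow_at: float = 0.7, limit_at: float = 0.3) -> str:
    """Map a reCAPTCHA v3 score to an action (thresholds are assumptions)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score}")
    if score >= allow_at:
        return "allow"  # serve the normal page
    if score >= limit_at:
        return "limit"  # e.g. disable comments or show a reduced page
    return "block"      # e.g. show a different page or an extra check


if __name__ == "__main__":
    for score in (0.9, 0.5, 0.1):
        print(score, choose_action(score))
```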

hCaptcha is another well-known bot detection solution, which differs from the two mentioned above.

While both Turnstile and reCAPTCHA v3 rely on directly detecting bots via background challenges that require no user interaction, hCaptcha also integrates the traditional proof-of-work by the user.

Both methods, direct detection and proof-of-work, have pros and cons. While direct detection doesn’t hurt the user experience on the website, proof-of-work remains the most reliable way to limit a DDoS attack.

I hope you’ve found this article interesting, and feel free to play with the Cloudflare Turnstile Tester. Any feedback is welcome as usual, via email, comments, or our Discord server.