On these pages, we often try new techniques and tools to bypass anti-bot software that prevents scraping, and today it's the turn of the new version of undetected-chromedriver. It goes without saying that we're talking about the legitimate use of web scraping techniques on publicly available information.
It sometimes happens that solutions presented here don't work in different contexts or at different times, which leads to a reflection I want to share with you.
The Web Scraping Triad
When building a web data pipeline, especially if we need a recurring stream of data and not just a one-off extraction, thinking only about the scraping tool is not enough.
Of course, the first aspect we need to understand is which tool is right for scraping the target website.
If there's a modern anti-bot solution in place, we'll probably need a browser automation tool like Playwright or Puppeteer, while on simpler websites a Scrapy project could be enough.
But once we've found a solution that works on our laptop, the job is not over. We must make sure the solution also works in the target scraping environment, typically a server in a datacenter. This is not to be taken for granted: anti-bots, via the browser's APIs, are able to detect the hardware where the scraper runs and block it if they find it's running from a datacenter.
In some cases, then, you need to iterate on the choice of scraping tool once you're testing your solution on the target hardware, but other times adding the right type of proxy can be enough. Running your scrapers from a datacenter but adding residential or mobile proxies may be sufficient to avoid triggering any red flags from the anti-bots, but this varies from website to website, since anti-bots can be configured to be more or less sensitive.
undetected-chromedriver 3.5 stress test
The idea for this introduction came to my mind some days ago, while reading the README file of the undetected-chromedriver GitHub repository. The main contributor was complaining that many issues were opened by people who kept in mind only the first element of the triad (the tool) but didn't care about the hardware or the need for proxies.
For this reason, I wanted to run a complete test against the usual five anti-bots we use as a benchmark (Cloudflare, Datadome, Kasada, PerimeterX, F5), in different setups.
The first run will be a test from a local environment, the second one on a datacenter without any proxy, and the third one will be from a datacenter using Smartproxy Residential proxies.
The scraper will load the main page of each website (all e-commerce sites) and browse around for some pages, to see if it gets detected after some time.
These are the five tests:
- harrods.com: load the main page, then the women's shoes page, and scroll through to the last page (page 61 today).
- footlocker.it: load the home page, go to the newborn category, and again scrape until the last page (11 today).
- canadagoose.com: load the home page, select an item category, and scroll down to the end of the page until all the items are loaded.
- neimanmarcus.com: load the home page, select the women's jeans category, and load all the pages.
- nordstrom.com: load the home page, select the women's espadrilles category, and load all the pages.
First run – undetected-chromedriver on a local environment
I've put together all these tests in a Python script called tests_local.py, which can be found in our GitHub repository, available to everyone.
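For readers who haven't used undetected-chromedriver, here's a condensed sketch of what one such test looks like. This is not the actual content of tests_local.py: the category URL, the `pageindex` query parameter, and the helper names are illustrative assumptions.

```python
import time


def page_url(base: str, page: int) -> str:
    """Build a paginated category URL (the query parameter is an assumption)."""
    return base if page == 1 else f"{base}?pageindex={page}"


def crawl_category(base: str, last_page: int) -> None:
    """Load the home page, then walk a category's pagination."""
    # Imported lazily; install with: pip install undetected-chromedriver
    import undetected_chromedriver as uc

    driver = uc.Chrome()
    try:
        driver.get("https://www.harrods.com/")
        time.sleep(5)  # give any anti-bot challenge time to resolve
        for page in range(1, last_page + 1):
            driver.get(page_url(base, page))
            time.sleep(2)  # crude pacing, to look a bit less bot-like
    finally:
        driver.quit()
```

Calling `crawl_category("https://www.harrods.com/en-gb/shopping/women-shoes", 61)` would reproduce the Harrods walk described above (the category path is also an assumption).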
- Cloudflare bypassed with a plain vanilla script
- Datadome was the hardest one to bypass, and I managed it only partially. I used the Brave browser, hoping it would leak less sensitive information, and with it I could load the home page, something I could not do with Chrome. But I could not go further and load another page: the anti-bot probably wants more "human-like" behavior, but finding a way to bypass it is out of the scope of this post. This is the behavior of websites heavily stressed by bots, typically those selling sneakers (thanks, sneaker-bot community).
- Kasada was again bypassed without any issue by a simple script; the only fancy thing needed was scrolling down to the end of the page to load all the items.
- PerimeterX and the Neiman Marcus website are tricky. Using Chrome, the spider gets stuck loading some scripts, so I used Brave; it's probably an issue with the website rather than the anti-bot. After a few tries, my IP got blocked for several minutes, a sign that in a production environment we would probably need some proxies.
- F5 is again a piece of cake to bypass with a plain script.
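The scroll-to-the-bottom step mentioned for the Kasada-protected page can be sketched as a small helper. This is my own sketch, not code from the repository; it works with any Selenium-style driver exposing `execute_script()`, undetected-chromedriver included.

```python
import time


def scroll_to_bottom(driver, pause: float = 1.5, max_rounds: int = 50) -> int:
    """Scroll until the document height stops growing, so that all
    lazy-loaded items appear. Returns the number of scrolls performed."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # wait for the next batch of items to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # height stable: nothing more to load
        last_height = new_height
    return max_rounds
```

The `max_rounds` cap is a safety net against infinite-scroll pages that never stop growing.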
Second run – undetected-chromedriver on a datacenter, without proxies
Let's repeat the previous tests, this time from a VM on an AWS server; these tests are in the file tests_remote.py.
- Cloudflare blocked us on the home page with its verification challenge.
- Kasada stopped working, showing a blank page with both Chrome and Brave when loading the home page.
The reason for all these failures is simple: AWS publishes the IP ranges of its datacenters, so it's very easy for anti-bots to understand when traffic is coming from the AWS cloud. Assuming this traffic comes from bots, they raise red flags when they see an IP belonging to those ranges.
Third run – undetected-chromedriver on a datacenter, with residential proxies
Adding an authenticated proxy to undetected-chromedriver is quite a pain, so I cheated a bit: I whitelisted my IP in the Smartproxy console so I could use just the endpoint as a proxy, without needing to pass a username and password.
The modified script is called tests_remote_withproxy.py and runs as before, from a VM in an AWS datacenter.
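With the IP whitelisted, wiring the proxy into the driver comes down to a single Chrome flag. A minimal sketch, assuming a Smartproxy-style `host:port` endpoint (the hostname below is illustrative, not a real credentialed setup):

```python
def proxy_arg(endpoint: str) -> str:
    """Chrome command-line flag for routing all traffic through a proxy."""
    return f"--proxy-server=http://{endpoint}"


def make_driver(endpoint: str):
    """Start undetected-chromedriver behind an IP-whitelisted proxy."""
    # Imported lazily; install with: pip install undetected-chromedriver
    import undetected_chromedriver as uc

    options = uc.ChromeOptions()
    # Because this machine's IP is whitelisted in the proxy console,
    # the bare endpoint works with no username/password.
    options.add_argument(proxy_arg(endpoint))
    return uc.Chrome(options=options)
```

Usage would be something like `driver = make_driver("gate.smartproxy.com:7000")`, with the endpoint taken from your own proxy dashboard.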
With this configuration, we have mixed results, depending on how much the anti-bot relies on device fingerprinting rather than IP-related data.
In fact, Cloudflare, Datadome, and Kasada blocked us even with residential proxies, while PerimeterX and F5 were bypassed.
Here’s a recap of the results of these tests.
Undoubtedly, undetected-chromedriver is a great tool, as we can see when running locally: every anti-bot tested is bypassed and, being free and open source, it greatly helps freelancers and occasional scrapers.
But when we need to scrape in a production environment, our scrapers are mostly deployed in datacenters, and increasingly advanced fingerprinting techniques are hitting these configurations hard. Even if we use residential proxies, which help us bypass PerimeterX and F5, the hardware fingerprinting is still too strong for undetected-chromedriver to handle.
Alternatively, we can check, website by website, which tool works best: for example, in our The Lab series we have seen how Cloudflare can be bypassed with Playwright.
That is challenging, but less so than changing the traditional data pipeline by deploying the scrapers not in a datacenter but on workstations hosted elsewhere.