A quick benchmark from a Web Scraping perspective
A bit of context
In the web scraping industry, Selenium and Playwright come up again and again whenever a headful browser scraper is needed in Python (and, of course, Puppeteer in JavaScript). It is almost ironic that two of the most used tools were built for purposes other than web scraping.
Both Selenium and Playwright are, in fact, browser automation tools, created to help front-end developers automate tests of the websites they build across different browsers. But what is a scraper if not an automated browser going around the web?
What is Selenium
As mentioned before, Selenium is an open-source automated testing framework used to validate web applications across different browsers and platforms. It's a suite with several components and modules, and you can find a great explanation of its history in this blog post by Krishna Rungta.
For our web scraping purposes, what matters most is that it supports Firefox, Edge, Safari, and Chrome via their webdrivers, which need to be installed separately. A webdriver is a control interface for the browser, a sort of "remote controller".
On a high level, a typical Selenium scraper works like the following:
The scraper issues a command through the Selenium WebDriver API.
The command is converted into an HTTP request by the JSON Wire protocol.
Each browser has its own driver, which starts a server to handle these requests.
The driver receives the request and forwards the corresponding action to the browser.
What is Playwright
Playwright is an open-source library started by Microsoft for automating Chromium, Firefox, and WebKit browsers through a single API, created by the same team that had worked on Puppeteer at Google. Originally a Node.js project, it also ships official bindings for Python. Its primary goal is to improve automated UI testing.
My two cents
Restricting the comparison to Selenium and Playwright, my personal choice falls on the latter. Its easy setup and maintenance make a difference in a large web scraping project, and the integration with packages like playwright_stealth to avoid bot detection is quite straightforward. Being able to jump from one browser to another without installing anything extra makes fixing scrapers fast and gives plenty of options. You can also drive an installed copy of Chrome with a persistent context, which means your scraper runs with a real user profile for its whole execution.
I'll leave you with this great article by Scrapfly, where you can see how Playwright works, along with some code to test it.