2023’s Best Web Scraping GitHub Repositories



An incomplete but still useful list of interesting resources on web scraping

The open-source community has contributed significantly to the web scraping industry, giving the public access to a wide range of tools and resources. In this article, we'll look at some of the most important GitHub repositories.


Tools for web scraping with Python

Scraping tools

Scrapy

The de facto standard for web scraping in Python. The repository has 45k stars and is maintained by Zyte. On top of it, the Splash headless browser allows Scrapy to render JavaScript, and Spidermon adds monitoring for your fleet of scrapers.

You-get

It's the most popular repository GitHub returns when you search for Python projects with the "web scraping" keyword. It's used to download non-HTML content, like videos from YouTube or images from websites.

Autoscraper

It's a scraper that, given some input examples, learns the rules for extracting the correct data and can then apply them to new URLs with the same structure. This is the first time I've seen this project, but I'll certainly test it in the coming weeks.
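To make the idea of learning extraction rules from examples more concrete, here is a minimal sketch that uses only the Python standard library. It is not AutoScraper's implementation or API: the helper names (`learn_rule`, `apply_rule`), the two sample pages, and the rule format (the chain of tags and classes leading to the example value) are assumptions made purely for illustration.

```python
from html.parser import HTMLParser


class PathRecorder(HTMLParser):
    """Record a (tag path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open elements as (tag, class) pairs
        self.records = []  # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        self.stack.append((tag, css_class))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.records.append((tuple(self.stack), text))


def text_nodes(html):
    parser = PathRecorder()
    parser.feed(html)
    parser.close()
    return parser.records


def learn_rule(html, wanted_text):
    """Return the tag path of the first text node equal to the example value."""
    for path, text in text_nodes(html):
        if text == wanted_text:
            return path
    raise ValueError(f"example {wanted_text!r} not found in the page")


def apply_rule(html, rule):
    """Return every text node that sits at the learned tag path."""
    return [text for path, text in text_nodes(html) if path == rule]


# Hypothetical pages sharing the same structure.
TRAINING_PAGE = """
<html><body>
  <h1 class="title">Blue Mug</h1>
  <span class="price">12.50</span>
</body></html>
"""
NEW_PAGE = """
<html><body>
  <h1 class="title">Red Teapot</h1>
  <span class="price">34.00</span>
</body></html>
"""

price_rule = learn_rule(TRAINING_PAGE, "12.50")  # learn from one example
prices = apply_rule(NEW_PAGE, price_rule)        # reuse on a new page
print(prices)
```

A real tool has to cope with unclosed tags, repeated elements, and pages whose layout drifts, which this sketch deliberately ignores.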

This post is sponsored by Oxylabs, your premium proxy provider. Sponsorships help keep The Web Scraping Club free, and they are a way to give some value back to the readers.


For all The Web Scraping Club readers, this link gives a 35% discount on Residential, Mobile, Web Unblocker, and Scraper API plans.

LinkedIn Scraper

A LinkedIn scraper built with Scrapy, Selenium, and Chromium.

Proxy management

Requests IP Rotator

This package routes your requests through AWS API Gateway, using the service's pool of IPs as a pool of proxies. A smart move for saving some money when you need datacenter proxies!

Scrapy Rotating Proxies

This package enables different types of proxy usage in Scrapy. An alternative is the advanced scrapy proxies package, which I wrote myself, adding several options like downloading a list of proxies from an external URL at every request and using hidden users and passwords.
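For readers new to proxy rotation, the core idea can be sketched in a few lines of plain Python. This is not the code of either package mentioned above: the `ProxyRotator` class, its method names, and the proxy URLs are hypothetical, and the only external convention assumed is the `proxies` dictionary format accepted by the `requests` library.

```python
class ProxyRotator:
    """Hand out proxies round-robin and skip the ones marked as dead."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._alive = list(proxies)
        self._index = 0

    def next_proxy(self):
        if not self._alive:
            raise RuntimeError("no working proxies left")
        proxy = self._alive[self._index % len(self._alive)]
        self._index += 1
        return proxy

    def mark_dead(self, proxy):
        # Call this after a ban or a timeout so the proxy is not reused.
        if proxy in self._alive:
            self._alive.remove(proxy)


def as_requests_proxies(proxy_url):
    """Build the mapping that requests accepts in its `proxies` argument."""
    return {"http": proxy_url, "https": proxy_url}


# Placeholder proxy URLs, not real endpoints.
rotator = ProxyRotator([
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
    "http://proxy3.example.com:8000",
])

first_round = [rotator.next_proxy() for _ in range(4)]
rotator.mark_dead("http://proxy2.example.com:8000")
second_round = [rotator.next_proxy() for _ in range(4)]
print(first_round)
print(second_round)
```

In a scraper, each outgoing request would take `as_requests_proxies(rotator.next_proxy())`, and a failed response would trigger `mark_dead` before retrying with the next proxy.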

Other useful repos

Search-Script-Scrape

As the repository says, these are 101 web scraping and research tasks for the data journalist, from the Stanford Computational Journalism Lab. The Python scripts are useful for extracting data, typically from US government and administration websites.

Cloudscraper

The most famous Python bypass for Cloudflare.

TLS-client

Brought to my attention in the comments on one of my LinkedIn posts (feel free to add me), it allows sending HTTP requests with custom TLS fingerprints.
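Why does a TLS fingerprint differ between clients at all? One of its inputs is the list of cipher suites a client offers during the handshake. The sketch below uses only Python's standard `ssl` module to show that two differently configured clients advertise different lists. It does not use TLS-client's API, and `toy_fingerprint` is a made-up helper for illustration, not a real fingerprinting scheme such as JA3, which also takes the TLS version, extensions, and elliptic curves into account.

```python
import hashlib
import ssl


def make_context(cipher_string=None):
    """Create a client-side TLS context, optionally restricting its ciphers."""
    ctx = ssl.create_default_context()
    if cipher_string is not None:
        ctx.set_ciphers(cipher_string)  # OpenSSL cipher list format
    return ctx


def cipher_names(ctx):
    """Names of the cipher suites this context is configured to offer."""
    return [cipher["name"] for cipher in ctx.get_ciphers()]


def toy_fingerprint(ctx):
    """Hash the cipher list: an illustrative stand-in, NOT a real JA3 hash."""
    joined = ",".join(cipher_names(ctx))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]


default_ctx = make_context()
custom_ctx = make_context("ECDHE+AESGCM")  # restrict the TLS 1.2 suites

print(len(cipher_names(default_ctx)), toy_fingerprint(default_ctx))
print(len(cipher_names(custom_ctx)), toy_fingerprint(custom_ctx))
```

The exact lists depend on the local OpenSSL build, so the printed values vary from machine to machine. The standard library exposes only part of what a ClientHello contains, which is why dedicated projects exist for imitating a specific browser's fingerprint.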


Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.


Tools with JavaScript

Scraping tools

Crawlee

Maintained by Apify, it's the one-stop solution for web scraping in JavaScript. It uses Playwright, Puppeteer, and Cheerio, and adds some anti-blocking features.

Puppeteer

The most famous browser automation tool is also used in web scraping. With the puppeteer-extra package, it gains superpowers against anti-bot solutions.

Playwright

Released in 2020 by Microsoft, this browser automation tool immediately gained traction in the web scraping scene. With the playwright-extra package, you have more options against anti-bot solutions. It is also available in Python.

Ayakashi

A new concept of web scraping tool that uses a SQL-like language for extracting data from the DOM.

Tools with other languages

Scraping tools

Geziyor

Geziyor is a web scraping framework for the Go language, with JS rendering, proxy management, and some other common features.

Upton

A framework for easy web scraping in Ruby.

Knowledge bases and documentation

The Web Scraping Open Knowledge Platform

It's my first attempt to gather all the info and links about web scraping from several sources, so we can consider it a condensed version of this Substack.

Browser fingerprinting

A great collection of tests and considerations about the anti-bot industry and its techniques.

Awesome Web Scraping

A list of interesting repositories on GitHub that is much more complete than this post.


I'm sure I've left out some amazing repositories. Please comment here if you want to share something I've missed with the other readers.

