The Art of Choosing the Right Proxy Service for Web Scraping



What is an IP address and how does it work

Today we’ll have a brief follow-up to the previous post, where we talked about proxies, how they work, and their different types.

I felt I’d said enough about the topic until last Thursday, when I attended the Extract Summit in London. Among the great talks I attended, the one by Neil Emeigh from Rayobyte covered some aspects of proxies and IPs that I wasn’t aware of, and I want to share them with you in this post.

An Internet Protocol address (IP address) is a numerical label, such as 192.0.2.1, assigned to a network interface that uses the Internet Protocol for communication.

An IPv4 address is a 32-bit number, usually read and written in “dot-decimal” notation, which splits the 32 bits into 4 octets, each written as a decimal number from 0 to 255 and separated from the next by a dot.

Pic By Michel Bakni – Own work, CC BY-SA 4.0
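
To make the dot-decimal notation concrete, here is a minimal Python sketch that uses the standard `ipaddress` module to turn 192.0.2.1 into its 32-bit integer and back. The helper name `to_dot_decimal` is my own, added only for illustration.

```python
import ipaddress


def to_dot_decimal(value: int) -> str:
    """Split a 32-bit integer into four octets and join them with dots."""
    # Shift each octet down to the lowest 8 bits, then mask it out.
    octets = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return ".".join(str(octet) for octet in octets)


address = ipaddress.IPv4Address("192.0.2.1")
as_int = int(address)

print(as_int)                  # the address as a single 32-bit number
print(f"{as_int:032b}")        # the same number as 32 binary digits
print(to_dot_decimal(as_int))  # back to dot-decimal: 192.0.2.1
```

The dots carry no extra information: they are just a readable way of writing the four bytes of one number.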

Due to the rising number of devices connected to the internet, the IPv6 protocol increases the address size from the 32 bits of IPv4 to 128 bits.
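
To get a feeling for what moving from 32 to 128 bits means, this small Python sketch simply computes the size of the two address spaces. The constant names are mine.

```python
IPV4_BITS = 32
IPV6_BITS = 128

# Every extra bit doubles the number of possible addresses.
ipv4_total = 2 ** IPV4_BITS
ipv6_total = 2 ** IPV6_BITS

print(f"IPv4 addresses: {ipv4_total:,}")
print(f"IPv6 addresses: {ipv6_total:.3e}")
print(f"IPv6 space is 2**{IPV6_BITS - IPV4_BITS} times larger than IPv4")
```

The IPv4 total is a bit less than 4.3 billion addresses, which is why scarcity is a real issue for this protocol.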

When a device connects to the Internet, the Internet Service Provider assigns it a free IP address, chosen among the addresses in the ranges that one of the five Regional Internet Registries has allocated to the ISP.

What is Rayobyte

Rayobyte is a proxy vendor: they sell data center, residential, and mobile proxies that can be used for scraping the web. Describing the proxy business, Neil said that every potential user of proxies should be aware of two key aspects:

  • diversity, in terms of IPs located in different subnets

  • the reputation of the IPs

Let’s see them in detail and why we should be careful about them.

Diversity is key

One thing every web scraper developer is well aware of is that we cannot make too many requests from the same IP address in a given timeframe, otherwise we get blocked.

That’s the main reason why proxy providers are used when it comes to web scraping.

But I didn’t know that some large websites heavily targeted by bots, like Google or Amazon, temporarily ban not only your IP address but also the other 255 IP addresses in your subnet.

Let’s take an example: say Amazon allows 2000 requests per hour from a given subnet.

It means that from IP 98.0.0.1 I can make 2000 requests in one hour before getting blocked. And not only will my IP be blocked: the IPs from 98.0.0.2 to 98.0.0.255 will also be blocked from requesting data from Amazon.

But this also means that if I make 1000 requests from 98.0.0.1 and 1000 from 98.0.0.2, all the addresses from 98.0.0.1 to 98.0.0.255 will be blocked just the same.
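
The example above can be sketched in a few lines of Python. This is only my guess at how a per-subnet counter could behave, not how Amazon actually implements it: I assume the site groups addresses by /24 subnet, I reuse the hypothetical limit of 2000 requests, and I ignore the one-hour time window to keep things short. The class and function names are mine.

```python
import ipaddress
from collections import Counter

SUBNET_PREFIX = 24        # assumption: the site groups IPs by /24 subnet
REQUESTS_PER_HOUR = 2000  # hypothetical limit taken from the example


def subnet_of(ip: str):
    """Return the /24 network an address belongs to."""
    return ipaddress.ip_network(f"{ip}/{SUBNET_PREFIX}", strict=False)


class SubnetQuota:
    """Count requests per subnet instead of per single IP."""

    def __init__(self, limit: int = REQUESTS_PER_HOUR):
        self.limit = limit
        self.counts = Counter()

    def record(self, ip: str, requests: int = 1) -> None:
        self.counts[subnet_of(ip)] += requests

    def is_blocked(self, ip: str) -> bool:
        # The whole subnet is blocked once its shared budget is used up.
        return self.counts[subnet_of(ip)] >= self.limit


quota = SubnetQuota()
quota.record("98.0.0.1", 1000)
quota.record("98.0.0.2", 1000)

print(quota.is_blocked("98.0.0.200"))  # same /24, never made a request
print(quota.is_blocked("98.0.1.1"))    # different /24
```

The point of the sketch is that 98.0.0.200 ends up blocked without having sent a single request, because the budget belongs to the subnet and not to the address.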

This leads to the “noisy neighbor problem”: I don’t know what the other users on the same subnet are doing, and if they are scraping Amazon too, they are “burning” part of the total number of requests I can make.

Coming back to Neil’s talk, this is why the diversity of the IPs in the rotation, both for scrapers and for proxy providers, is a key success factor in web scraping projects involving large websites.
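
A quick way to reason about diversity is to count how many distinct subnets a proxy pool really covers and to rotate across them. The following Python sketch assumes /24 subnets and a made-up pool of addresses; the function names are mine and no proxy provider API is involved.

```python
import ipaddress
from collections import defaultdict
from itertools import zip_longest


def group_by_subnet(proxies, prefix=24):
    """Group proxy IPs by the subnet they belong to."""
    groups = defaultdict(list)
    for ip in proxies:
        network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        groups[network].append(ip)
    return groups


def interleave_by_subnet(proxies, prefix=24):
    """Round-robin over subnets, so the load is spread across them."""
    groups = group_by_subnet(proxies, prefix)
    ordered = []
    for batch in zip_longest(*groups.values()):
        ordered.extend(ip for ip in batch if ip is not None)
    return ordered


# Made-up pool: five proxies, but three of them share one subnet.
pool = ["98.0.0.1", "98.0.0.2", "98.0.0.3", "203.0.113.7", "198.51.100.9"]

groups = group_by_subnet(pool)
print(len(pool), "proxies in", len(groups), "subnets")
print(interleave_by_subnet(pool))
```

A pool of five proxies that covers only three subnets is less diverse than the raw count suggests, and that is the number worth asking a provider about.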

IP addresses have a reputation

Several services maintain blacklists of IP addresses that have been used for bad actions, like spam campaigns or fraud.

Being on these lists hurts an IP’s reputation, and one of the measures that anti-bot software takes to keep bots away from websites is to check this reputation.
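
One common way such lists are published is as DNS-based blocklists (DNSBL): you reverse the octets of the address, append the zone of the list, and run a DNS lookup. This is my own illustration, not something covered in the talk, and I don't know which sources anti-bot vendors really use. The zone `dnsbl.example.org` below is a placeholder, so the sketch only builds the query name and leaves the lookup uncalled.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address on a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the name resolves, which is how a DNSBL signals a listing.

    Simplification: any resolution failure is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# Placeholder zone: replace it with a real list you are allowed to query.
print(dnsbl_query_name("192.0.2.1", "dnsbl.example.org"))
```

Running a check like this on a sample of the IPs a provider gives you is a cheap sanity test before committing to a plan.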

Some years ago, 4 million IP addresses were stolen from AFRINIC, the Regional Internet Registry for Africa, and sold on the black market to be used for fraud and spam.

As a result, these IP addresses and others in the same subnets are almost unusable for web scraping because of their low reputation, and, even when browsing, CAPTCHAs are often triggered.

This must be considered when choosing the proxy provider for a web scraping project: usually, when prices are too good to be true, the reason is the reputation of the IP addresses behind the proxies.

Fun fact: buying IPv4 addresses in 2021 performed better than the Dow Jones as an investment.


Due to their scarcity and the increasing need for IP addresses, their prices on marketplaces like Neterra Cloud are skyrocketing!

If you’re uncertain whether to invest your $1000 in the latest ape NFT or in some IPv4 addresses, I would go for the second: at least there’s a real need and an intrinsic scarcity, until the usage of IPv6 finally takes off.

Jokes aside, that’s all for today. Thanks for reading this post.

Are any of you working on something spectacular in web scraping and want to share it with us? Please write to pier@thewebscraping.club and let’s talk about it! You could be in the next interview.




Liked the article? Subscribe for free to The Web Scraping Club to receive a new one in your inbox twice a week.