Proxy Servers Demystified: What They Are & Why They Matter

Definition of a proxy server

Straight from Wikipedia: “In computer networking, a proxy server is a server application that acts as an intermediary between a client requesting a resource and the server providing that resource.”

In other words, it is a server application that stands between you and your target server. You send your request to the target server through the proxy server, so the machine that actually contacts the target server is the proxy and not your client.
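To make the idea concrete, here is a minimal sketch of a client configured to send its requests through a proxy, using only Python's standard library. The proxy address is a hypothetical placeholder from a documentation IP range, not a real proxy, and the helper name `build_proxy_opener` is ours.

```python
import urllib.request


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Hypothetical proxy address: replace it with a proxy you actually control.
PROXY_URL = "http://203.0.113.10:8080"
opener = build_proxy_opener(PROXY_URL)

# With a working proxy, the target server would see the proxy's IP, not ours:
# response = opener.open("https://example.com", timeout=10)
```

The request itself is left commented out because it only works once `PROXY_URL` points to a reachable proxy.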

[Figure: Proxy Server]

Why do we need proxies for web scraping?

There are several ways proxies can be useful when scraping.

  • The target website applies some sort of geo-fencing, so it shows different data depending on the location of the IP address. This is a common case, and the usual fix is to choose a proxy with an address in the country we need the data from.

  • The target website limits the number of requests from the same IP. In this case, we’ll need a pool of proxies or a rotating proxy to spread the requests across several different IPs.

  • The target website uses anti-bot software that employs fingerprints to block traffic from servers. If our scraper is running in a cloud environment, we’ll need a residential or mobile proxy to bypass it.

  • The target website blocks all the requests coming from the IP range of our cloud provider. In this case, we’ll need an Elite Proxy to mask the origin of the requests.

Different types of proxies

As we have seen in the previous examples, proxies can be categorized according to the feature we want to highlight.

Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.

Usually, when we compare different proxy vendors or find a list of free proxies online, the main features listed are the proxy’s anonymity, its location, and whether it’s static or rotating.

Proxy anonymity

Usually, this information is provided in free online proxy lists like this one, where the proxies are divided into:

  • Transparent or Non-Anonymous: so called because they do not change the headers of the original request. On top of that, they add the “X-Forwarded-For” header, containing the origin IP, and the “Via” header, which says that a proxy is being used.

  • Anonymous: same as before, but without the “X-Forwarded-For” header.

  • Highly Anonymous or Elite: they replicate the original headers without adding anything.
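The three categories above can be told apart by looking at the headers that reach the target server. The following is a minimal sketch that relies only on the two headers named in the list; the function name `classify_proxy_anonymity` is hypothetical, and real proxy checkers may inspect other headers too.

```python
from typing import Mapping


def classify_proxy_anonymity(received_headers: Mapping[str, str]) -> str:
    """Classify a proxy from the headers seen by the target server."""
    # HTTP header names are case-insensitive, so normalize them first.
    names = {name.lower() for name in received_headers}
    if "x-forwarded-for" in names:
        return "transparent"  # the origin IP is exposed
    if "via" in names:
        return "anonymous"    # a proxy is declared, the origin IP is not
    return "elite"            # nothing reveals that a proxy is in use


# Headers as a test server might receive them (illustrative values).
transparent = {"User-Agent": "demo", "X-Forwarded-For": "198.51.100.7", "Via": "1.1 proxy"}
anonymous = {"User-Agent": "demo", "Via": "1.1 proxy"}
elite = {"User-Agent": "demo"}

print(classify_proxy_anonymity(transparent))
print(classify_proxy_anonymity(anonymous))
print(classify_proxy_anonymity(elite))
```

In practice, you would send a request through the proxy to an endpoint that echoes back the headers it receives, then pass them to this function.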

Proxy location

Every IP can be traced back to a country of origin. Since e-commerce websites can geo-fence any IP, proxy providers usually sell proxies located in the main countries of the world.

On top of that, the most important players in the field let their customers choose where the proxy is hosted: in a data center, on a mobile phone, or on a residential IP.

Data center IPs are the most common and the cheapest, but they are also likely to get banned by websites that use strong fingerprinting techniques to block bots, or that block IPs in a well-known range of data center addresses.

A couple of tests show how easy it is to detect a data center IP.

I ran a test by setting up a Highly Anonymous Proxy on an AWS EC2 instance, and this is the result from the IP API.

[Figure: IP API result for the proxy running on the EC2 instance]

As you can see, by combining data from several sources and public registries, it is easy to detect that the request is coming from an EC2 instance.

The TCP/IP fingerprint test also gives interesting results.

[Figure: TCP/IP fingerprint test results when browsing from a Mac without proxies]

These are the results when browsing from my Mac without using proxies. It can be detected with a fair degree of certainty that I’m using macOS, because every operating system uses different values inside the TCP handshake packets sent when establishing a connection with a server.

In fact, when I turn on the proxy on my Unix server, the results of the test change, and the reported uptime value is a red flag signaling that the request is coming from a server and not from a personal client.

[Figure: TCP/IP fingerprint test results with the proxy on the Unix server turned on]

To better mimic a real user sending requests to a website, the most important proxy providers offer proxies with residential IPs, that is, IPs located outside data centers, as well as mobile proxies.

Of course, due to their scarcity and the cost of bandwidth, these options are more expensive than data center proxies.
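As an illustration of choosing a proxy by location, here is a minimal sketch that filters a pool by country and by type (data center, residential, or mobile). The `Proxy` class, the `select_proxies` helper, and every address in `POOL` are hypothetical; real providers usually expose this choice through their own dashboards or connection parameters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Proxy:
    url: str
    country: str  # two-letter country code, e.g. "IT"
    kind: str     # "datacenter", "residential" or "mobile"


def select_proxies(pool: List[Proxy], country: str, kind: Optional[str] = None) -> List[Proxy]:
    """Return the proxies located in `country`, optionally restricted to one kind."""
    wanted = country.upper()
    return [
        p for p in pool
        if p.country == wanted and (kind is None or p.kind == kind)
    ]


# Hypothetical pool: the addresses come from documentation-only IP ranges.
POOL = [
    Proxy("http://192.0.2.10:8080", "IT", "datacenter"),
    Proxy("http://198.51.100.20:8080", "IT", "residential"),
    Proxy("http://203.0.113.30:8080", "US", "mobile"),
]

italian_residential = select_proxies(POOL, "it", kind="residential")
```

A scraper targeting a geo-fenced Italian website protected by anti-bot software would then pick its proxy from `italian_residential`.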

Static Vs Rotating

Using a static proxy means that all your requests are made from the same IP, the one of the proxy. This doesn’t help if the website limits the number of requests per IP in a given timeframe.

The solution is a rotating proxy: you set up your scraper as if you were using a single proxy but, in the background, every request sent to its address is routed through a pool of proxies.

The target website will see these requests coming from different IPs and machines, so it is much less likely to trigger a block.
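A commercial rotating proxy does the rotation on the provider’s side, behind a single address. The same idea can be sketched on the client side with a simple round-robin over a pool; the class name `RoundRobinProxyPool` and the proxy addresses below are hypothetical. The returned dictionary follows the `{"http": ..., "https": ...}` mapping accepted by the `proxies` argument of the `requests` library.

```python
import itertools
from typing import Dict, Sequence


class RoundRobinProxyPool:
    """Hand out proxies from a fixed pool, one after the other, in a loop."""

    def __init__(self, proxy_urls: Sequence[str]) -> None:
        if not proxy_urls:
            raise ValueError("the proxy pool cannot be empty")
        self._cycle = itertools.cycle(proxy_urls)

    def next_proxies(self) -> Dict[str, str]:
        """Return the proxy settings to use for the next request."""
        url = next(self._cycle)
        return {"http": url, "https": url}


# Hypothetical proxy addresses from documentation-only IP ranges.
pool = RoundRobinProxyPool([
    "http://192.0.2.10:8080",
    "http://198.51.100.20:8080",
    "http://203.0.113.30:8080",
])

# Each request gets the next proxy in the pool, for example:
# requests.get(url, proxies=pool.next_proxies(), timeout=10)
first_three = [pool.next_proxies()["http"] for _ in range(3)]
fourth = pool.next_proxies()["http"]  # wraps around to the first proxy
```

Round-robin is only one possible policy: random choice, or skipping proxies that were recently blocked, fits the same interface.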

Where to find the right proxy for me?

There ain’t no such thing as a free lunch, even in this field.

For a list of free proxies, you can have a look here, but they tend to be unreliable and slow.

If you’re willing to pay for proxies, there are many players in this industry. The most famous are:

Key Takeaways

That’s all for today. Please let me know in the comments what you think about this post, and feel free to join our community of web scraping professionals on Discord if you want a more direct interaction.

Are any of you working on something spectacular in web scraping and want to share it with us? Please write to us and let’s talk about it! You could be in the next interview.
