3 Key Steps to Consider Before Crafting Your Web Scraper



The checklist for your web scraping projects

Before you start coding your scraper, a careful analysis of the target website can save you a lot of time.

  • CHECK THE TECHNOLOGY STACK OF THE TARGET WEBSITE

I usually run two checks to get a quick understanding of the website and to identify any known anti-bot solutions.


To do so, I’ve installed the Wappalyzer Chrome extension. Under its security section, there’s usually an indication of the anti-bot software in use.

I then double-check the results with a Discord bot, developed by the team behind Puppeteer, which is installed in this server.

In this example, it shows fewer results than the Wappalyzer extension, but when the results diverge, this bot is usually right.
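As a rough complement to those two tools, the same kind of check can be sketched in a few lines of Python. This is only a heuristic under stated assumptions: the function name `guess_antibot` is hypothetical, and the marker list (header and cookie names commonly associated with each vendor) is illustrative, incomplete, and likely to change over time. It is not how Wappalyzer or the Discord bot work internally.

```python
# Heuristic sketch: map commonly seen response markers to anti-bot vendors.
# The marker lists are assumptions for illustration and are far from complete.
HEADER_MARKERS = {
    "cf-ray": "Cloudflare",
}
COOKIE_PREFIX_MARKERS = {
    "__cf_bm": "Cloudflare",
    "_abck": "Akamai",
    "bm_sz": "Akamai",
    "datadome": "DataDome",
    "_px": "PerimeterX",
    "incap_ses_": "Imperva",
    "visid_incap_": "Imperva",
}


def guess_antibot(headers, cookie_names):
    """Return a sorted list of vendors hinted at by headers and cookie names."""
    found = set()
    # Header names are case-insensitive, so normalize them first.
    lowered = {name.lower(): value for name, value in headers.items()}
    for header, vendor in HEADER_MARKERS.items():
        if header in lowered:
            found.add(vendor)
    if "cloudflare" in lowered.get("server", "").lower():
        found.add("Cloudflare")
    for cookie in cookie_names:
        for prefix, vendor in COOKIE_PREFIX_MARKERS.items():
            if cookie.lower().startswith(prefix):
                found.add(vendor)
    return sorted(found)
```

You can feed it the response headers and cookie names returned by whatever HTTP client you use. An empty result does not mean the site is unprotected, only that none of these markers showed up.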

  • LOOK FOR AN API TO GET DATA FROM

APIs are great friends of web scrapers, since they’re much more reliable and less prone to change over time. Using an API is also more efficient, which is an advantage for the target server too.


Looking for an API is quite simple: all you need is the Network tab of the browser’s Inspect developer tool. In the case of an e-commerce site, you can go to the product catalog page and check whether an API is called when you move from the first page to the second, as in the following example.

In this case, the result is a JSON containing all the products on page 2 of the men’s accessories category. Of course, APIs can also be protected by anti-bot software, so we cannot celebrate yet.
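Once you have spotted such a call in the Network tab, replicating it from a scraper could look roughly like the sketch below. Everything specific here is an assumption for illustration: the `example.com` endpoint, the `page` query parameter, and the `products` key all depend on the API you actually find, and a protected API may need extra headers or cookies.

```python
import json
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def build_page_url(base_url, page, page_param="page"):
    """Return base_url with the (assumed) page query parameter set."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query[page_param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


def parse_products(payload, items_key="products"):
    """Pull the product list out of a JSON body; the key name is an assumption."""
    data = json.loads(payload)
    return data.get(items_key, [])


def fetch_products(base_url, page):
    """Call the catalog API for one page and return its products."""
    request = Request(
        build_page_url(base_url, page),
        headers={"Accept": "application/json"},
    )
    with urlopen(request, timeout=10) as response:
        return parse_products(response.read().decode("utf-8"))
```

For example, `build_page_url("https://example.com/api/catalog?category=accessories", 2)` yields the URL for page 2, and looping over page numbers until `fetch_products` returns an empty list is one simple way to walk the catalog.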

  • LOOK FOR JSON INSIDE THE HTML

If the API calls cannot be replicated from the scraper, we can still look inside the website’s HTML code for JSON-formatted data.


Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.


We keep the main advantage of scraping JSON, which is that it is less prone to change over time, but this approach is not as efficient as calling the APIs directly, since we need to load the whole page.


In this case, the pagination JSON with all the displayed products is embedded inside an HTML tag.
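A minimal sketch of pulling that kind of embedded JSON out of a page, using only the Python standard library, might look like this. It assumes the data lives in a `<script>` tag whose `type` attribute contains `json` (for example `application/json` or `application/ld+json`); on a real site the tag and its attributes may differ, so check the page source first. The class and function names are hypothetical.

```python
import json
from html.parser import HTMLParser


class JsonScriptExtractor(HTMLParser):
    """Collect parsed contents of <script> tags whose type contains 'json'."""

    def __init__(self):
        super().__init__()
        self._in_json_script = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        script_type = dict(attrs).get("type") or ""
        if tag == "script" and "json" in script_type.lower():
            self._in_json_script = True
            self._buffer = []

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so accumulate them.
        if self._in_json_script:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_script:
            self._in_json_script = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip script bodies that are not valid JSON


def extract_embedded_json(html):
    """Return a list with one parsed object per JSON script tag found."""
    parser = JsonScriptExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```

The scraper still has to download the full HTML page first, which is the efficiency cost mentioned above, but after that the data is handled as plain JSON rather than through fragile HTML selectors.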

After these three pieces of technical advice, the next one is the most important.

Ask yourself: “Am I allowed to scrape this website? Am I breaking any privacy law or infringing any copyright by doing so?”

Only once you have answered these questions can you proceed. In the next posts, we’ll see how to frame the answers correctly, since they are not always easy to get.


 
