THE LAB #26: From internal API to insights.



Internal API?

When approaching a new scraping project, a solid study phase is desirable, if not necessary. One of the first steps is to understand how the website works: if it has dynamic content, like product listings on an e-commerce site, that means understanding how this data gets loaded into the page.

Very often, this is done via APIs: depending on the page you’re loading, the front-end makes a request to an internal API endpoint and renders the results. Later in this post, we’ll see how to spot these APIs with your browser and use them to scrape data from a website.

If APIs are available and they return all the data your scraper needs, the scraper should use them. APIs are usually more stable than HTML code, they’re made to be queried (with proper throttling), and their responses carry only data, with no markup overhead, which makes the scraping lighter on both the server and the bandwidth.
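To make the idea concrete, here is a minimal sketch of a throttled, paginated API scraper. Everything specific in it is an assumption for illustration: the `API_URL` endpoint, the `page` query parameter, and the `items` key are hypothetical, and the real ones depend on what the website’s front-end actually calls.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator

# Hypothetical endpoint: the real URL and parameter names are whatever
# the website's front-end requests.
API_URL = "https://example.com/api/products?page={page}"


def fetch_json(url: str) -> dict:
    """Download one URL and decode its JSON body."""
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def iter_items(
    fetch: Callable[[str], dict] = fetch_json,
    delay_seconds: float = 1.0,
    max_pages: int = 100,
) -> Iterator[dict]:
    """Yield items page by page, pausing between requests."""
    for page in range(1, max_pages + 1):
        payload = fetch(API_URL.format(page=page))
        items = payload.get("items", [])  # "items" is an assumed key
        if not items:
            break  # an empty page is treated as the end of the listing
        yield from items
        time.sleep(delay_seconds)  # throttle to stay light on the server
```

Passing the `fetch` function as a parameter is a design choice rather than a requirement: it lets you swap in a different HTTP client or a fake for testing without touching the pagination logic.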

And when no API is available?

If there are no APIs available, we should check the HTML code and look for embedded JSON containing the data we need. It’s not unusual, especially for websites built with Next.js, to find in the HTML a tag like

<script id="__NEXT_DATA__" type="application/json">

followed by the JSON containing the data that populates the dynamic part of the web page.

This is the second-best approach for web scraping. Using the JSON embedded in the HTML code brings no advantage in bandwidth, since the full page still has to be downloaded, but it should at least be more stable than plain HTML scraping.
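As a rough sketch of this approach, the snippet below pulls the `__NEXT_DATA__` payload out of a page using only Python’s standard library. The class and function names are ours, and the shape of the decoded JSON varies from site to site, so treat the keys in the usage note as assumptions.

```python
import json
from html.parser import HTMLParser
from typing import Optional


class NextDataParser(HTMLParser):
    """Collect the raw text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    @property
    def payload(self) -> str:
        return "".join(self._chunks)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the decoded JSON, or None when the tag is missing."""
    parser = NextDataParser()
    parser.feed(html)
    parser.close()
    return json.loads(parser.payload) if parser.payload.strip() else None
```

On many Next.js sites the page data sits under `data["props"]["pageProps"]`, but where the fields you need actually live has to be checked case by case.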

Finally, if there’s neither an API nor embedded JSON available, we have no choice but to write our selectors against the plain HTML code.
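For completeness, here is a small stand-in for this last resort, written with the standard library only. It collects the text of every element carrying a given CSS class. In a real project you would more likely use a library with proper CSS or XPath selectors. The class name `price` and the sample markup in the usage note are made up.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextParser(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.results = []
        self._depth = 0    # greater than 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # nested element inside a match
        elif self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.results.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)
```

Usage: create `ClassTextParser("price")`, call `feed()` with the page’s HTML, then read `results`. This kind of code breaks as soon as the site renames a class or reshuffles its markup, which is exactly why it comes last in the order of preference.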

How to find internal APIs on a website

As we said before, when a website has dynamic content, there’s a good chance it’s loaded from an internal API, and we can spot the underlying requests using the browser.

The way I usually observe what’s happening under the hood of a website is by opening the browser’s developer tools and watching the Network tab while the page loads.
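If you prefer to sift through the recorded traffic offline, the Network tab in the major browsers can export it as a HAR file, which is plain JSON. This step is our own suggestion, not necessarily the author’s workflow. Assuming such an export, the sketch below lists every request whose response was declared as JSON; these are usually the first candidates for an internal API. The function names and the file name in the usage note are hypothetical.

```python
import json
from typing import Iterator, Tuple


def json_endpoints(har: dict) -> Iterator[Tuple[str, str]]:
    """Yield (method, url) for every recorded response declared as JSON."""
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" in mime.lower():
            request = entry["request"]
            yield request["method"], request["url"]


def load_har(path: str) -> dict:
    """Read a HAR file exported from the browser's developer tools."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Usage: `for method, url in json_endpoints(load_har("session.har")): print(method, url)`. Filtering on the MIME type is only a heuristic: some APIs return JSON labeled as `text/plain`, so a manual look at the remaining requests is still worthwhile.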

The full article is available only to paying users of the newsletter.
You can read this and other The Lab paid articles after subscribing


Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.



