The Lab #5 – Mastering Airbnb Data Extraction using GraphQL



What GraphQL is and why it is so widely used


Want to travel? Scraping Airbnb

The travel industry was one of the first to be transformed by digitalization. Booking.com, one of the largest hotel-booking websites in the world, started operating in 1997. eDreams.com, an airfare aggregator, went online in 2000. Airbnb is a fifteen-year-old online marketplace.

All these websites share a high traffic volume and a huge database of data points shown to their visitors. This means every user request should be answered as efficiently as possible, to save bandwidth and time. And it’s no coincidence that these three websites have one more thing in common: they all use GraphQL to deliver data to the front end.

What is GraphQL

We can think of GraphQL as a query language for APIs, with its own syntax and grammar. It is also the runtime engine that interprets this language and responds to the requests.

In other words, it’s a “query language” that provides a consistent query layer for APIs, giving developers a single endpoint for their requests.

This lets you not only query the data but also control the structure of each response.
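As a minimal sketch of what such a request can look like, the snippet below builds the JSON body that is usually POSTed to a GraphQL endpoint. The query itself is hypothetical: the `author`, `name`, `posts`, and `title` names are assumptions made for illustration and do not come from any real schema.

```python
import json

# Hypothetical query: type and field names are illustrative only.
# The client lists exactly the fields it wants back, nothing more.
QUERY = """
query AuthorPosts($id: ID!) {
  author(id: $id) {
    name
    posts {
      title
    }
  }
}
"""


def build_graphql_body(query: str, variables: dict) -> str:
    """Serialize a GraphQL operation as a JSON object with
    "query" and "variables" keys, ready to be sent to one endpoint."""
    return json.dumps({"query": query, "variables": variables})


body = build_graphql_body(QUERY, {"id": "42"})
print(body)
```

The shape of the query mirrors the shape of the response the client expects, which is how the caller controls the structure of the answer.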

GraphQL was developed by Facebook in 2012 and later open-sourced in 2015.


Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.


Some more details

But how does GraphQL help websites expose data more efficiently?

Modern websites have dozens, if not hundreds, of APIs, each exposing a single object with all its attributes. With a single GraphQL call, you can gather data from all the APIs you need and include only the requested fields in the output.

[Image: diagram of the blog example]

In this example from the TestProject blog, which models a blog website, we have three different APIs.

The first one lists the authors, with all their details: name, address, and birthday.

The second one is the list of the posts per author and their details: title, content, and comment list.

The last one lists the followers per author and their attributes: again name, address, and birthday.

Instead of calling the three APIs separately and receiving unwanted fields as well, the user makes a single request to a single endpoint, specifying the needed fields in the payload, and the GraphQL engine provides them.

This is possible because each object and the relationships between them are defined in a schema, written in GraphQL's schema definition language.

For those of you who have worked with relational databases, this is quite similar to designing an entity-relationship diagram. Mapping the entities in the GraphQL schema lets the engine know where each piece of information lives, so that when it receives a request, it knows which API to call to extract the fields needed to fulfill it.

The response is a JSON document containing the result of the query, with only the selected fields.
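The mechanism described above can be sketched with a toy resolver. This is not how a real GraphQL engine is implemented; it only illustrates the idea of routing a request to the right backends and trimming the result. The three `*_api` functions, their data, and the `RESOLVERS` mapping are all invented for this example.

```python
import json

# Stand-ins for the three separate APIs of the blog example.
# All names and data below are invented for illustration.
def authors_api(author_id):
    return {"name": "Ada", "address": "1 Main St", "birthday": "1990-01-01"}


def posts_api(author_id):
    return [{"title": "Hello", "content": "First post", "comments": ["Nice!"]}]


def followers_api(author_id):
    return [{"name": "Bob", "address": "2 Side St", "birthday": "1985-05-05"}]


# A toy "schema": which backend provides each relationship of an author.
RESOLVERS = {"posts": posts_api, "followers": followers_api}


def pick(obj, selection):
    """Keep only the selected fields, recursing into nested selections."""
    if isinstance(obj, list):
        return [pick(item, selection) for item in obj]
    return {
        field: pick(obj[field], sub) if sub else obj[field]
        for field, sub in selection.items()
    }


def resolve_author(author_id, selection):
    """Call only the backends the selection needs, then trim the fields."""
    author = dict(authors_api(author_id))
    for field, resolver in RESOLVERS.items():
        if field in selection:
            author[field] = resolver(author_id)
    return pick(author, selection)


# Selection equivalent to the query: { author { name posts { title } } }
selection = {"name": None, "posts": {"title": None}}
response = {"data": {"author": resolve_author(1, selection)}}
print(json.dumps(response))
```

Because the selection never mentions followers, the followers backend is not called at all, and the address, birthday, content, and comment fields are left out of the response.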

Web Scraping Implications

Because of these features, a publicly available GraphQL endpoint is the preferred way to scrape a website. We don’t overload the target website with requests for HTML pages; instead, we get exactly the data we need in JSON format, maintained by the website itself for its own use.

Let’s see how Airbnb implemented it by simulating a search for a place to stay in Manhattan from 2022-11-10 to 2022-11-17, for two adults.

[Image: screenshot of the Airbnb search]

The payload sent to the GraphQL engine will look like the following.

[Image: screenshot of the request payload]

You have surely noticed that the rawParams list holds the filters we set in the website’s search bar, while the result contains all the data shown on the page and much more.
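To make the idea concrete, here is a rough sketch of how the search filters could be turned into a rawParams-style list. Only the name rawParams comes from the payload discussed here. The `filterName`/`filterValues` entry shape, the filter keys (`query`, `checkin`, `checkout`, `adults`), and the wrapper keys around the list are assumptions for illustration, so check them against the payload you actually observe in the browser.

```python
import json


def build_raw_params(filters: dict) -> list:
    """Turn search filters into a list of name/values entries.

    ASSUMPTION: the {"filterName": ..., "filterValues": [...]} shape is
    illustrative, not a documented format.
    """
    params = []
    for name, value in filters.items():
        values = value if isinstance(value, list) else [value]
        params.append({"filterName": name, "filterValues": [str(v) for v in values]})
    return params


# The search from the article: Manhattan, 2022-11-10 to 2022-11-17, two adults.
# The filter key names are hypothetical.
search = {
    "query": "Manhattan",
    "checkin": "2022-11-10",
    "checkout": "2022-11-17",
    "adults": 2,
}
raw_params = build_raw_params(search)

# Hypothetical wrapper: the keys around rawParams are placeholders.
payload = {"variables": {"request": {"rawParams": raw_params}}}
print(json.dumps(payload, indent=2))
```

Keeping the filters in a plain dictionary makes it easy to change dates or location between requests without touching the rest of the payload.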

[Image: screenshot of the JSON response]

It seems we’re ready to implement our scraper for Airbnb.

It’s Scrapy time

The first thing is to create a Scrapy project and then define the data model for the output.
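A project skeleton is typically generated with `scrapy startproject` followed by a project name. For the data model, the following is only a sketch of what an output record might look like: the class name and every field are assumptions, not the article's actual model. It uses a plain dataclass, which Scrapy accepts as an item type, so the snippet runs without Scrapy installed.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ListingItem:
    """Hypothetical output record for one listing.

    Every field name here is an assumption made for illustration.
    """
    listing_id: str
    name: str
    city: str
    checkin: str
    checkout: str
    adults: int
    price: Optional[float] = None    # left empty when the response has no price
    currency: Optional[str] = None


item = ListingItem(
    listing_id="123",
    name="Example loft",
    city="Manhattan",
    checkin="2022-11-10",
    checkout="2022-11-17",
    adults=2,
)
print(asdict(item))
```

Declaring the fields up front makes it obvious when a response is missing a value, since optional fields stay `None` instead of silently disappearing from the output.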

The full article is available only to paying users of the newsletter.
You can read this and other The Lab paid articles after subscribing

