Exploring the State of Web Scraping in the AI Era

Web scraping and AI: the shovels during a gold rush

Web scraping wild west

Over the past months, thanks to the release of ChatGPT and the competitors' LLMs that followed, AI has undoubtedly become a hot topic, and not only in the tech industry.

From entire categories of workers fearing for their jobs to new businesses and roles opening up, like the prompt engineer, I cannot remember a technology that shook society so much in so little time. To be fair, AI is nothing new: in general terms, it has been around for decades, only it was mostly hidden under the hood of many products we use daily. From those magic “adjustments” in your photo app to sentence autocompletion in Gmail to Copilot on GitHub, it’s all AI.

AI has probably gone mainstream now because OpenAI released a product like ChatGPT, open to everyone, where users feel they can ask anything and get a correct answer. We’re not exactly at that point yet, but I understand the feeling of being in front of something genuinely powerful and new.

But why did I open this article talking about shovels and the gold rush? Well, in order to provide you with an answer to every question, LLMs are trained on a huge amount of data. And guess what the largest source of information is today? The Internet, of course, and web scraping is how we extract data useful for AI from it.

Just last week, thousands of subreddits went “private” because of Reddit’s update to its API pricing plan. Reddit’s API had always been free, and this enabled the creation of several volunteer-built services that helped subreddit moderators with their tasks. Charging for API usage means these tools must now pay as well, making moderators’ work much more difficult.

But why this choice by Reddit? Because of its size, the breadth of topics discussed on it, its multi-language scope, and the long-form written nature of most of its content, it’s surely one of the most important sources for LLMs like GPT-3 and GPT-4. So the new CEO realized he was gifting shovels to gold miners and tried, once most of the miners had already mined what they needed, to make them pay for future digging.

And this is exactly the point about AI: if you want to train a new model, you need data, and most of the time this data will come from web scraping.

This is the month of AI on The Web Scraping Club, in collaboration with Nimble.

During this period, we’ll learn more about how AI can be used for web scraping and its current state of the art. I’ll try some cool products and try to imagine how the future will look for the industry, and at the end of the month there will be a great surprise for all the readers.


And we could not have a better partner for this journey: AI applied to web scraping is the company’s core expertise, and I’m sure we will learn something new by the end of the month.

The web scraping momentum

The following chart from Google Trends about the term “web scraping” says it all.


AI’s need for data, combined with more and more of the economy moving online (a shift accelerated by COVID-19), makes web scraping a hot topic in its own right, even if it still flies under the radar.

When talking to other people in tech or business, have you ever had the feeling that when you mention you’re doing “web scraping”, they don’t really grasp the technical skillset it requires or the importance it has nowadays?

Web scraping is as old as the internet and has long been seen as a gray area, dangerous to touch. And probably most people have tried some project themselves, maybe with a no-code online tool, and found it easy enough to conclude that web scraping, in general, is easy.

Continuing with our analogy, we can definitely say that anyone can build a handcrafted shovel, but creating a shovel industry capable of satisfying the market’s needs is something completely different.

And here we come to web scraping’s intrinsic issues: by its nature, web scraping requires great effort when done at scale. It’s human-capital-intensive, both for writing the scrapers and for quality control, and the cost of bypassing anti-bot challenges keeps rising. There is no single tool capable of extracting data from every website in the world; today, anyone who wants to build shovels needs to start by cutting wood, forging iron, and assembling the pieces, each with varying quality in the results.

On top of that, web scraping is still not productized: while extractions of product data from e-commerce sites follow some patterns in the output data structure, every company involved in web scraping builds (or buys) its own extraction with the data model it has chosen. This leads to hundreds of scrapers targeting the same website, looking for more or less the same data.
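To make the standardization gap concrete, here is a minimal sketch (all site names, fields, and values are hypothetical) of the kind of common data model that could let one scraper’s output serve many buyers of e-commerce product data:

```python
from dataclasses import dataclass, asdict

# Hypothetical shared schema for e-commerce product data. If every
# scraper targeting the same site emitted records in this shape,
# hundreds of near-duplicate custom scrapers would become redundant.
@dataclass
class ProductRecord:
    website: str       # source website
    product_code: str  # the site's own product identifier
    title: str
    price: float
    currency: str      # ISO 4217 code, e.g. "EUR"
    in_stock: bool

# Two different websites, one common output format.
records = [
    ProductRecord("shop-a.example", "SKU-123", "Leather boots", 129.90, "EUR", True),
    ProductRecord("shop-b.example", "P-987", "Leather boots", 135.00, "USD", False),
]

for record in records:
    print(asdict(record))
```

The point is not these specific fields but the agreement on them: once buyers accept a single schema, one maintained scraper per website is enough.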

Can AI help feed the AI?

But things keep evolving and, as has happened since the beginning of human history, where there’s a challenge, some people see an opportunity.

Over the past years, many companies have started releasing tools to make web scraping developers’ lives easier. More and more tools for bypassing anti-bot challenges came to market, with good success rates, but one piece of the puzzle was still missing.

While these tools make anti-bots much less of a problem, we still need to write hundreds of web scrapers if we want to scrape hundreds of websites, and this limits the efficiency of the web scraping industry.
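To illustrate why this is a bottleneck, here is a tiny sketch (with made-up sites and CSS selectors) of the per-site configuration every traditional scraping operation ends up maintaining: even with anti-bots handled, scraping N websites still means writing and maintaining N sets of extraction rules.

```python
# Hypothetical per-site extraction rules: there is no universal parser,
# so each target website needs its own hand-written CSS selectors.
SITE_RULES = {
    "shop-a.example": {"title": "h1.product-name", "price": "span.price"},
    "shop-b.example": {"title": "div#item-title", "price": "p.cost > b"},
    # ...one more entry (and one more maintenance burden) per website
}

def rules_for(site: str) -> dict:
    """Look up the extraction rules for a site, failing loudly when
    the site has no scraper yet -- the scaling problem in a nutshell."""
    try:
        return SITE_RULES[site]
    except KeyError:
        raise KeyError(f"No scraper written for {site} yet") from None

print(rules_for("shop-a.example")["title"])  # h1.product-name
```

Every new target website means a new entry, and every site redesign silently breaks the old one: this is the work that AI-driven parsing promises to remove.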

If you’ve read the article written in June, you already know what I’m going to say: thanks to AI applied to web scraping, this situation is beginning to change. Companies like Nimble have started to release services that, leveraging AI models trained on the HTML code of websites, try to bypass anti-bot challenges and, at the same time, parse the code, extracting as much information as possible.

We have seen firsthand that these solutions perform well on Amazon and Walmart, and also on smaller websites, though with somewhat less accuracy. With further development, these solutions could lead to a future where web scraping can truly be industrialized, lowering the entry barriers and spreading its adoption.

Basically, Nimble is building the machine that enables the mass production of datasets for AI, but also for all the other use cases where web data is needed.

Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.

A more efficient web scraping industry

Let’s recap the situation. We’re at the beginning of a new gold rush (AI) where those who sell shovels (web data) can enable gold discovery. The production of shovels is actually starting to be industrialized by AI-powered tools, like Nimble’s, but there’s still a missing piece of the puzzle.

Building shovels at scale is not for everyone: if you want to rent a machine that produces 1,000 shovels a day, you need to find buyers for those shovels.

Today the web scraping market is very fragmented: there are many freelancers around, who need to attract customers on the various portals, and companies that offer managed web scraping. In both cases, the buyers set the specs, and the result is a huge market of custom shovels, many targeting the same websites, which is highly inefficient. It means much more traffic on the websites, which then need larger infrastructure; a larger share of bandwidth consumed by bots; and many human hours spent on something that someone else around the globe has already implemented. On top of that, custom work is also much more expensive than industrialized products.

Being in the industry for a decade, I’ve seen for myself how this way of working leads to great inefficiency in the market. That’s why, together with Andrea Squatrito, we decided to create Databoutique.com, a web data marketplace where people can buy and sell standardized, high-quality web data. A few sellers are in charge of maintaining a web scraper with a standardized data structure. Databoutique checks the quality of the delivered data and makes marketing efforts to multiply the sales of each seller’s dataset. With more buyers per dataset, its price can be lowered to attract even more buyers. And sellers with more money in their pockets can leverage new tools that make their life easier, like Nimble.

To conclude our analogy, Databoutique is the Home Depot for shovels. It’s placed near the main road to the gold mines, a location small shovel producers could rarely afford. Instead, they bring their shovels to Home Depot to sell and, as sales ramp up, they can afford more sophisticated machines for shovel production and produce more and more, focusing on production without worrying anymore about marketing and online presence. Buyers headed to the gold mines in the west, in turn, know there’s a place where they can find any type of shovel they need, with guaranteed quality and at the cheapest price available. And since shovel production happens more efficiently, even the Earth is happier.
