July 2023: Monthly Web Scraping News Digest



Welcome to the usual web scraping news recap. In this post, we’ll look at what happened in the industry during July 2023.

The mingling of web scraping and AI


AI, and more precisely the Large Language Models (LLMs) behind ChatGPT, Google Bard, and many others, left the world astonished. These are tools you can ask anything, and they return an answer, in most cases the right one. And they’re available to everyone, for free. Millions of users rushed to create an account on the ChatGPT website, which registered its first 100 million users only two months after launch.

Even if we’re at the beginning of the mass adoption of these AI tools, we’re starting to see their consequences.

Over the past sixteen months, the Stack Overflow website lost half of its traffic, as stated in this post on Reddit, probably because programmers started to use first GitHub Copilot and then ChatGPT to ask programming questions.

But since these LLMs rely on data scraped from sites like Stack Overflow to generate their answers, we face two main issues as fewer people use the site:

  • The training data for the LLMs is getting less and less accurate and timely as fewer people write about new technologies, issues, and so on.
  • AI is hurting the Stack Overflow business. If it shuts down tomorrow because of too few visitors, both people and future LLMs will lose access to this data.

These issues are common to many popular platforms, which are trying to monetize their content by limiting web scraping and selling expensive API access instead, as Twitter and Reddit are doing.

Twitter started acting more aggressively against people doing web scraping, while Reddit wanted users to pay for access to its API.

This post is sponsored by Smartproxy, the premium proxy and web scraping infrastructure focused on the best price, ease of use, and performance.


All The Web Scraping Club readers can save 10% on every purchase by using the discount code WEBSCRAPINGCLUB10.

So basically we have businesses like Reddit and Twitter sitting on top of a mountain of data and struggling to monetize it, while AI companies, via web scraping, are using this data to train their models and raise tons of money from VCs.

This situation reminds me, even with some differences, of what happened some years ago with Google News.

In 2014, Google News shut down in Spain because the government wanted Google to pay publishers for showing snippets of their news in the service. Google refused, and the service came back only eight years later.

Publishers then suffered a heavy drop in traffic to their websites (and therefore lower advertising revenue), so Google News eventually returned to Spain after regulatory changes that allow publishers to negotiate any fee directly with Google.

The pain here with AI is similar but worse. Not only is AI using web-scraped data without bringing any advantage to its source, it’s also damaging it, as is happening to Stack Overflow. If AI tools could link the sources used for the output after answering a question, that would probably help mitigate the damage, but I’m not even sure it’s feasible given how LLMs work.

In any case, the success of LLMs has put a new focus on web scraping, with an increase in data demand but also in the use of anti-bot protection techniques.


Liked the article? Subscribe for free to The Web Scraping Club to receive a new one in your inbox twice a week.


Google and the new Web-Environment-Integrity specification

We can consider the new Web-Environment-Integrity specification proposed by Google a new approach to anti-bot techniques, and it’s causing great disappointment, not only in the web scraping industry.

Before digging deeper: what’s this Web-Environment-Integrity specification?

Basically, Google would like to integrate into Chrome a proof of the “integrity” of the hardware where the browser is running.

As stated in the official GitHub repository:

With the web environment integrity API, websites will be able to request a token that attests key facts about the environment their client code is running in. For example, this API will show that a user is operating a web client on a secure Android device. Tampering with the attestation will be prevented by signing the tokens cryptographically.

Websites will ultimately decide if they trust the verdict returned from the attester. It is expected that the attesters will typically come from the operating system (platform) as a matter of practicality, however this explainer does not prescribe that. For example, multiple operating systems may choose to use the same attester. This explainer takes inspiration from existing native attestation signals such as App Attest and the Play Integrity API.

So, this solution would provide websites with an API telling them whether the browser currently in use, and the platform it is running on, are trusted by an authoritative third party (called an attester).
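To make the flow more concrete, here is a minimal sketch of the idea in Python: an attester signs a verdict about the environment, and the website checks the signature before deciding whether to trust it. This is not the Web-Environment-Integrity API. The function names, the token layout, and the verdict fields are all hypothetical, and I’m using an HMAC with a shared demo key as a stand-in for the signature; a real attester would more likely sign with a private key so that websites only need the public one.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical demo key. Assumption: a real attester would use an
# asymmetric key pair instead of a secret shared with the website.
ATTESTER_KEY = b"demo-attester-key"


def issue_token(verdict: dict, key: bytes = ATTESTER_KEY) -> str:
    """Attester side: serialize the verdict and attach a signature."""
    payload = json.dumps(verdict, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(signature).decode()
    )


def verify_token(token: str, key: bytes = ATTESTER_KEY) -> Optional[dict]:
    """Website side: return the verdict if the signature is valid, else None."""
    try:
        payload_b64, signature_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        signature = base64.urlsafe_b64decode(signature_b64)
    except ValueError:
        # Malformed token: wrong number of parts or invalid base64.
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        # The payload was altered or signed with a different key.
        return None
    return json.loads(payload)


if __name__ == "__main__":
    # Hypothetical verdict fields, for illustration only.
    token = issue_token({"platform": "android", "trusted": True})
    verdict = verify_token(token)
    # The website, not the attester, makes the final decision.
    if verdict is not None and verdict.get("trusted"):
        print("serve the page")
    else:
        print("challenge or block the client")
```

The point to notice is the last step: even with a valid signature, it is still up to the website to decide what to do with the verdict, which is exactly why a client without any attester risks being treated as untrusted by default.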

This means that hardware configurations and browsers could be left out of the web if no attester certifies them. And while for Android devices it could be Google, for Windows Microsoft, and for iOS/macOS Apple, what happens to the fragmented Unix/Linux world?

I don’t think this will become the new standard for the web (and I hope it won’t): there are too many bad consequences, and a real risk of cutting many real users off from the internet.

What surprises me is that the proposal comes from a company that was built around web crawling/scraping (like every search engine) and has just admitted, even if it was obvious, that it used a great amount of web-scraped data for Google Bard.

For sure, we’re living in thrilling times.

Create a new revenue stream for your web-scraped data

As scraping costs rise with the wider use of anti-bot solutions, revenues from web-scraped data should rise too, to keep the activity sustainable. For this reason, we created databoutique.com, a web data marketplace where you can sell your web-scraped data. If you’re already scraping websites listed on Databoutique, selling the data there too is a no-brainer. Sign up, create your seller profile, apply to sell the data, and then send the data regularly. When someone buys your dataset, you’ll receive the money in your bank account.

If your websites are not listed in the Databoutique marketplace, you can ask to add them using the brand-new leaderboard.

In any case, if you have questions or need information, you can ask on Databoutique’s Discord server.

Tickets for the Extract Summit 2023 are available

Last but not least, the dates for The Extract Summit 2023 are finally available and tickets can now be purchased. The most important live event about web scraping will take place on October 25th and 26th, with two days of workshops and talks by the most influential people in the industry. Last year we got a sneak peek at the new Zyte API and learned how AI and no-code are impacting the industry but, most of all, it was a great way to meet other experts in the industry, everyone with a story to share.

Of course, I’ll be there, and if any of you are coming, I’d be glad to meet you in person.

