On web scraping platforms, open source and early days hackathons
Hi Ondra, I’m really happy to have you here, and thanks for finding time to answer my questions.
First of all, tell us a bit about yourself and what brought you to Apify.
Hi Pierluigi, happy to be here. I’m just a guy from Prague who likes fixing problems and solving challenges. I used to do that as a lawyer, but I learned to program and fell in love with it. After my first year as a professional developer, I was looking for new challenges. I was going to join a different company, but I got rejected at the last minute. So I opened a local job portal and found Apify as the first result. I saw that they did web scraping, which I found extremely fascinating thanks to my legal background, and I applied.
Frankly, I was lucky to get hired. I had very little experience as a developer back then, but the founders gave me a chance, and they’ve kept giving me chances ever since. Now, after four years, I’m the chief operating officer. But I prefer to call myself the chief debugging officer. I help fix problems in the company.
For the few who don’t know Apify, what does the company do, and what’s the value that it brings to your customers?
Apify is a cloud platform that makes it surprisingly easy to develop, deploy and run web scrapers and browser automation workflows. We want to simplify the whole web scraping lifecycle to the point where developers don’t even have to think about infrastructure, blocking, or monitoring. Everything should just work. We have a long way ahead of us before we make it perfect, but we already have over a thousand customers that trust our platform instead of AWS or other cloud providers for building and running their scrapers. It’s simply much faster to build and maintain a scraper on Apify than elsewhere.
To explain it in more detail, Apify is all about having everything you need for scraping at your fingertips. We have our open-source scraping library called Crawlee, which can scrape using both plain HTTP requests and headless browsers and automatically generates and rotates real-world fingerprints. But you don’t have to build only with Crawlee. Apify is an open platform, so you can run Scrapy as well. Or Puppeteer. Whatever you want. Then with two lines of code and one terminal command, you can deploy your scraper to the cloud and immediately run it.
The scraper will get access to our data center and residential proxies, its storage for results, and even a distributed queue for URLs to crawl. All of that automatically. No need to set up or provision anything. In the console, you can inspect the runs of your scrapers, see stats in dashboards and manage everything. Yes, you could build all of this yourself on AWS or piece it together from multiple providers, but with Apify, you don’t need to.
I’ve seen on your website that there’s a sort of marketplace for scrapers. Developers can create scrapers, and their clients can buy a subscription and run them on your systems, optionally adding proxies. This is truly great. Can you share some numbers on how this marketplace is going?
Yes, it’s called Apify Store, and it’s open to everyone. Any developer or company can build a scraper on Apify, publish it in the store and charge money for it. We currently support monthly subscriptions, but early next year, we want to let developers charge for the number of successful results as well. We even use Apify Store ourselves to publish our scrapers and offer them to customers. Currently, we have over a thousand Actors in the store. That’s what we call your apps after you deploy them to Apify. The store alone generates about 15% of our revenue. It’s our fastest-growing segment these days.
It seems web scraping is getting harder, with more anti-bot software and techniques. What’s the biggest threat you see in the near future?
You’re right. It is getting harder. But I think it’s good for us in the end. When scraping was easy, there was no space for platforms like Apify. Not too long ago, everyone could simply write a scraper using their favorite HTTP client, and it just worked. With the advent of complex modern websites, bot-protection services, and new companies focusing solely on scraping prevention, those trivial scrapers no longer cut it. But people still want their data. And for a good reason.
The internet is the largest database humankind has ever created and processing those vast amounts of data is crucial for human prosperity and progress. Publicly available data should be freely accessible to everyone, not hidden behind walls. Of course, there are limits to what you can legally scrape. I wrote an article about it if someone’s interested in the legal side of scraping. Still, despite captchas and bot-protection software, people see the value of public data and want to tap into it.
So I don’t see bot protections as a threat, merely as challengers in this cat-and-mouse game. Apify is in the anti-anti-scraping industry. The anti-scraping industry makes it more difficult and expensive to scrape websites. We make it easier, faster, and, as one of our users found, way cheaper. In a way, without them, we would have no purpose. The only threat to scraping I see is large platforms monopolizing the internet or governments trying to regulate it with antiquated pre-internet rules.
The Apify team maintains several open-source projects, some of which are really successful. In my opinion, open-sourcing internal products has some costs, but in the long run it is the best distribution channel a company can have. If you already have a growing user base for your product, you need less marketing effort to explain why your product is so cool, because people are already using it. What’s your view on this?
Thanks for the kind words. Crawlee is our most successful open-source project, with more than seven thousand stars on GitHub to date. We would love to overtake Scrapy one day and have Crawlee become the most popular web scraping library in the world.
And you’re absolutely right about open source being a terrific distribution channel. At first, we did not think of it that way; we simply wanted to share what we had built with the world. Open is one of our company values. We love open source, open internet, and open access to data. That’s why we’re in the web scraping business in the first place. So we approached it a bit naively and open-sourced our SDK (now Crawlee) from the start.
We had a different problem, though. Because the original name was Apify SDK, people thought the library could only be used on the Apify platform. But that was never the case. You could always run it anywhere. So we made some big updates and relaunched the scraping part of Apify SDK as Crawlee. The community loved it, and we gained over three thousand stars in a few weeks. It far exceeded our expectations, and we’re excited to keep improving the library for our developers. People keep coming to Apify, thanks to Crawlee, every day.
The two most common programming languages for web scraping are Python and JS. What made you prefer JS instead of Python?
From what you’re saying, Crawlee seems really cool. Can you tell us a bit more about its features?
It has all the features! I’m kidding, of course, but it’s true that we’ve got to a point with Crawlee where our scraper developers are no longer requesting big features because you can scrape pretty much anything with Crawlee. The purpose of Crawlee is to help you build reliable crawlers fast. So now we focus a lot on developer experience and ease of use.
For HTTP scraping, it uses our custom HTTP client, which mimics headers and TLS fingerprints of real browsers, so you get way better success rates than by grabbing your usual client. If that doesn’t work, you can easily switch to a headless browser with Puppeteer or Playwright. And even for those, we have stealth features and real-world fingerprints baked into the default configuration. It’s all powered by our open-source Fingerprint Suite, which generates real-world headers and fingerprints on the fly thanks to statistical modeling.
Then it handles proxy rotation for you, but not in some random or round-robin way. It pairs a proxy IP with a website’s cookies and a generated fingerprint and then uses those attributes together for optimal results. If a session gets blocked or performs poorly, it’s discarded, and a new proxy plus fingerprint combo is used.
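As a rough illustration of the session idea described above, here is a minimal sketch in Python. It is not Crawlee’s actual implementation: the `Session` and `SessionPool` classes, the `max_errors` threshold, and the proxy URLs are all hypothetical, and a randomly chosen user agent stands in for real fingerprint generation. It only shows the principle of keeping a proxy, its cookies, and a fingerprint together, and discarding the whole bundle once it performs poorly.

```python
import random
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare sessions by identity, not by field values
class Session:
    """One identity: a proxy, a fingerprint, and the cookies collected with them."""
    proxy_url: str
    fingerprint: dict
    cookies: dict = field(default_factory=dict)
    error_count: int = 0


class SessionPool:
    """Hypothetical pool that retires a session after too many failures."""

    def __init__(self, proxy_urls, user_agents, max_errors=3, seed=None):
        self._proxy_urls = list(proxy_urls)
        self._user_agents = list(user_agents)
        self._max_errors = max_errors
        self._rng = random.Random(seed)
        self.sessions = []

    def new_session(self):
        # Assumption: picking a user agent stands in for fingerprint generation.
        session = Session(
            proxy_url=self._rng.choice(self._proxy_urls),
            fingerprint={"user_agent": self._rng.choice(self._user_agents)},
        )
        self.sessions.append(session)
        return session

    def get_session(self):
        # Reuse a healthy session so proxy, cookies and fingerprint stay paired.
        if self.sessions:
            return self._rng.choice(self.sessions)
        return self.new_session()

    def mark_bad(self, session):
        # Count a failure; discard the whole session once the limit is reached.
        session.error_count += 1
        if session.error_count >= self._max_errors and session in self.sessions:
            self.sessions.remove(session)


# Usage: the same session is reused until it is marked bad often enough.
pool = SessionPool(
    ["http://proxy-a.example:8000", "http://proxy-b.example:8000"],
    ["UA-1", "UA-2"],
    max_errors=2,
)
session = pool.get_session()
session.cookies["sid"] = "abc"  # cookies stay attached to this proxy and fingerprint
```

The design choice worth noting is that nothing is rotated on its own: a blocked session takes its proxy, cookies, and fingerprint with it, so the next request starts from a clean, consistent identity.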
Then there are the usual things like automatically handling queues of URLs, storage of results as JSON or CSV, automatic management of headless browsers, TypeScript support, automatic scaling based on system resources, and various extraction and parsing utilities, like automatic discovery and enqueuing of URLs and so on. We were slowly adding those features over five years. Each time we got bored of doing something repeatedly in our crawlers, we turned it into a feature in Crawlee.
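To make the "queue of URLs" and "automatic enqueuing" points a little more concrete, here is a small sketch in Python. It is an assumption-laden toy, not Crawlee’s request queue: the `RequestQueue` class and its methods are hypothetical names, the queue lives in memory rather than being distributed, and deduplication is done only by resolving relative links and dropping URL fragments.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


class RequestQueue:
    """Hypothetical in-memory URL queue that never enqueues the same URL twice."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, url, base_url=None):
        # Resolve relative links against the page they were found on.
        absolute = urljoin(base_url, url) if base_url else url
        # Drop the #fragment so the same page is not crawled twice.
        normalized, _fragment = urldefrag(absolute)
        if normalized in self._seen:
            return False
        self._seen.add(normalized)
        self._pending.append(normalized)
        return True

    def fetch_next(self):
        # First in, first out; None signals that the crawl is finished.
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


# Usage: links discovered on a page are enqueued relative to that page.
queue = RequestQueue()
queue.add("https://example.com/list")
queue.add("/item/1#reviews", base_url="https://example.com/list")
queue.add("https://example.com/item/1")  # already seen, so it is skipped
```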
In recent market research about the web scraping industry, I’ve seen that 75% of web scraping spending goes to internal projects inside companies rather than to buying pre-scraped data from external providers. In your opinion, why does this happen? Are the companies selling datasets missing something? Or is web scraping only the first step of a long value chain that needs to integrate industry expertise before the data is ready to use?
I think there are two main reasons. The first is that, for a long time, web scraping was understood as something you don’t do publicly. It’s much better now, but the idea of web scraping being something shady still lingers, and I think many companies prefer the privacy of in-house scraping.
The second big reason is that scraping used to be easy, so there was no need to contract a specialist for it. You just told one of your devs to write a scraper, they googled a tutorial, and you had your data in no time. You also did not need any special tools for it. But now, with anti-scraping protections, headless browsers, and super complex and resource-heavy websites, companies are increasingly reaching out to specialists for help.
But the reasons you mentioned are also valid. Pre-scraped datasets are often outdated or miss some critical data points. And even if they’re good, industry expertise is absolutely vital to turning the raw scraped data into useful insights. We work very closely with our customers to make sure they get the most value out of web scraping, and we see that many companies are still in the experimentation phase. They would like to scrape to get a competitive edge, but they don’t know exactly what they’ll do with the data once they have it. We learned the hard way that the most successful projects were the ones where we established a deep partnership with the customer.
How do you see the web scraping tools industry in the future?
I’m sure machine learning will play a big role in the future of web scraping. It’s already a powerful tool to extract highly standardized data like products or articles. It will never be 100% successful, though, so for mission-critical workloads, programmers will still be needed to create highly performant, cost-effective, and reliable crawlers.
Regarding proxies, we see a lot of commoditization already. Not too long ago, only one or two companies offered residential proxies. Now they’re all over the place. Some compete with price, others with enterprise features, but I wouldn’t be surprised if a big private equity fund tried to consolidate the market in the near future.
Our bet is on stellar developer experience. We think developers will naturally gravitate to the tools and platforms where they can be the most successful, build their scrapers the fastest and spend the least time maintaining them. Some will use AI tools; others will use Crawlee, Scrapy, or maybe something completely new. But all of them will need computers to run the crawlers, proxies to connect through, storage to save their data, and good monitoring to keep their crawlers healthy. We want to give them this and then just let them build.
Any fun facts about the early days you want to share?
Oh, I wouldn’t know which one to pick. Maybe when I first showed up for an interview, opened the office door, and there were 8 people in a room the size of a closet? Or that we did not have a coffee machine because it would have been too noisy in a closet, so we made filtered coffee instead? Or that when we finally got an office with two rooms, we grabbed our tables, monitors, and chairs and personally carried them across Prague city center to the new office?
The early days were absolutely ridiculous. Every month we would go to the hometown of one of the founders for a hackathon, but the WiFi router there could not handle 10 people at the same time, so we had to turn off the WiFi on our phones and take turns connecting our laptops to the internet. I’m not sure if other people would consider this fun, but I loved the punk vibes of those times. The company is very different now, but I think we’re still keeping this hacker culture alive.