Interview #9: A Deep Dive – Interview with Uriel Knorovich on Web Scraping

Welcome to our monthly interview. This is the first article for AI Month, in collaboration with Nimble. So there’s no better start than talking with its founder, Uriel Knorovich.

Hi Uriel, thanks for joining us at The Web Scraping Club, I’m really happy to have you here.

  • First of all, I am keen to learn about Nimble and yourself.
    Nimble stands at the forefront of web data collection innovation, pioneering the first AI-driven web scraping browser. This significant development streamlines the traditionally complex web scraping process.
    Our founding team came from Israel’s intelligence forces, where we handled highly classified cyber intelligence projects. After several years in the cyber industry, we now spearhead deep-tech initiatives in our field, bringing novel solutions and delivering real value to busy developers.
  • You’re a successful entrepreneur and an expert in this industry: what made you enter the web data industry? Why are you still working in web data extraction? As far as I can tell, building a web scraper for one website is not rocket science.
    This is a question I often encounter, especially considering our cyber-focused background. Many people were somewhat perplexed when our group of exceptional engineers transitioned from the cyber world to web scraping.
    Early on, people wondered where Nimble saw the potential for innovation. Data extraction seemed to be just straightforward Python or Node.js scripts, which is usually fine when the project scope is small, but problems arise when greater scale is needed.
    Prior to entering the data collection space, we had substantial experience navigating the complexities of web scraping as cyber engineers. There were numerous technical obstacles, and maintaining an effective scraping infrastructure was a constant cat-and-mouse game.
    We used various open-source projects, service companies, and proxy providers. To maintain access to data sources, we had to patch Puppeteer and Selenium repeatedly. Parsing XML or working with BeautifulSoup was never fun, and we always kept a standby proxy provider. In light of these many difficulties, we resolved to create the platform we had always envisioned: one that would make web scraping a more streamlined and efficient process.

This is the month of AI on The Web Scraping Club, in collaboration with Nimble.

During this period, we’ll learn more about how AI can be used for web scraping and its state of the art, I’ll try some cool products, and we’ll try to imagine how the future will look for the industry. At the end of the month, there will be a great surprise for all the readers.

And we could not have a better partner for this journey: AI applied to web scraping is the core expertise of the company and I’m sure we will learn something new by the end of the month.

  • Nimble has built an AI-powered web scraping browser. That’s an ambitious and expansive project. Can you share the story behind it?
    Our venture to build Nimble was driven by a need to address web scraping challenges on a fundamental level. Traditional browsers, often borrowed from QA spheres, struggled with efficiency and fingerprint management. We saw the need for a more reliable, long-term solution, rather than just patching existing tools like Selenium and an open CDP interface.
    Convincing our investors that building a browser from scratch was worthwhile wasn’t easy. They balked at the idea, comparing us to tech giants like Google and Apple, who had developed browsers. We knew it wouldn’t be simple, requiring intensive low-level coding, lengthy compilation times, and countless engineering hours.
    Looking back at the years spent developing Nimble, we’re proud of our innovative solutions and commitment to tackling industry challenges. Our ambitious endeavor produced incredible results, and we provide great value to the industry. It is a testament to our commitment to deep technology and solving complex problems head-on.
  • You were working on controlling this browser with AI long before all the hype around LLMs and GPT. What made you go in that direction?
    Early in our journey, we recognized how demanding the grinding, repetitive tasks were and how constantly websites and APIs change. This ongoing evolution often led to inefficiencies and frustration. Seeing the potential of AI in addressing these issues, we decided to integrate AI and, in particular, LLMs into our platform. The goal was to keep up with the dynamic web and to automate and streamline scraping.
    It wasn’t about jumping on the AI hype. Training LLMs is a complex process that requires substantial computing resources.
    AI, and specifically machine learning algorithms, appears to be the most effective solution for managing repetitive tasks. In addition, these algorithms adapt to diverse and evolving website structures. Furthermore, AI’s ability to learn and improve over time positions us to navigate the ever-changing web effectively, and we look forward to leading the charge in the industry and to exciting developments.
  • You’ve been involved with web scraping for a long time. How did you see the industry and its challenges evolve in the AI era?
    Web scraping has transformed with AI, automating many previously manual tasks and improving efficiency. However, as websites have become more dynamic and anti-bot mechanisms more sophisticated, challenges have also evolved. AI has helped navigate these through machine learning algorithms, which adapt to changing websites and manage proxies. However, the increased demand for quality data to train AI models and the need to keep up with advanced anti-bot technologies present new complexities. The future will require ongoing adaptation to these trends and challenges.
  • Scraping a website usually consists of two phases: getting to the HTML (avoiding the anti-bot, if any) and then parsing the HTML. Do your services apply AI for parsing HTML or also to avoid anti-bot measures?
    Absolutely! Nimble incorporates AI in both aspects of the scraping process. To navigate anti-bot measures, we employ AI algorithms to intelligently manage proxies, mimic human-like interactions, and adapt to varying site behaviors.
    An AI-based algorithm is at the core of the Nimble Browser fingerprint engine. This is a critical part of preventing CAPTCHA challenges from being triggered.
    When parsing HTML, AI becomes instrumental in recognizing and adapting to different website structures. This allows for more effective and efficient extraction of information. Our overarching goal is to make web scraping seamless and robust.
  • One of my concerns about using AI in web scraping is that basically, you’re using a black box. If your selector in a standard scraping project is not returning the value you desired, you can fix it. If you’re using a third-party service without any chance to review it, if it returns the wrong value, you simply cannot use that service. Is it something you’re working on?
    Absolutely, we understand your concern, and this is something we’ve seriously considered. When developing AI-powered scraping solutions, we don’t just aim for automation but also for transparency and control. It’s essential to ensure that users can diagnose and address issues when things don’t work as expected. We’re continuously refining our system to improve visibility and manual fine-tuning options. Our AI models are designed to work with you, not just for you. This enables a feedback loop where your insights can improve the system’s output. The goal is to ensure that even while benefiting from AI efficiency, users retain the ability to intervene and control the scraping process.
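
For readers new to the topic, the two phases mentioned in the questions above (getting the HTML, then parsing it) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not Nimble’s method: the names `fetch_html`, `PriceParser`, and `parse_prices`, the `price` class name, and the User-Agent string are all assumptions made for the example, and there is no anti-bot handling at all.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Phase 1: download the page (hypothetical helper, no anti-bot handling)."""
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class PriceParser(HTMLParser):
    """Phase 2: collect the text of elements whose class list contains "price".

    The "price" class is an assumed selector; on a real site you would
    inspect the page and adjust it, which is the fix-it-yourself step
    discussed in the black-box question.
    """

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.prices: list[str] = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._inside = True

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current capture.
        self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.prices.append(data.strip())


def parse_prices(html: str) -> list[str]:
    """Run the parser over an HTML string and return the captured values."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices
```

In a real run you would call `parse_prices(fetch_html(url))` with a URL you are allowed to scrape. Keeping the two phases in separate functions means a broken selector can be diagnosed and fixed without touching the download step.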

Liked the article? Subscribe for free to The Web Scraping Club to receive a new one in your inbox twice a week.

  • Let’s jump into the future: where do you think the web scraping industry will be in five years? Will web scraping programmers still be needed, or will all the programs be written by AI?
    AI will streamline web scraping, but it won’t fully replace human programmers. Think of it as a super smart sidekick that needs a human to guide the strategic decisions. So, no need to retire your programming gear; the future still needs your coffee-fueled coding brilliance!
  • My idea is that, while the adoption of web data in companies is literally booming in these years, we’re just at the start of the adoption curve, so it’s an opportunity that can’t be missed.
    Indeed, you’re spot on! We’re only at the dawn of this data-driven era, with a tremendous surge in businesses harnessing web data. This is not merely an opportunity that should not be overlooked; it is the path forward. It’s an exciting ride, and we’re all in it together, ready to ride this transformative wave.
    However, let’s not forget the gap between fascinating AI demos and real-world implementations delivering value. The journey towards consistent and cost-efficient data scraping using AI is far from over. As thrilling as the field’s advancement is, maintaining data integrity and efficiency remains a significant challenge we are eager to conquer. It’s a fast-paced ride, but there’s still a lot to cover!
  • Our usual last question: I’m sure you have a lot of funny stories about the early days of your career. Can you share some with us?
    Indeed, the early days were full of curiosity and naivety! One fine day in high school, we ambitiously decided to scrape all student names and addresses from the school website. In our enthusiasm, we overloaded the school’s website with our data requests. Later on, we learned the professional term: we had practically staged an accidental mini-DDoS attack.
    This little adventure caught the school’s attention and, unsurprisingly, our parents weren’t thrilled about it. The punishment came in the form of revoked computer privileges and endless chores. That was a crash course. Looking back, this funny hiccup reminds us of how far we’ve come and the responsibility that comes with our skills.