Navigating the Shift: Embracing AI in Web Scraping

To continue our journey through AI month, today Uriel Knorovich of Nimble explains the company's approach to AI and how AI can be trained for web scraping. It's an insightful piece that I recommend reading.


Ever since the value of web data was recognized, a cat-and-mouse game has been underway. Web scrapers devise new ways to overcome anti-bot defenses, and new defenses are then deployed to counter them. The challenge grew so complex that it became obvious to us at Nimble that a new, fundamental technology was needed if web scraping at scale was going to remain feasible.

Our vision was to reimagine what a browser could be and to infuse it with the technology that was already dramatically reshaping every aspect of human life: AI.

The Past: Remembering Traditional Methods

Web scraping initially rose to prominence on the back of browser automation tools such as Selenium, Puppeteer, and Playwright, which drive traditional headless browsers. They were the pioneers, allowing us to automate web browsing and extract valuable data at an unprecedented scale.

However, the ride wasn’t always smooth. These methods presented some formidable challenges. Anti-bot defenses quickly found a multitude of methods to detect if a client was using an automated browser, and blocks became increasingly frequent. Fingerprint leaks became increasingly complex, nuanced, and difficult to plug.
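To make the detection problem concrete, here is a minimal Python sketch of the kind of checks an anti-bot script might run on attributes reported by the client. It is an illustration, not any vendor's actual logic: the `ClientSignals` class and `automation_flags` function are hypothetical names, and the four checks are commonly cited heuristics (the `navigator.webdriver` flag, the `HeadlessChrome` token in the default headless Chrome User-Agent, an empty plugin list, and an empty language list).

```python
from dataclasses import dataclass, field


@dataclass
class ClientSignals:
    """Attributes a page script could report back about the visiting browser."""
    user_agent: str
    webdriver: bool      # value of navigator.webdriver
    plugin_count: int    # length of navigator.plugins
    languages: list[str] = field(default_factory=list)  # navigator.languages


def automation_flags(signals: ClientSignals) -> list[str]:
    """Return the names of the illustrative checks this client fails."""
    flags = []
    if signals.webdriver:
        flags.append("navigator.webdriver is true")
    if "HeadlessChrome" in signals.user_agent:
        flags.append("headless token in User-Agent")
    if signals.plugin_count == 0:
        flags.append("no plugins reported")
    if not signals.languages:
        flags.append("empty navigator.languages")
    return flags
```

Each check is trivial to patch on its own; the difficulty described above comes from the number of such signals and from the need to keep all of them mutually consistent.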

These browsers also consumed substantial computing resources, leading to additional infrastructural requirements and subsequent costs. To stay in the game, we had to frequently implement patches, a time-consuming endeavor that diverted attention from actually applying data-driven analysis and insights.

The Need: A Web Scraping Browser

Having experienced these challenges firsthand, the team at Nimble envisioned the next evolution of automated browsers. Our dream browser would be able to:

  • Control and manage its fingerprint intelligently
  • Integrate seamlessly with proxies to overcome geo-restrictions
  • Perform at a high level and scale to handle any workload
  • Possibly even parse HTML dynamically with AI

The tool we imagined would be robust and smart, capable of adapting to the ever-changing landscape of the massive public web.

The Technology: Introducing the AI Engine

Having worked many years to realize our vision, Nimble is proud to introduce the Nimble Browser – powered by our in-house AI Fingerprinting engine.

The AI engine at the heart of the Nimble Browser harnesses an expansive dataset generated by Nimble Lab to select the best fingerprints for a request. It anticipates and responds to the dynamic challenges presented by anti-bot systems, effectively combating blocking issues.

This is made possible by the AI engine’s ability to exert control over every parameter within the fingerprinting scope, including the operating system, browser graphics, extensions, installed fonts, sub-pixel hinting, GPU stats, and more.
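As a rough illustration of what controlling "every parameter" implies, the Python sketch below models a fingerprint as a single record and checks one kind of internal inconsistency. Everything here is an assumption made for illustration: the `Fingerprint` fields simply mirror the parameters listed above, the `inconsistencies` helper is hypothetical, and the font table holds example values only. It does not describe how the Nimble engine represents or validates fingerprints.

```python
from dataclasses import dataclass

# Illustrative values only: fonts that ship by default with one OS family.
OS_EXCLUSIVE_FONTS = {
    "windows": {"Segoe UI"},
    "macos": {"Helvetica Neue"},
}


@dataclass(frozen=True)
class Fingerprint:
    operating_system: str            # e.g. "windows" or "macos"
    gpu_renderer: str                # renderer string exposed by the graphics stack
    installed_fonts: frozenset[str]
    extensions: frozenset[str]
    subpixel_hinting: bool


def inconsistencies(fp: Fingerprint) -> list[str]:
    """List fonts that contradict the operating system the fingerprint claims."""
    problems = []
    for os_name, fonts in OS_EXCLUSIVE_FONTS.items():
        if os_name != fp.operating_system:
            leaked = sorted(fonts & fp.installed_fonts)
            problems.extend(f"font {name!r} suggests {os_name}" for name in leaked)
    return problems
```

The point of the sketch is that parameters cannot be chosen independently: a fingerprint is only convincing if its fields agree with one another.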

This is the month of AI on The Web Scraping Club, in collaboration with Nimble.

During this period, we'll learn more about how AI can be used for web scraping and where the state of the art stands. I'll try some cool products and imagine what the future holds for the industry. At the end of the month, there will be a great surprise for all readers.

And we could not have a better partner for this journey: AI applied to web scraping is the core expertise of the company and I’m sure we will learn something new by the end of the month.

The Process: Building a New Browser Engine from Scratch

Developing the Nimble Browser and its AI Fingerprinting Engine involved two main challenges:

  1. The development of a high-performance, highly compatible browser from the ground up that would maintain data accuracy at scale and allow for deep control over its fingerprint.
  2. The training and optimization of an AI engine that would produce reliable, effective fingerprints and overcome anti-bot defenses.

Developing a web browser from scratch is no small feat. Modern web browsers are incredibly complex and rely on a multitude of technologies to access and render web pages. Nimble Browser had to be at the forefront in order to accurately mimic existing popular browsers, support the diverse array of current web technologies, and maintain data accuracy at any scale.

Although we learned a lot from existing browser projects like Chromium, we decided to develop Nimble Browser from scratch to ensure full control over all parameters that could influence the final fingerprint.

Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.

AI Fingerprinting: Training the Model

In terms of the AI engine, there were two datasets that we needed in order to train our model – fingerprints and website profiles. Gathering fingerprints was relatively straightforward, but gathering profiles of various websites and their fingerprint-testing strategies was more challenging.

In a sense, we developed a fingerprint not just for the user, but also for the websites they were accessing – what kind of anti-bot defenses do they use? To what factors and parameters are their defenses sensitive?

We then developed massive training datasets of synthetic fingerprinting data and used them to train the AI engine to match each requested website's profile with a convincing, legitimate fingerprint that is well suited to accessing it.

In this way, the AI Engine is essentially monitoring the fingerprints of both sides of the process continuously, and adapting to new challenges in real time.
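The feedback loop described here, pairing a fingerprint with a website and adapting when blocks occur, can be sketched with a much simpler stand-in than a trained model. The Python example below keeps a per-site success table and greedily picks the fingerprint with the best smoothed success rate. The `FingerprintSelector` class, the fingerprint IDs, and the site names are all hypothetical; this is a toy substitute for illustration, not the AI engine itself.

```python
from collections import defaultdict


class FingerprintSelector:
    """Greedy per-site chooser based on observed block/success outcomes."""

    def __init__(self, fingerprint_ids):
        self.fingerprint_ids = list(fingerprint_ids)
        # (site, fingerprint_id) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def score(self, site, fingerprint_id):
        successes, attempts = self.stats[(site, fingerprint_id)]
        # Laplace smoothing: untried pairs start at 0.5 instead of dividing by zero.
        return (successes + 1) / (attempts + 2)

    def choose(self, site):
        # Ties resolve to the earliest fingerprint in the list.
        return max(self.fingerprint_ids, key=lambda fp: self.score(site, fp))

    def record(self, site, fingerprint_id, blocked):
        entry = self.stats[(site, fingerprint_id)]
        entry[1] += 1
        if not blocked:
            entry[0] += 1
```

A caller would invoke `choose(site)` before each request and `record(site, fingerprint_id, blocked)` afterwards, so a fingerprint that starts getting blocked on one site loses priority there without affecting its standing on other sites.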

The Challenge: Handling a Universe of Data

Navigating the massive and ever-growing public web is no small task, given its incredible diversity and unpredictable edge cases. However, our AI system is up to the challenge and optimizes itself in real time so that users consistently receive data and pass anti-bot checks.

It's a mammoth task, but our AI is designed to handle this exact demand by continuously ingesting and training on new data.

Real-World Applications: The Nimble Browser Advantage

Nimble Browser is already at work in the real world, helping users overcome anti-bot defenses and maintain a smooth and consistent flow of invaluable data. Some use cases include:

  • E-commerce: dynamic pricing algorithms made possible by smooth and accurate data obtained with Nimble Browser.
  • Marketing: Nimble Browser ensures consistent access to search engines, enabling data-driven SEO and SEM analysis.
  • Brand Protection: detecting brand infringements at scale across a variety of hubs and automating search engines to discover infringements using Nimble Browser.

Nimble Browser is most effectively utilized through Nimble APIs – our end-to-end data collection solution. Nimble APIs automate the entire process of accessing, structuring, and delivering web data, and are operated easily through a straightforward REST API.

Nimble APIs combine Nimble IP – our premium proxy network – with Nimble Browser to achieve effortless web data collection in a unified and simplified platform.

Nimble Browser in Action: a Practical Demonstration

To see Nimble Browser in action, let's perform a real-world data collection request through the Nimble API.

In the example below, we use the Nimble Web API to access e-commerce data. Behind the scenes, the Web API uses Nimble Browser to access the data by leveraging the AI Fingerprinting engine to imitate genuine user fingerprints and navigate through the site without triggering any defensive mechanisms.

curl -X POST '<>' \
--header 'Authorization: Basic <credential string>' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "<>",
    "country": "US",
    "method": "GET",
    "parse": true,
    ...
            "selector": "#a-autoid-12-announce",
    ...
}'

The above example also incorporates several of the Web API’s incredible features, including NLP-powered parsing, multi-level geolocation targeting, and page interactions.

  • NLP Parsing: Nimble APIs use sophisticated Natural Language Processing to “read” webpages in the same way humans would, and can structure web data into machine-readable formats including JSON and CSV.
  • Geolocation targeting: When a country is specified, Nimble APIs use Nimble IP to obtain a premium residential proxy in that target country. Additional mechanisms are also used on a per-site basis including locale, top-level domains, and custom locations.
  • Page Interactions: Nimble APIs allow for custom, JavaScript-like interactions to be run on the page before data is collected. In the above example, we use the wait and click feature to click on a variant of the requested product, then wait three seconds before collecting the data.
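For readers who prefer Python to curl, the request above can be sketched with the standard library alone. This is a hedged sketch rather than official client code: `build_request` is a hypothetical helper, the endpoint and target URL are left as parameters because the original example elides them, and the credential string is assumed to follow the standard HTTP Basic scheme (base64 of `username:password`). Only the `url`, `country`, `method`, and `parse` fields shown in the curl example are included; the page-interaction fields are omitted because their exact structure is not shown.

```python
import base64
import json
import urllib.request


def build_request(endpoint, username, password, target_url, country="US"):
    """Build (but do not send) a POST request mirroring the curl example."""
    # Assumption: standard HTTP Basic credentials, base64("username:password").
    credentials = base64.b64encode(f"{username}:{password}".encode()).decode()
    payload = {
        "url": target_url,    # page to collect
        "country": country,   # geolocation target
        "method": "GET",
        "parse": True,        # ask for structured, parsed output
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; building the request separately makes it easy to inspect the headers and body before any network call is made.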

Learn more about Nimble APIs’ web data collection capabilities by visiting our docs.

Nimble Lab: Where Innovation Meets Rigorous Testing

At Nimble, we don't rest on our laurels. Our Nimble Lab team constantly tests our browser against a variety of anti-bot mechanisms, and we use a deep learning approach to respond whenever a block is detected.

We are particularly proud of our real-time browser update system. With a keen eye on the ever-evolving web environment, we emphasize swift browser compilation, allowing us to respond rapidly to new challenges.

The Vision: Effortless Web Data Collection

We believe that the Nimble Browser’s AI breakthrough is more than just technological innovation. Our vision is to democratize public web data, making it cost-effective and accessible to any business at any scale.

Delivering real, tangible value to our customers is our primary goal, and we believe our AI-powered browser empowers our users to navigate the complex landscape of the web, scraping data swiftly, effectively, and with minimal obstacles.

This vision drives us, fuels our innovation, and ensures we remain at the forefront of web scraping technology. We invite you to join us on this exciting journey and experience the power of the Nimble Browser first-hand.