The Lab #4: Scrapyd Unleashed: The Art of Managing Multiple Web Scrapers



Pros and cons of the existing scheduling solutions for Scrapy


Web scraping is like eating cherries: one website leads to another, and you soon find yourself with hundreds of scattered scrapers on your servers, randomly scheduled via crontab.

In this post, we’ll see how to handle this complexity with some tools that build on Scrapy’s built-in features to provide a web dashboard for monitoring and scheduling your scrapers.

Why Scrapy?

Scrapy spiders are easier to manage because they come bundled with a Telnet console that allows external software to query the scraper’s status and report it in a web dashboard.

From the Telnet console, you can pause, resume, and stop scrapers and monitor statistics about the Scrapy engine and the data collection.

All you need to do is to log in via telnet to the address of the machine where Scrapy is running, using the username and password provided in the settings file.

These are the default values when a scraper is created, which you can override in your project’s settings.py file:

TELNETCONSOLE_ENABLED = 1
TELNETCONSOLE_PORT = [6023, 6073]
TELNETCONSOLE_HOST = '127.0.0.1'
TELNETCONSOLE_USERNAME = 'scrapy'
TELNETCONSOLE_PASSWORD = None

If the password is not set, Scrapy will automatically generate a random one when the scraper starts. During start-up, you should see a line like the following:

[scrapy.extensions.telnet] INFO: Telnet Password: fe53708491f51304
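
If you prefer scripting the connection instead of typing commands interactively, here is a minimal sketch using Python’s standard telnetlib module (deprecated since Python 3.11 and removed in 3.13). The host, port, and prompt strings are assumptions based on a default Scrapy setup, and the password is the one printed at start-up; the stats.get_stats() command sent at the end is the one described just below.

from telnetlib import Telnet

HOST, PORT = '127.0.0.1', 6023
USERNAME = 'scrapy'
PASSWORD = 'fe53708491f51304'  # the random password printed at start-up

with Telnet(HOST, PORT, timeout=10) as tn:
    tn.read_until(b'Username:')            # prompt strings assumed from a default setup
    tn.write(USERNAME.encode() + b'\n')
    tn.read_until(b'Password:')
    tn.write(PASSWORD.encode() + b'\n')
    tn.read_until(b'>>>')                  # wait for the console prompt
    tn.write(b'stats.get_stats()\n')       # any console command can be sent this way
    print(tn.read_until(b'>>>').decode())  # raw output returned by the console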

Once you connect to the console, you can retrieve your scraper stats with the command

stats.get_stats()

and get the same stats you usually see at the end of the execution:

{'log_count/INFO': 10,
 'start_time': datetime.datetime(2022, 10, 11, 18, 19, 8, 178404),
 'memusage/startup': 56901632,
 'memusage/max': 56901632,
 'scheduler/enqueued/memory': 43,
 'scheduler/enqueued': 43,
 'scheduler/dequeued/memory': 13,
 'scheduler/dequeued': 13,
 'downloader/request_count': 14,
 'downloader/request_method_count/GET': 14,
 'downloader/request_bytes': 11016,
 'robotstxt/request_count': 1,
 'downloader/response_count': 9,
 'downloader/response_status_count/404': 2,
 'downloader/response_bytes': 30301,
 'httpcompression/response_bytes': 199875,
 'httpcompression/response_count': 9,
 'log_count/DEBUG': 14,
 'response_received_count': 9,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'downloader/response_status_count/200': 7,
 'request_depth_max': 2,
 'item_scraped_count': 5,
 'httperror/response_ignored_count': 1,
 'httperror/response_ignored_status_count/404': 1}

You can also check the status of the Scrapy engine and control its execution, pausing or terminating it, with the following commands:

engine.pause()
engine.unpause()
engine.stop()
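
Besides pausing and stopping, the console exposes a few more objects and shortcuts for inspection. Here is a short sketch of commands typed at the console prompt; the names come from the Scrapy documentation, but exact attributes may vary between versions:

est()                            # print a report of the current engine status
engine.spider.name               # name of the spider currently running
len(engine.downloader.active)    # number of requests currently being downloaded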

Now that we understand how to gather stats and control the scraper remotely, we could certainly write some scripts to collect all this information, but there’s no need to reinvent the wheel: several open-source solutions can already do it for us.


Liked the article? Subscribe for free to The Web Scraping Club to receive twice a week a new one in your inbox.


Scrapyd

Scrapyd is an application that schedules and monitors Scrapy spiders, and it also provides a (very) basic web interface. It can also be used as a versioning tool for scrapers, since it allows keeping multiple versions of the same scraper, even though only the latest one can be launched.
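
To give an idea of how it works: once Scrapyd is running (on port 6800 by default) and your project has been deployed to it, spiders are launched and monitored through a simple JSON API over HTTP. Here is a minimal sketch using the requests library, where the project and spider names are placeholders:

import requests

SCRAPYD = 'http://localhost:6800'

# schedule a run of the spider 'myspider' from the deployed project 'myproject'
resp = requests.post(f'{SCRAPYD}/schedule.json',
                     data={'project': 'myproject', 'spider': 'myspider'})
print(resp.json())   # e.g. {'status': 'ok', 'jobid': '...'}

# list pending, running and finished jobs for the same project
jobs = requests.get(f'{SCRAPYD}/listjobs.json', params={'project': 'myproject'})
print(jobs.json())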

The full article is available only to paying users of the newsletter.
You can read this and other The Lab paid articles after subscribing

