Building a Web Scraper with Ruby: Cats + Ruby + CLI + Emails

Interested in writing a webpage scraper in Ruby? Or maybe a regular CLI app? This article is about using OOP in Ruby to create a CLI app that regularly checks a webpage for updates and sends emails when updates are found. I'll mostly talk about the gems used, decisions made, and the challenges I faced during development.

Do you like cats? I do. In December 2020, I wanted to adopt one. In Montréal, one can adopt pets from the SPCA. Usually, pets get adopted very fast! Thus, I felt the need for an app which would notify me of new arrivals.

Also, having worked with Ruby throughout 2020, I had realized that I love writing Ruby code. Having lost my Ruby job, I had been hungry for writing Ruby for over a month 🤤. Thus, I decided to do a Ruby coding marathon to celebrate my birthday! I decided to write a small CLI app to watch the pet adoption website and to send me emails when new cats are available 🐈.

This article does not aim to promote my app, but to share the underlying code and findings with Ruby enthusiasts who aim to write similar apps.

Two minute version

I started by creating a list of features for my app:
- Watch the SPCA website for new items at regular intervals.
- Send email notifications when new items are found.
I used Thor to support CLI parameters.
- --interval sets the watch interval.
- --category sets pet category.
- --email enables email notifications.
- --verbose enables verbose output.
See the SPCA::Cli#scan method for further details.
The app tries to keep code organized into logical classes.
The app has a reasonable number of tests.
- I would’ve added more tests if this were a real, commercial app.
- I’d love to add some integration tests.
Explore the source code for the app and enjoy.

A black and white — Io was my neighbour in Colombia, named after Jupiter’s moon.

Specifications

Working on a project is more fun with a checklist of features and a deadline, so I started with some project management.

Deadline

12 hours to write an MVP.
12 hours to make refinements.

Features

Watch the SPCA website for new pets at intervals of 5 minutes.
Send email notifications to configured email addresses.
- Preferably, include images in the email body.
Allow choosing a category to watch, i.e. dogs, cats, etc.
Test the code well because coding is less fun without tests 🤷🏽‍♂️.

Here’s what the app looks like in action.

$ ruby spcas.rb --verbose --email --interval=5

2 new item(s) found.
======
Lavande | Cat • Senior • Female
https://www.spca.com/en/animal/lavande
------
Meridyth | Cat • Kitten • Female
https://www.spca.com/en/animal/meridyth

Docker environment

Reference: .lando.yml

To kick it off, I created a Docker-based dev environment with Lando, using a community-maintained Ruby image. I like Docker because it helps me keep my laptop clean. I’ve been using Lando for Drupal projects for a while, so I decided to use it here as well.

Dependencies

Reference: Gemfile

My goal was to follow the KISS principle as strictly as possible. Here are some noteworthy dependencies.

Nokogiri

If you’re parsing XML/HTML, chances are you’ll run into Nokogiri. It lets you work with the DOM with ease. I used it to parse the HTML fetched from the remote website and to convert it into domain objects.

Mail

Reference: spca/mail.rb

You guessed it! The mail gem lets you send emails. My initial plan was to use sendmail for sending emails, but I ended up using SMTP to improve deliverability.

Thor

Reference: spca/cli.rb

Since my CLI app allows the user to pass in parameters like --category, I ended up using Thor. Thor is a toolkit that makes it easy to build CLI apps. I was introduced to Thor while working on SiteDiff and I thought it’d be a good fit for this project as well.

Others

Here are some other noteworthy mentions:

RuboCop: Lints the code and helps enforce coding standards.
RSpec: Helps with writing tests. Lately I’ve realized that I like minitest more.
Climate Control: Helps you manipulate environment variables while running tests, hence the name.
Dotenv: Loads environment variables from .env files. I chose to put SMTP configuration parameters in a .env file, which is not version controlled.
Pry: For debugging.
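
To illustrate the Dotenv approach, here is a minimal sketch of building SMTP settings from environment variables. The variable names (SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASS) and the smtp_settings helper are assumptions made for this example, not names taken from the app. With Dotenv loaded, values from the .env file would already be in ENV by the time this runs.

```ruby
# Hypothetical helper: builds SMTP settings from environment variables.
# The variable names are illustrative assumptions, not the app's real ones.
def smtp_settings(env = ENV)
  {
    address: env.fetch('SMTP_HOST'),              # required; raises KeyError if missing
    port: Integer(env.fetch('SMTP_PORT', '587')), # optional, falls back to a default
    user_name: env.fetch('SMTP_USER', nil),
    password: env.fetch('SMTP_PASS', nil)
  }
end

# Usage with an explicit hash instead of the real environment:
settings = smtp_settings({ 'SMTP_HOST' => 'smtp.example.com' })
puts settings[:address]
```

Using fetch for the required value makes a missing setting fail loudly at startup instead of surfacing later as a confusing delivery error.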

Project structure

For small Ruby projects like this one, there are no strict rules for project structure. Having looked at some Ruby projects, I chose to organize my code as follows:

.
├── app
│   └── bootstrap.rb # loads dependencies
├── lib # libraries
│   ├── spca.rb # The SPCA module (namespace)
│   └── spca
│       └── *.rb # SPCA::* classes
├── spec # contains tests
│   ├── fixtures
│   ├── spec_helper.rb # config and bootstrapping for tests
│   └── *_spec.rb # tests for the SPCA::* classes
├── Gemfile
└── spcas.rb # app entrypoint

I’m aware that some things could’ve been slightly better, but I didn’t want to invest endless time on building an app that I’d probably use for a month.

How it works

One of the objectives of writing this code was to POOP (practice object-oriented programming). I wanted to have more Ruby code samples for Ruby developer interviews because I really loved (love?) Ruby. Thus, I tried to follow the SOLID principles and organized my code into classes.

Entrypoint

Reference: spcas.rb

The spcas.rb file in the root of the project acts as the entry point for this app. Simply execute this file with Ruby to run the app. This file includes app/bootstrap.rb which loads all dependencies – both SPCA::* classes and libraries installed with Bundler.

# ...
Bundler.require(:default, :development)
require_relative '../lib/spca'
# ...

I would’ve loved to put bootstrap.rb in lib/spca. I chose to put it in app because it is more of an app-level file that deals with more than just the SPCA libraries.

SPCA namespace

Reference: spca.rb

Next comes the lib/spca.rb file, which creates the SPCA module that acts as a namespace for all the SPCA::* classes. A part of me wanted this SPCA module to extend Thor and implement the CLI. However, the CLI is implemented in SPCA::Cli for two reasons:

SPCA had to be a module to act as a namespace for all SPCA::* classes.
Thor can only be extended by classes, not modules. Thus, the SPCA::Cli class was created.

To kick the app off, SPCA::Cli.start is invoked.

Cli

Reference: spca/cli.rb

You can think of this class as the controller for the app. SPCA::Cli extends Thor, and all the action takes place in the #scan method, which is defined as the default action for the SPCA CLI.

module SPCA
  class Cli < Thor
    # Thor only registers a command that has a desc; this text is illustrative.
    desc 'scan', 'Scan the SPCA website for new pets'
    option :email, type: :boolean
    option :category, type: :string
    option :interval, type: :numeric, default: 0
    option :verbose, aliases: '-v', type: :boolean, default: false
    def scan
      # Hither lies the action.
    end

    default_command :scan
  end
end

Now let’s take a look at what goes on in SPCA::Cli#scan.

SPCA::Scanner scans the URL that contains pet information.
- SPCA::Cache is used to cache results.
- This helps filter out new items from old items.
If any new items are found, they’re forwarded to two processors:
- Cli#send_puts(pets) shows the pet info on stdout.
- Cli#send_mail(pets) creates an SPCA::Mail for the pets and sends out emails.
Apart from all this, there’s an infinite loop that keeps running at intervals of x minutes if the parameter --interval is present.

Thus, the app can be executed with several configurable parameters as follows:

$ ruby spcas.rb --interval=5 --verbose --email
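
The interval loop described above can be sketched roughly as follows. This is not the code from SPCA::Cli#scan: the watch method, its parameters, and the max_runs cap are assumptions made for this example. The sleeper is injectable so that the loop can be exercised without actually waiting.

```ruby
# Hypothetical sketch of the watch loop; all names are illustrative.
# scan:     callable that returns an array of newly found items
# interval: minutes between scans; 0 means "scan once and stop"
# sleeper:  callable that waits for the given number of seconds
# max_runs: optional cap on iterations (nil means run until interrupted)
def watch(scan:, interval: 0, sleeper: ->(seconds) { sleep(seconds) }, max_runs: nil)
  runs = 0
  loop do
    new_items = scan.call
    # Hand new items to the caller (for example, to print or email them).
    yield new_items if block_given? && !new_items.empty?
    runs += 1
    break if interval.zero? || (max_runs && runs >= max_runs)

    sleeper.call(interval * 60)
  end
  runs
end

# Usage: a single scan, because no interval is given.
watch(scan: -> { ['Lavande'] }) { |items| puts "#{items.length} new item(s) found." }
```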

Fetcher

Reference: spca/fetcher.rb

The fetcher, as the name suggests, takes a URL and fetches its contents.

module SPCA
  class Fetcher
    def initialize(cache); end
    def fetch(uri); end
  end
end

As you might’ve noticed, the fetcher takes a cache object, where it caches the response from the URL it fetches. Fetching a response from a remote URL is an expensive operation, so by default, the fetcher assumes that the response from a particular URL won’t change for 15 minutes.
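
Here is a minimal sketch of the idea of reusing a response until it is 15 minutes old. It is not the app's SPCA::Fetcher: the CachingFetcher class keeps entries in memory rather than in an SPCA::Cache, and the http and clock parameters are assumptions added so the behaviour can be tested without a network or real waiting.

```ruby
require 'net/http'
require 'uri'

# Hypothetical sketch: caches response bodies in memory and reuses them
# until they are older than `ttl` seconds.
class CachingFetcher
  DEFAULT_TTL = 15 * 60 # 15 minutes, matching the default described above

  # http:  callable that takes a URI and returns the response body
  # clock: callable that returns the current time
  def initialize(ttl: DEFAULT_TTL, http: ->(uri) { Net::HTTP.get(uri) }, clock: -> { Time.now })
    @ttl = ttl
    @http = http
    @clock = clock
    @entries = {}
  end

  def fetch(uri)
    entry = @entries[uri.to_s]
    # Serve the cached body while it is still fresh.
    return entry[:body] if entry && @clock.call - entry[:at] < @ttl

    body = @http.call(uri)
    @entries[uri.to_s] = { body: body, at: @clock.call }
    body
  end
end
```

Injecting the HTTP call as a lambda is one way to keep such a class easy to unit-test; the tests described later in this article mock Net::HTTP instead.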

Cache

Reference: spca/cache.rb

This is a basic implementation of a key-value cache storage that uses the file system.

module SPCA
  class Cache
    def get(key); end # Get value from cache.
    def set(key, data); end # Set value in cache.
    def remove(key); end # Remove value from cache.
    def exist?(key); end # Check if value exists in cache.
    def clear; end # Remove all items in cache.
  end
end

If I had more time, I would’ve created SPCA::Cache as an interface and then written one or more implementations at SPCA::Cache::FileSystemCache. But it didn’t make much sense to do all that for a weekend project.
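
For readers who want to build something similar, here is a minimal sketch of a file-backed key-value cache with the same five methods. The FileCache name, the hashed file names, and the JSON encoding are assumptions made for this example; the app's SPCA::Cache may store its data differently.

```ruby
require 'digest'
require 'fileutils'
require 'json'
require 'tmpdir'

# Hypothetical sketch of a file-system cache; one file per key.
class FileCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  def get(key)
    exist?(key) ? JSON.parse(File.read(path(key))) : nil
  end

  def set(key, data)
    File.write(path(key), JSON.generate(data))
    data
  end

  def remove(key)
    FileUtils.rm_f(path(key))
  end

  def exist?(key)
    File.file?(path(key))
  end

  def clear
    FileUtils.rm_f(Dir.glob(File.join(@dir, '*.cache')))
  end

  private

  # Hash the key so that any string (such as a URL) maps to a safe file name.
  def path(key)
    File.join(@dir, "#{Digest::SHA256.hexdigest(key.to_s)}.cache")
  end
end

# Usage, in a throwaway directory:
Dir.mktmpdir do |dir|
  cache = FileCache.new(dir)
  cache.set('https://example.com/cats', { 'title' => 'Lavande' })
  puts cache.get('https://example.com/cats')['title']
end
```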

Scanner

Reference: spca/scanner.rb

The fetcher is only responsible for fetching data from a URL. But someone has to interpret the fetched data and parse it into data that the app can use. That’s what the scanner does. Let’s take a look at SPCA::Scanner#execute().

Takes a pet category, i.e., SPCA::Category, which contains the URL on which pets in that category can be found.
Uses a Fetcher to fetch the HTML response from that URL.
The result returned by the Fetcher is converted into SPCA::PetCard objects containing title, image, and other info about pets.
If the fetched item is already in cache, it is ignored.
Items that are not already cached (i.e. new items), are cached and returned for further processing.
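
The last two steps, ignoring cached items and returning only new ones, can be sketched like this. The PetCard struct and the NewItemFilter class are stand-ins invented for this example, and the sketch remembers seen URLs in an in-memory Set instead of the app's file-based cache.

```ruby
require 'set'

# Hypothetical stand-in for SPCA::PetCard; the fields are assumptions.
PetCard = Struct.new(:title, :url, keyword_init: true)

# Hypothetical sketch: remembers the URLs it has already seen and returns
# only the cards that were not seen before.
class NewItemFilter
  def initialize(seen = Set.new)
    @seen = seen
  end

  def call(cards)
    # Set#add? returns nil when the element is already present,
    # so only previously unseen cards are selected (and remembered).
    cards.select { |card| @seen.add?(card.url) }
  end
end

filter = NewItemFilter.new
lavande = PetCard.new(title: 'Lavande', url: 'https://www.spca.com/en/animal/lavande')
puts filter.call([lavande]).length
puts filter.call([lavande]).length
```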

Mail

Reference: spca/mail.rb

The mail class was created to provide abstraction from third-party mailing libraries. Thus, the whole app won’t be affected if we change the mailing library in the future.

module SPCA
  class Mail
    def initialize(pets, mail: nil); end
    def deliver; end
  end
end

I would’ve created a Mail interface and implemented it in a PetFoundNotificationMail class to organize things further. However, it didn’t make sense for a weekend project. Besides, such an interface would’ve had only one implementation, so I said YAGNI.

Mail#initialize has a mail parameter? Yes, it does! That helps with unit-testing our mail class by letting us pass in a mock object.
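
Here is a minimal sketch of that kind of dependency injection. The PetNotification and FakeMailer classes, and the deliver(subject:, body:) interface they share, are assumptions made for this example; they are neither the app's SPCA::Mail nor the API of the mail gem.

```ruby
# Hypothetical sketch: a notification that delegates sending to whatever
# mailer object it is given.
class PetNotification
  def initialize(pets, mailer:)
    @pets = pets
    @mailer = mailer
  end

  def subject
    "#{@pets.length} new item(s) found"
  end

  def body
    @pets.map { |pet| "#{pet[:title]}\n#{pet[:url]}" }.join("\n------\n")
  end

  def deliver
    @mailer.deliver(subject: subject, body: body)
  end
end

# A test double that records messages instead of sending them.
class FakeMailer
  attr_reader :sent

  def initialize
    @sent = []
  end

  def deliver(subject:, body:)
    @sent << { subject: subject, body: body }
  end
end

mailer = FakeMailer.new
pets = [{ title: 'Lavande', url: 'https://www.spca.com/en/animal/lavande' }]
PetNotification.new(pets, mailer: mailer).deliver
puts mailer.sent.first[:subject]
```

In a test, assertions can then be made against what the fake recorded, without any SMTP server involved.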

An email message containing a picture of a senior cat — When new pets are found, a message is sent to configured email addresses.

Testing

In order for the code to be testable, it must be written in a specific way! Thus, most of the code that does the heavy lifting is in classes which can easily be unit-tested. Though most of the tests turned out to be straightforward, here are some that took a while to put in place.

File-system tests

Reference: spec/spca/cache_spec.rb

The app stores pet information in cache files. To avoid complexity, cache items are simply stored on the file system. Thus, cache_spec.rb makes use of the /tmp directory to perform caching tests.

module SPCA
  describe Cache do
    subject { Cache.new('/tmp') }

    before(:each) do
      subject.clear
    end

    # ...
  end
end

It has been assumed that this code won’t be run on Windows, so I’ve hard-coded path separators as slash characters.

HTTP request tests

Reference: spec/spca/fetcher_spec.rb

The Fetcher is responsible for fetching data from remote URLs using Ruby’s Net::HTTP class. While unit-testing, it can safely be assumed that Net::HTTP is well-tested by the Ruby team. Thus, I chose to mock its responses.

# Sample test for SPCA::Fetcher
it '.fetch gets data with HTTP request when URI is not cached' do
  response = MockNetHttpResponse.new('...')

  expect_any_instance_of(Net::HTTP)
    .to receive(:get)
    .with(@uri.request_uri)
    .and_return(response)

  expect(@fetcher.fetch(@uri)).to eq(response.body)
end

HTML parsing tests

Reference: pet_card_spec.rb

It is very tempting (and easy) to simply fetch a remote URL, take its HTML, and run tests on it. However, there are a few problems:

The remote site is not mine, so I avoid sending unnecessary requests to it.
The remote site might be down at some point, which will make my tests fail.
- Sending a real request could happen in an integration test though.
The remote site might be slow, which will slow down my test execution.

To bypass these issues, I stored a sample HTML response in a fixtures directory, which is later used for tests that deal with parsing that response.

it '.from_element creates a PetList' do
  el = Nokogiri::HTML.parse(
    File.read("#{SPCA::ROOT_PATH}/spec/fixtures/list.html")
  )
  list = PetList.from_element(el)

  expect(list).to be_a_kind_of(Array)
  expect(list.length).to eq(2)

  list.each do |item|
    expect(item).to be_a_kind_of(SPCA::PetCard)
  end
end

Integration tests

If this were a real project for a real client (or if I had more time), I would’ve included an integration test which would fetch data with a real HTTP request and make sure that the app as a whole works correctly. However, I didn’t write one because the app is mainly controlled by SPCA::Cli, which depends on Thor, which I know to be well-tested.

Someday, I’d love to write a test for the SPCA::Cli class to make sure it handles all command-line parameters correctly.

Conclusion

Writing this app helped me use object-oriented programming to solve a real-life problem. Also, writing these classes and testing them gave me immense joy. The Ruby ecosystem has many packages (gems) that make it very easy to write data scrapers.

Though I wrote this app to help me find a cat, I ended up not adopting one.

A part of me indeed loves working with Ruby. Unfortunately, most Ruby companies prefer hiring Ruby devs with years of experience, thereby dismissing candidates with less experience (like me) without even seeing their code 🤷🏽‍♂️.

Though I continue to love Ruby, I’ve had to make the difficult decision of switching to PHP for now. I’ll probably move to Python in the long term if I decide to keep coding for the rest of my life 👨🏽‍💻.

Next steps

See the source code for the SPCA Scanner app.
Write a simple data-scraper/HTML-parser app in Ruby.
Read about building a CLI app in Python.
Thinking about getting a pet? Consider adopting one.
