So, as with most things, this turned out to be less difficult than I anticipated.
Really - you can sit down and build your own web scraper in less than an hour with Ruby.
Plus, it's interesting and weirdly satisfying.
Intro
Let's say I have a business and want to gather data for upcoming events. The data on websites is structured in different ways and can sometimes be deeply nested. Grabbing the first h4 element on every page isn't always going to return a nice little event title.
It would be nice if you could just grab "Event name" from your chosen websites and extract that pertinent data. Well, you sort of can, with some work. I'm going to show you a simple down-and-dirty method that works for many websites.
This assumes you have some experience with Ruby, or at least a basic understanding of the request/response cycle, so you can follow along.
In this follow-along we will:
- Create a simple Ruby project
- Understand and implement the Faraday gem
- Get custom results for slightly more complicated websites with the Nokogiri gem.
- Output our hard work to the terminal - yay!
Let's Begin!
I use VSCode and my screenshots will be from there, if you want to follow along.
Make sure you have Ruby installed by running ruby -v in the terminal. If you need to install Ruby, check out the docs.
Make a folder called ruby_scraper and open it in your editor. In the terminal, run
bundle init
to create your Gemfile. With the Gemfile created, next make your Ruby file. For simplicity I'll call it
scraper.rb
You should have one folder, ruby_scraper, with your two files inside.
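At this point your layout should look something like this (Gemfile.lock will appear after your first bundle install):

ruby_scraper/
├── Gemfile
└── scraper.rb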
Install Faraday
According to the Faraday Docs:
"Faraday gives you the power of Rack middleware for manipulating HTTP requests and responses, making it easier to build sophisticated API clients or web service libraries that abstract away the details of how HTTP requests are made."
...cleaner, faster, and easily understood.
Sounds legit. Add Faraday to your Gemfile:
gem 'faraday', '~> 2.0'
You know - or whatever version we're rolling with by the time you read this. As of this writing, it's 2.x.
After adding the gem in your Gemfile, head to the terminal and run our favorite:
bundle install
For performance, you may also want to install faraday-gzip, which asks the server for gzip-compressed responses and decompresses them for you - faster transfers for the performance-conscious.
In your Gemfile add:
gem 'faraday-gzip', '~> 3'
In the terminal:
bundle install
Pause - if you want to parse your response as JSON, you can read the docs here about how that may need to look. But I prefer using the Nokogiri gem below for more serious digging.
Let's also install the Nokogiri gem, which lets you pick data out of a web page more precisely - using CSS!
In your Gemfile add:
gem 'nokogiri', '~> 1.18'
Then in terminal:
bundle install
Super exciting. Yes - you can grab that exact piece of data from a webpage using a CSS class name or id! This is SO useful. Many websites don't have an API for you to navigate, so you have to make your own jerry-rigged pathway. Setting up my scraper I wondered: "What if I want a specific deeply-nested div from a page?" Nokogiri is your answer, assuming the element has a class or id of some sort.
If you are navigating a classless website, well - probably just poke around with JSON. Most of what you will find today is going to have class names. Moving on...
Check your Gemfile
I had to install some additional gems to get bundler running properly. Yours may vary a bit depending on your setup, but the core of the Gemfile should look something like this:
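# Gemfile
source 'https://rubygems.org'

gem 'faraday', '~> 2.0'
gem 'faraday-gzip', '~> 3'
gem 'nokogiri', '~> 1.18'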
Check Folder Structure
The important items are Gemfile, Gemfile.lock, and scraper.rb - ignore any readme or image files, they don't matter here. (If you're new to Ruby: do not mess with Gemfile.lock, it's auto-generated when you bundle!) Just make sure those three files are directly in the folder, not in sub-folders.
Ruby file setup
Pull in your required gems by adding them to the very top of your scraper.rb file.
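Something like this - note that the faraday-gzip gem is loaded as faraday/gzip:

# scraper.rb
require 'faraday'
require 'faraday/gzip'
require 'nokogiri'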
Let's set up our Faraday connection next. We need our base URL and content type, and we need to create a user-agent name. We're also gonna add a little snippet at the end for gzip compression.
conn = Faraday.new(
  url: 'https://chicagoevents.com',
  headers: {
    'Content-Type' => 'application/json',
    'User-Agent' => 'MyScraper' # Agent can be whatever
  }
) do |f|
  f.request :gzip # Enable gzip compression for requests
end
URL - https://chicagoevents.com - don't include a complicated path here. Just the base URL.
Content-Type will be application/json as Faraday will parse this.
User-Agent can be a name you create.
After that, we are going to make a connection to the specific web page we want:
res = conn.get('https://chicagoevents.com/vendors-and-artists/')
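Optionally (not required for the scraper to work), you can sanity-check the response before parsing it:

puts res.status # should print 200 on success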
Then we wanna run the response through the Nokogiri gem so we can access items with CSS. Save it to a new variable so we can access it later.
doc = Nokogiri::HTML(res.body)
So far, scraper.rb should look roughly like this:
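require 'faraday'
require 'faraday/gzip'
require 'nokogiri'

conn = Faraday.new(
  url: 'https://chicagoevents.com',
  headers: {
    'Content-Type' => 'application/json',
    'User-Agent' => 'MyScraper'
  }
) do |f|
  f.request :gzip
end

res = conn.get('https://chicagoevents.com/vendors-and-artists/')
doc = Nokogiri::HTML(res.body)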
Picking out data
Let's take our doc variable and use it to navigate the page's CSS from a certain starting point. You will have to inspect your website of choice and see where to start in order to grab all the data you want.
I see that all the data I want lives inside elements with the CSS class .tribe-events-pro-photo__event-details. I set that as my CSS area of interest like so:
articles = doc.css('.tribe-events-pro-photo__event-details')
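To confirm the selector actually matched something, you can print the count (an optional check):

puts articles.length # how many event blocks we found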
Implement the loop
Now that I have my hot little hands on the interesting part, I can grab each one using a loop.
articles.each do |article|
# some code stuff
end
I loop over each occurrence of this class, picking out items by class or tag. Inside the loop we use the article.at_css method to make our selections. It's also important to clean up our response to remove annoying whitespace.
Let's get the event title. I see on my inspect that an h3 tag contains the titles. So I create a new variable, grab the h3 element, then display and clean it up in scraper.rb:
articles.each do |article|
  title = article.at_css('h3')&.text&.strip
  puts "Event: #{title}"
end
Run it using bundle exec ruby scraper.rb. The output should look something like this (the event names will depend on what's listed when you run it):
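Event: Sample Street Festival
Event: Makers Market at the Park
Event: Another Event Title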
Cater to your needs
Once you've got it working, you can easily add whatever your heart desires.
Here is an example of a fuller custom scraper. Treat it as a sketch: the .event-date and .event-venue selectors below are placeholders - swap in the real class names you find while inspecting your page:
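require 'faraday'
require 'faraday/gzip'
require 'nokogiri'

conn = Faraday.new(
  url: 'https://chicagoevents.com',
  headers: {
    'Content-Type' => 'application/json',
    'User-Agent' => 'MyScraper'
  }
) do |f|
  f.request :gzip
end

res = conn.get('https://chicagoevents.com/vendors-and-artists/')
doc = Nokogiri::HTML(res.body)

articles = doc.css('.tribe-events-pro-photo__event-details')

articles.each do |article|
  title = article.at_css('h3')&.text&.strip
  # '.event-date' and '.event-venue' are placeholder selectors -
  # inspect the page and substitute the actual class names you find
  date  = article.at_css('.event-date')&.text&.strip
  venue = article.at_css('.event-venue')&.text&.strip

  puts "Event: #{title}"
  puts "Date: #{date}"
  puts "Venue: #{venue}"
  puts '-' * 40 # separator line between results
end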
Then run bundle exec ruby scraper.rb again. With the placeholder selectors above, the shape of the output will be something like:
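Event: Sample Street Festival
Date: June 14
Venue: Placeholder Park
----------------------------------------
Event: Makers Market at the Park
Date: June 21
Venue: Placeholder Plaza
----------------------------------------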
I highly recommend adding a separator line between each result (the puts '-' * 40 above) to make your life visually easier.
Beast mode activate, pretty much.