Web Scraping with Python: Extracting AJAX-loaded Content with Selenium and BeautifulSoup

In the world of modern web scraping, websites often rely on AJAX (Asynchronous JavaScript and XML) to load content dynamically. These websites don’t load everything at once; instead, they fetch additional content in the background, typically using JavaScript, which makes scraping more complicated. Fortunately, by combining Selenium and BeautifulSoup, we can scrape data from AJAX-loaded pages just like we would with static HTML. In this article, we’ll explore how to scrape AJAX-loaded content using this powerful duo.


Step 1: Install Required Libraries

Before we dive into scraping, we need to install the necessary Python libraries. We'll use Selenium for automating browser interaction and BeautifulSoup to parse the HTML content.

To install the required libraries, run:

pip install selenium beautifulsoup4 webdriver-manager

You’ll also need a browser driver for Selenium, such as ChromeDriver. You can download it from the official ChromeDriver site, matching the version to your installed Chrome, or let the webdriver-manager package (used in the example below) download and manage it for you automatically.
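
If you prefer to manage the driver yourself instead of using webdriver-manager, you can point Selenium at the downloaded binary directly. Here’s a minimal sketch, assuming ChromeDriver was saved to /usr/local/bin/chromedriver (adjust the path for your system):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Path to the manually downloaded ChromeDriver binary (assumed location)
service = Service(executable_path="/usr/local/bin/chromedriver")
driver = webdriver.Chrome(service=service)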


Step 2: Set Up Selenium for Web Scraping

Let’s start by setting up Selenium to interact with the page. In this example, we'll use Selenium to load a page and allow JavaScript (AJAX) to render the dynamic content.

Here’s an example script:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Set up the Selenium WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Visit the page with AJAX-loaded content
url = "https://example.com/ajax-page"
driver.get(url)

# Wait up to 10 seconds for the AJAX-loaded element to appear in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "ajax-content"))
)

# Now, we can get the fully rendered page source
page_source = driver.page_source

# Pass the page source to BeautifulSoup for parsing
soup = BeautifulSoup(page_source, "html.parser")

# Find the element containing the AJAX-loaded content
content = soup.find("div", class_="ajax-content")
if content:
    print(content.text)

# Close the driver
driver.quit()

In this example, Selenium loads the page, waits explicitly until the AJAX-loaded element appears in the DOM, and then passes the fully rendered HTML to BeautifulSoup for easy parsing.
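
For scraping jobs that run on a server or in the background, you may want the browser to run without opening a visible window. Here’s a minimal sketch using Chrome’s headless mode; the rest of the script stays the same:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Configure Chrome to run headless (no visible browser window)
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)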


Step 3: Handle Infinite Scrolling

One common AJAX feature is infinite scrolling, where more content is loaded as the user scrolls down the page. To handle infinite scrolling with Selenium, we need to simulate scrolling down the page until all content has loaded.

Here’s how you can automate this:

import time

# Scroll down the page to trigger infinite scroll
for _ in range(5):  # Scroll 5 times
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # Wait for new content to load

# Now scrape the content
page_source = driver.page_source
soup = BeautifulSoup(page_source, "html.parser")

# Extract the content
items = soup.find_all("div", class_="item")
for item in items:
    print(item.text)

This script scrolls down the page five times, waits for new content to load, and then extracts the data.
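
Scrolling a fixed number of times works when you roughly know how much content there is. If you don’t, a common pattern is to keep scrolling until the page height stops growing, which suggests no more content is being loaded. Here’s a minimal sketch of that approach:

import time

# Keep scrolling until the page height stops changing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(3)  # Give the AJAX request time to finish
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # No new content was loaded, so stop scrolling
    last_height = new_height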


Step 4: Scraping Data from Multiple AJAX Requests

Sometimes, the content on a page is loaded through multiple AJAX requests. Instead of relying solely on the page’s final HTML, you might want to capture the requests made by the browser to the server.

You can do this by inspecting the network requests in your browser’s developer tools to identify the URLs from which the data is being fetched. These requests usually return JSON or XML data, which can be scraped directly.

Here’s an example of how you can capture AJAX requests and parse the JSON response:

import requests

# Identify the URL of the AJAX request (you can find this in the browser’s network tab)
url = "https://example.com/api/data"

# Send a GET request to fetch the data
response = requests.get(url)
data = response.json()

# Process the JSON data (the "results" and "title" keys are just an example;
# the actual structure depends on the endpoint)
for item in data["results"]:
    print(item["title"])

In this case, we use requests to fetch data from an AJAX endpoint directly, bypassing the need for Selenium altogether.
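
Many AJAX endpoints are paginated, so “multiple requests” often means calling the same endpoint repeatedly with a different page (or offset) parameter. Here’s a minimal sketch, assuming a hypothetical page query parameter and a "results" key in the response; check the network tab to see which parameters your target actually uses:

import requests

all_items = []
for page in range(1, 6):  # Fetch the first 5 pages (adjust as needed)
    response = requests.get(
        "https://example.com/api/data",
        params={"page": page},  # Hypothetical pagination parameter
    )
    response.raise_for_status()
    data = response.json()
    results = data.get("results", [])
    if not results:
        break  # No more data, stop early
    all_items.extend(results)

print(f"Fetched {len(all_items)} items")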


Step 5: Save the Scraped Data

Once you've successfully scraped the content, you may want to save the data to a file. Here's how you can save your scraped data to a CSV file:

import csv

# Sample data
scraped_data = [["Title 1", "Description 1"], ["Title 2", "Description 2"]]

# Save to CSV with a header row
with open("scraped_data.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Description"])  # Column headers
    writer.writerows(scraped_data)

This approach lets you save your scraped data in a structured format, such as CSV, for further analysis.
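
To connect this to the earlier steps, here is one way you might build the rows directly from the elements BeautifulSoup found; the "item", "title", and "description" class names are placeholders for whatever the target page actually uses:

import csv

rows = []
for item in soup.find_all("div", class_="item"):
    title = item.find("h2", class_="title")              # Placeholder selector
    description = item.find("p", class_="description")   # Placeholder selector
    rows.append([
        title.get_text(strip=True) if title else "",
        description.get_text(strip=True) if description else "",
    ])

with open("scraped_data.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Description"])
    writer.writerows(rows)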


✅ Pros of Using Selenium for Scraping AJAX Content

  • 🧠 Handles Dynamic Content: Selenium allows you to interact with pages that rely on AJAX for dynamic content loading, something BeautifulSoup alone cannot handle.
  • Full Browser Control: Selenium enables full browser automation, including interactions with buttons, form submissions, and scrolling.
  • 🚀 Can Be Combined with Other Tools: Use Selenium alongside other libraries like BeautifulSoup or requests to scrape dynamic and static content.

⚠️ Cons of Using Selenium for Scraping AJAX Content

  • 🐢 Slower than Static Scraping: Since Selenium runs a real browser, it is slower than traditional scraping methods like requests and BeautifulSoup.
  • Requires Browser Automation: Setting up Selenium involves browser driver management and can be overkill for simple static scraping tasks.
  • 💻 Resource-Heavy: Running a real browser consumes more system resources compared to simpler scraping methods.

Summary

Scraping AJAX-loaded content can be challenging, but using Selenium and BeautifulSoup together gives you the power to interact with dynamic content. Whether it’s handling infinite scrolling or extracting data from AJAX requests, these tools provide the flexibility you need to scrape modern websites efficiently.


For a much more extensive guide on web scraping with Python, including handling complex scenarios like authentication, pagination, and scraping at scale, check out my full 23-page PDF on Gumroad. It's available for just $10:

Mastering Web Scraping with Python Like a Pro.

If this was helpful, you can support me here: Buy Me a Coffee