Hello everyone, this is my first article here, so I hope you like it!

If you have been doing Web Scraping for a long time, you have probably noticed the same problems repeating across projects, like:

  1. Rapidly changing website structures — Sites frequently update their DOM structures, breaking static XPath/CSS selectors.
  2. Unstable selectors — Class names and IDs often change or use randomly generated values that break scrapers or make scraping these websites difficult.
  3. Increasingly complex anti-bot measures — CAPTCHA systems, browser fingerprinting, behavior analysis, and more make traditional scraping difficult.

But that's only if you are doing targeted Web Scraping for known websites, in which case you can write specific code for every website.

If you start thinking about bigger goals like Broad Scraping or Generic Web Scraping, or whatever you like to call it, then the above issues intensify, and you will face new ones like:

  1. Extreme Website Diversity — Generic scraping must handle countless variations in HTML structures, CSS usage, JavaScript frameworks, and backend technologies.
  2. Identifying Relevant Data — How does the scraper know what data is important on a page it has never seen before?
  3. Pagination variations — Infinite scroll, traditional pagination, and "load more" buttons all require different approaches, and the list goes on.

How are you going to solve that manually? I'm talking about generic web scraping of different websites that don't share any technologies.

AI to the rescue

Recently, there's been a noticeable shift toward AI-based web scraping, driven by its potential to address these challenges.

Of course, AI can solve most of these issues easily because it understands the page source: it can tell you where the fields you want are, or create selectors for them. That is, assuming you have already solved the anti-bot measures through other tools 😄

This approach is appealing, of course. I love AI and find it fascinating to keep learning about, especially GenAI. You will probably spend a lot of time on prompt engineering and tweaking prompts, but if that's fine with you, you will soon hit the real issue with using AI here.

Most websites serve a huge amount of content per page, which you need to pass to the AI somehow so it can do its magic. This burns through tokens like fire through a haystack, quickly building up high costs!

Unless money is irrelevant to you, you will try to find cheaper approaches, and that's why I made Scrapling 😄

Scrapling got you covered

After years of working in Web Scraping and manually scraping hundreds, if not thousands, of websites with Python spiders, I got tired of maintaining spiders and dealing with the same repeating issues we all face in this field.

So, 8 months ago, I took the first step and camped at home for ~50 days, doing nothing other than finishing my Web Scraping job and then working on Scrapling for the rest of the day. Together, my job and Scrapling took 8-14 hours daily, and I rewrote the first version more than 5 times to get better performance and an easier API. In the end, it was worth it: Scrapling version 0.1 was born, and we are at version 0.2.99 as I write this 😄

Scrapling is an undetectable, high-performance, intelligent Web Scraping library for Python that makes Web Scraping as easy and effortless as it should be (or should I say, as it used to be?). The goal is to provide powerful features while maintaining simplicity and minimal boilerplate code.

Scrapling can deal with almost all the issues you will face during Web Scraping, and upcoming updates will carefully cover the rest.

Also, did I mention that Scrapling's parsing engine is 400-600 times faster than BeautifulSoup in benchmarks, while offering more features, using less memory, and keeping a very similar API? Oh, I think I just did, but that's a subject for another time 😄

Below, we will cover how to install Scrapling, then how it solves each of the issues above.

Installation

Scrapling is a breeze to get started with! Starting from version 0.2.9, we require at least Python 3.9 to work.

Run this command to install it with Python's pip.

pip3 install scrapling

You are ready if you plan to use the parser only (the Adaptor class).

But if you are going to make requests or fetch pages with Scrapling, then run this command to install the browser dependencies needed to use the Fetchers:

scrapling install
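
To make sure everything is wired up, here is a minimal sanity-check sketch (example.com is just a stand-in URL):

from scrapling import Adaptor, Fetcher

# Parser-only usage: feed Adaptor raw HTML you already have
page = Adaptor('<html><body><h1>Hi</h1></body></html>')
print(page.css_first('h1::text'))  # -> 'Hi'

# Fetcher usage: download a page, then parse it with the same API
page = Fetcher.get('https://example.com')
print(page.status)  # -> 200 if the request succeeded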

Solving issue T1: Rapidly changing website structures

One of Scrapling's most powerful features is Automatch. It allows your scraper to survive website changes by intelligently tracking and relocating elements.

While Web Scraping, if you have automatch enabled, you can save any element's unique location properties to find it again later if the website's structure changes. The most frustrating thing about changes is that anything about an element can change, so there's nothing to rely on.

Here's how the automatch feature works: it stores everything unique about an element's location in the DOM, and when the website structure changes, it returns the element with the highest similarity score to the saved properties.

I will give you two examples of how to use it to hammer the idea home.

Let's say you are scraping a page with a structure like this:

class="container">
     class="products">
         class="product" id="p1">
            Product 1
             class="description">Description 1
        
         class="product" id="p2">
            Product 2
             class="description">Description 2

And you want to scrape the first product, the one with the p1 ID. You will probably write a selector like this

page.css('#p1')

When website owners implement structural changes like

class="new-container">
     class="product-wrapper">
         class="products">
             class="product new-class" data-id="p1">
                 class="product-info">
                    Product 1
                     class="new-description">Description 1
                
            
             class="product new-class" data-id="p2">
                 class="product-info">
                    Product 2
                     class="new-description">Description 2

The selector will no longer function, and your code needs maintenance. That's where Scrapling's auto-matching feature comes into play.

With Scrapling, you can enable the automatch feature the first time you select an element. Scrapling remembers the element's unique properties, so the next time you select it and it no longer exists, Scrapling searches the website for the element with the highest similarity score to the saved one, all without AI 😄

from scrapling import Adaptor, Fetcher
# Before the change
page = Adaptor(page_source, auto_match=True, url='example.com')
# or
Fetcher.auto_match = True
page = Fetcher.get('https://example.com')
# then
element = page.css('#p1', auto_save=True)
if not element:  # One day website changes?
    element = page.css('#p1', auto_match=True)  # Scrapling still finds it!
# the rest of your code...

Let me show you a real-world scenario. But wait: to do this, we would need to find a website that will soon change its design/structure, take a copy of its source, and then wait for it to make the change. Of course, that's nearly impossible to predict unless I know the website's owner, but that would make it a staged test, haha.

To solve this, I will use The Web Archive's Wayback Machine. Here is a copy of Stack Overflow's website from 2010; pretty old, eh?

Let's test if the automatch feature can extract the same button in the old design from 2010 and the current design using the same selector 😄

If I want to extract the Questions button from the old design, I can use a selector like this: #hmenus > div:nth-child(1) > ul > li:nth-child(1) > a. This selector is overly specific because it was generated by Google Chrome's DevTools.

Now, let's test the same selector in both versions

>>> from scrapling import Fetcher
>>> selector = '#hmenus > div:nth-child(1) > ul > li:nth-child(1) > a'
>>> old_url = "https://web.archive.org/web/20100102003420/http://stackoverflow.com/"
>>> new_url = "https://stackoverflow.com/"
>>> Fetcher.configure(auto_match=True, automatch_domain='stackoverflow.com')
>>>
>>> page = Fetcher.get(old_url, timeout=30)
>>> element1 = page.css_first(selector, auto_save=True)
>>>
>>> # Same selector, but used on the updated website
>>> page = Fetcher.get(new_url)
>>> element2 = page.css_first(selector, auto_match=True)
>>>
>>> if element1.text == element2.text:
...     print('Scrapling found the same element in the old and new designs!')
Scrapling found the same element in the old and new designs!

Note that I used a new argument called automatch_domain. This is because, to Scrapling, these are two different domains (archive.org and stackoverflow.com), so it would normally isolate their auto-match data. Passing a custom domain to save the auto-match data under for both requests tells Scrapling they are the same website, so it doesn't isolate them.

In a real-world scenario, the code will be the same, except it will use the same URL for both requests, so you won't need the automatch_domain argument. This is the closest example I can give to a real-world case, so I hope it didn't confuse you 😄

The rest of the details are on the automatch page on Scrapling's documentation.

Solving issue T2: Unstable selectors

If you have been doing Web Scraping for long enough, you have likely experienced this at least once: a website that uses poor design patterns, is built on plain HTML without any IDs or classes, uses random class names that change frequently, has no identifiers or attributes to rely on, and the list goes on!

In these cases, standard CSS/XPath selectors won't be optimal, which is why Scrapling provides 3 more selection methods:

  1. Selection by element content — through text content (find_by_text) or a regex that matches text content (find_by_regex)
  2. Selecting elements similar to another element - You find an element, and we will do the rest!
  3. Selecting elements by filters - You just specify conditions that this element must fulfill!

These selection methods deserve separate articles, so I won't explain them all here; click the links to the documentation above and take a deep dive! In the meantime, a quick sketch follows below.
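
To give a quick taste of all three approaches, here is a minimal sketch; the quotes.toscrape.com URL and the exact keyword arguments are illustrative assumptions, so check the documentation for the authoritative signatures:

from scrapling import Fetcher

page = Fetcher.get('https://quotes.toscrape.com/')  # demo site, used here as an assumption

# 1. Selection by content: exact text, or a regex over text content
next_link = page.find_by_text('Next', first_match=True)
prices = page.find_by_regex(r'£[\d\.,]+')

# 2. Find one element, then grab the elements most similar to it
quote = page.css_first('.quote')
all_quotes = quote.find_similar()

# 3. Selection by filters: pass conditions the element must fulfill
quote_divs = page.find_all('div', class_='quote')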

Solving issue T3: Increasingly complex anti-bot measures

It's well known that making an undetectable spider takes more than residential/mobile proxies and human-like behavior; it also needs a hard-to-detect browser, and Scrapling provides two main options for that:

  1. PlayWrightFetcher — This fetcher provides not only a stealth mode suitable for small-to-medium protections but also more flexible options, like using your real browser.
  2. StealthyFetcher — Because we live in a harsh world and you need full measures instead of half measures, StealthyFetcher was born. This fetcher uses a modified Firefox browser called Camoufox that passes almost all known tests, and it adds more tricks on top.

The links will redirect you to the documentation of these two classes. Both classes will be improved a lot with the upcoming updates, so stay tuned 😄
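
As a rough sketch of how both fetchers are used (browserscan.net is just a convenient bot-detection test page, and the exact keyword arguments are assumptions to verify against each class's documentation):

from scrapling import PlayWrightFetcher, StealthyFetcher

# Heavier protections: Camoufox-based stealth browsing, headless by default
page = StealthyFetcher.fetch('https://www.browserscan.net/bot-detection')
print(page.status)  # 200 means we got through

# Small-to-medium protections: Playwright with stealth tweaks enabled
page = PlayWrightFetcher.fetch('https://example.com', stealth=True)
print(page.css_first('h1::text'))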

Solving issues B1 & B2: Extreme Website Diversity / Identifying Relevant Data

This one is tough to handle, but it's possible with Scrapling's flexibility.

I talked with someone who uses AI to extract prices from different websites. He is only interested in prices and titles, so he uses AI to find the price for him.

I told him he didn't need AI here and gave this code as an example:

price_element = page.find_by_regex(r'£[\d\.,]+', first_match=True)  # First element whose text matches a price pattern, e.g. £10.50
# If you want the container/element that contains the price element
price_element_container = price_element.parent or price_element.find_ancestor(lambda ancestor: ancestor.has_class('product'))  # or other methods...
target_element_selector = price_element_container.generate_css_selector  # or generate_full_css_selector, or the XPath equivalents

Then he asked: what about cases like this?

<p>
    <span class='currency'>$</span>
    <span class='a-price'>45,000</span>
</p>

So, I updated the code like this

price_element_container = page.find_by_regex(r'[\d,]+', first_match=True).parent # Adjusted the regex for this example
full_price_data = price_element_container.get_all_text(strip=True)  # Returns '$45,000' in this case

This was enough for his use case. You can try the first regex and, if it doesn't find anything, fall back to the next one, and so on. Cover the most common patterns first, then the less common ones. It will be a bit tedious, but it's definitely cheaper than AI.
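
A minimal sketch of that fallback chain might look like this (the patterns and the product URL are illustrative, not exhaustive):

from scrapling import Fetcher

# Ordered from most to least common; extend as you meet new formats
PRICE_PATTERNS = [
    r'[£$€][\d\.,]+',     # symbol before the number: £10.50, $45,000
    r'[\d\.,]+\s?[£$€]',  # symbol after the number: 10,50 €
    r'\d[\d,]*\.\d{2}',   # bare decimal amounts: 45,000.00
]

def find_price(page):
    for pattern in PRICE_PATTERNS:
        element = page.find_by_regex(pattern, first_match=True)
        if element:
            # Use the parent so split currency/amount markup is captured too
            return element.parent.get_all_text(strip=True)
    return None  # No known pattern matched this page

page = Fetcher.get('https://example.com/product')  # placeholder URL
print(find_price(page))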

This example demonstrates the idea I wanted to deliver here: not every challenge needs AI to be solved. Sometimes you need to be creative instead, and that might save you a lot of money :)

Solving issue B3: Pagination variations

Scrapling currently doesn't have a direct method to extract pagination URLs for you automatically, but one will be added in upcoming updates 😄

But you can handle most websites if you search for the most common patterns with page.find_by_text('Next').attrib['href'] or page.find_by_text('load more').attrib['href'], or with selectors like 'a[href*="?page="]' and 'a[href*="/page/"]'; you get the idea.
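
Putting those patterns together, a hedged sketch of a next-page helper could look like this (the pattern lists are just a starting point):

from scrapling import Fetcher

def next_page_url(page):
    # Text-based patterns first: 'Next' links and 'load more' buttons
    for text in ('Next', 'load more'):
        element = page.find_by_text(text, first_match=True)
        if element is not None and 'href' in element.attrib:
            return element.attrib['href']
    # Then structural patterns in the link URLs themselves
    for selector in ('a[href*="?page="]', 'a[href*="/page/"]'):
        element = page.css_first(selector)
        if element is not None:
            return element.attrib['href']
    return None  # No recognizable pagination found

page = Fetcher.get('https://example.com/products')  # placeholder URL
print(next_page_url(page))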

Cost Comparison and Savings

For a quick comparison:

Aspect | Scrapling | AI-Based Tools (e.g., Browse AI, Oxylabs)
--- | --- | ---
Cost Structure | Free and open source, no per-use fees | Starts at $19/month (Browse AI) to $49/month (Oxylabs), scales with usage
Setup Effort | Requires technical expertise, manual setup | Often no-code, easier for non-technical users
Scalability | Depends on user implementation | Built-in support for large-scale, managed services
Adaptability | High, with features like automatch | High and automatic with AI, but costly for frequent changes

This table is based on pricing from Browse AI Pricing and Oxylabs Web Scraper API Pricing.

Conclusion

No challenge remains challenging for long; it depends on how you look at it and how you decide to solve it. Will you go for the maybe-easier but expensive solution, or will you stay with the challenge longer until you find a better one? It all depends on you. I always like to remember how DeepSeek initially beat OpenAI with fewer resources by thinking up more efficient solutions. Sometimes, it's like that 😄

In the end, nothing is perfect, so if you find an issue in Scrapling, please don't hesitate to report it. Scrapling is under heavy development (the next update is going to be insane; you'd better join our Discord server to try beta features before anyone else!)

I hope you liked the article, and please let me know if you have any feedback!

Disclaimer: This article is an improved and expanded version of the original article written by me here