Web Scraping Automation with Scrapy and Selenium

Learn how to combine Scrapy and Selenium for powerful web scraping automation. Extract data from static and dynamic websites with practical Python examples.

Modern websites present a challenge for data extraction. Some load content immediately, while others wait for JavaScript to render everything. You need different tools for different situations. Scrapy handles static content fast. Selenium manages dynamic pages that need browser interaction. Together, they create a powerful scraping system.

This tutorial shows you how to build an automated scraping pipeline. You’ll start with a basic Scrapy spider, add Selenium for JavaScript-heavy pages, and combine both tools for maximum flexibility. The examples use real-world scenarios you can adapt to your projects.

Prerequisites

Before writing any scraping code, install the required packages. Create a new Python virtual environment to keep dependencies isolated.

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Scrapy, Selenium, and the WebDriver manager:

pip install scrapy selenium webdriver-manager

The webdriver-manager package automatically downloads and configures browser drivers. This saves you from manually installing ChromeDriver or GeckoDriver. For Chrome users, you’ll also need Chrome or Chromium installed on your system.

Verify your installation:

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

print(f"Scrapy version: {scrapy.__version__}")
print("Selenium ready")

If this runs without errors, you’re set to build your first scraper.

Step 1: Your First Scrapy Spider

Scrapy excels at crawling static websites fast. It sends HTTP requests and parses HTML responses efficiently. A single spider can scrape thousands of pages in minutes.

Create a new Scrapy project:

scrapy startproject bookstore
cd bookstore

This generates a project structure with settings, middlewares, and a spiders folder. Navigate to bookstore/spiders/ and create books_spider.py:

import scrapy

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']
    
    def parse(self, response):
        for book in response.css('article.product_pod'):
            yield {
                'title': book.css('h3 a::attr(title)').get(),
                'price': book.css('.price_color::text').get(),
                'availability': book.css('.availability::text').re_first(r'\w+'),  # re_first avoids IndexError on empty matches
            }
        
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

This spider visits a book catalog, extracts titles, prices, and stock status, then follows pagination links. CSS selectors target specific HTML elements. The ::attr() pseudo-element grabs attributes, while ::text gets text content.

Run the spider and save results to JSON:

scrapy crawl books -o books.json

You’ll see log output showing requests and responses. Scrapy processes pages concurrently by default (16 simultaneous requests). The -o flag outputs data to a file in JSON format. You can also use CSV or XML.
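Instead of passing -o on every run, output and concurrency can live in settings.py. A minimal sketch using the FEEDS setting (this requires a reasonably recent Scrapy; FEEDS arrived in 2.1 and the overwrite flag somewhat later):

```python
# settings.py -- tune concurrency and declare exports once
CONCURRENT_REQUESTS = 16   # Scrapy's default; lower this to be gentler
FEEDS = {
    'books.json': {'format': 'json', 'overwrite': True},
    'books.csv': {'format': 'csv'},
}
```

With this in place, a plain scrapy crawl books writes both files without any command-line flags.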

Check your results:

cat books.json | python -m json.tool | head -20

This spider works because the target site serves complete HTML. No JavaScript execution required. But many modern sites don’t work this way.

Step 2: Handling Dynamic Content with Selenium

Social media feeds, infinite scrollers, and single-page applications load content via JavaScript. When you inspect the page source, you find empty containers. The data appears only after scripts execute in a browser.

Selenium automates a real browser. It waits for JavaScript to run, clicks buttons, fills forms, and extracts the rendered content. Here’s a standalone Selenium script that scrapes dynamic content:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
import time

# Configure Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')  # Run without GUI
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# Initialize driver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=options)

try:
    driver.get('https://example-spa.com/products')
    
    # Wait for JavaScript to load content
    wait = WebDriverWait(driver, 10)
    products = wait.until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'product-card'))
    )
    
    # Scroll to trigger lazy loading
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    
    # Extract data
    for product in driver.find_elements(By.CLASS_NAME, 'product-card'):
        title = product.find_element(By.TAG_NAME, 'h2').text
        price = product.find_element(By.CLASS_NAME, 'price').text
        print(f"{title}: {price}")
        
finally:
    driver.quit()

This script runs Chrome in headless mode (no visible window). The WebDriverWait ensures elements exist before trying to extract them. Explicit waits prevent errors from timing issues.

Selenium handles interactions that Scrapy can’t:

  • Click “Load More” buttons
  • Fill and submit search forms
  • Wait for AJAX requests to complete
  • Handle infinite scroll
  • Navigate multi-page wizards

But Selenium has downsides. It’s slower than Scrapy because it loads full browsers. Memory usage is higher. Running many parallel browsers overwhelms most systems.

Step 3: Data Extraction and Storage

Both Scrapy and Selenium need structured data storage. Dumping everything to JSON works for small projects, but larger operations require databases.

Scrapy includes built-in Item classes for structured data. Define your data schema in items.py:

import scrapy

class ProductItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    url = scrapy.Field()
    description = scrapy.Field()
    stock = scrapy.Field()
    scraped_at = scrapy.Field()

Update your spider to use Items:

from datetime import datetime
from bookstore.items import ProductItem

class BooksSpider(scrapy.Spider):
    name = 'books'
    start_urls = ['http://books.toscrape.com/']
    
    def parse(self, response):
        for book in response.css('article.product_pod'):
            item = ProductItem()
            item['title'] = book.css('h3 a::attr(title)').get()
            item['price'] = book.css('.price_color::text').get()
            item['url'] = response.urljoin(book.css('h3 a::attr(href)').get())
            item['stock'] = book.css('.availability::text').re_first(r'\w+')  # re_first avoids IndexError on empty matches
            item['scraped_at'] = datetime.now().isoformat()
            yield item

For database storage, create a Scrapy pipeline. Edit pipelines.py:

import sqlite3
from datetime import datetime

class SQLitePipeline:
    def open_spider(self, spider):
        self.connection = sqlite3.connect('products.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT,
                price TEXT,
                url TEXT UNIQUE,
                stock TEXT,
                scraped_at TEXT
            )
        ''')
        self.connection.commit()
    
    def close_spider(self, spider):
        self.connection.close()
    
    def process_item(self, item, spider):
        try:
            self.cursor.execute('''
                INSERT OR REPLACE INTO products (title, price, url, stock, scraped_at)
                VALUES (?, ?, ?, ?, ?)
            ''', (
                item.get('title'),
                item.get('price'),
                item.get('url'),
                item.get('stock'),
                item.get('scraped_at')
            ))
            self.connection.commit()
        except sqlite3.Error as e:
            spider.logger.error(f"Database error: {e}")
        return item

Enable the pipeline in settings.py:

ITEM_PIPELINES = {
    'bookstore.pipelines.SQLitePipeline': 300,
}

The number (300) sets processing order. Lower numbers run first. Multiple pipelines can clean, validate, and store data in sequence.
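As an example of sequencing, a small deduplication pipeline can run before storage by giving it a lower number. A sketch (the try/except fallback just lets the snippet run standalone; in a project you would import DropItem directly):

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # allows running this sketch without Scrapy installed
    class DropItem(Exception):
        pass

class DuplicatesPipeline:
    """Drops items whose URL was already seen during this crawl."""

    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get('url')
        if url in self.seen_urls:
            # Raising DropItem stops this item from reaching later pipelines
            raise DropItem(f"Duplicate item: {url}")
        self.seen_urls.add(url)
        return item
```

Register it at a lower number than the SQLite pipeline, e.g. 'bookstore.pipelines.DuplicatesPipeline': 200, so duplicates never reach the database.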

For Selenium scrapers, use the same database code directly:

import sqlite3
from datetime import datetime

def save_to_db(products):
    conn = sqlite3.connect('products.db')
    cursor = conn.cursor()
    
    for product in products:
        cursor.execute('''
            INSERT OR REPLACE INTO products (title, price, url, scraped_at)
            VALUES (?, ?, ?, ?)
        ''', (product['title'], product['price'], product['url'], datetime.now().isoformat()))
    
    conn.commit()
    conn.close()

SQLite works well for projects with under 100,000 records. For larger datasets, consider PostgreSQL or MongoDB.

Step 4: Combining Scrapy and Selenium

Many websites mix static and dynamic content. Product listings load server-side, but reviews appear via JavaScript. The homepage renders immediately, but search results need AJAX. You want Scrapy’s speed for static parts and Selenium’s power for dynamic sections.

The scrapy-selenium middleware integrates both tools. Install it:

pip install scrapy-selenium

Configure Scrapy to use Selenium in settings.py:

from shutil import which

SELENIUM_DRIVER_NAME = 'chrome'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('chromedriver')
SELENIUM_DRIVER_ARGUMENTS = ['--headless', '--no-sandbox']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

Now create a spider that uses both approaches. This example scrapes a site where product lists are static but individual product pages load reviews dynamically:

import time

import scrapy
from scrapy_selenium import SeleniumRequest
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

class HybridSpider(scrapy.Spider):
    name = 'hybrid'
    
    def start_requests(self):
        # Use a regular Scrapy request for the static listing page
        yield scrapy.Request(
            'http://example.com/products',
            callback=self.parse_listing
        )
    
    def parse_listing(self, response):
        # Extract product URLs from static HTML
        for url in response.css('.product-link::attr(href)').getall():
            # Switch to Selenium for the dynamic product detail pages
            yield SeleniumRequest(
                url=response.urljoin(url),
                callback=self.parse_product,
                wait_time=3,
                wait_until=EC.presence_of_element_located((By.CLASS_NAME, 'reviews'))
            )
    
    def parse_product(self, response):
        # The live Selenium driver is available in response.meta
        driver = response.meta['driver']
        
        # Click "Show More Reviews" if it exists. The response body was
        # captured before this click, so read the extra reviews from the
        # driver, not from response.css()
        try:
            driver.find_element(By.CLASS_NAME, 'load-reviews-btn').click()
            time.sleep(2)  # give the extra reviews time to render
        except NoSuchElementException:
            pass  # no button on pages with few reviews
        
        # Now extract data from the fully rendered page
        reviews = [el.text for el in driver.find_elements(By.CLASS_NAME, 'review-text')]
        yield {
            'title': response.css('h1.product-title::text').get(),
            'price': response.css('.price::text').get(),
            'rating': response.css('.rating::attr(data-score)').get(),
            'reviews': reviews,
            'review_count': len(reviews),
        }

This spider uses regular Scrapy requests for fast list crawling, then switches to Selenium only when needed. The wait_until parameter waits for specific elements before extracting data.

Another pattern: use Scrapy to discover URLs, then process them with Selenium in batches:

class DiscoverAndProcessSpider(scrapy.Spider):
    name = 'discover_process'
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.urls_to_process = []
    
    def start_requests(self):
        yield scrapy.Request('http://example.com/sitemap', callback=self.collect_urls)
    
    def collect_urls(self, response):
        self.urls_to_process = response.css('a.dynamic-page::attr(href)').getall()
        # Trigger Selenium processing after collection
        for url in self.urls_to_process:
            yield SeleniumRequest(url=url, callback=self.parse_dynamic)
    
    def parse_dynamic(self, response):
        # Process with full browser rendering
        pass

This separation keeps your spiders efficient. Don’t use Selenium for pages that don’t need it.

Step 5: Scheduling and Monitoring Your Scrapers

Production scrapers run on schedules. You want fresh data daily, hourly, or weekly without manual intervention. Linux/Mac systems use cron. Windows uses Task Scheduler.

Create a shell script run_scraper.sh:

#!/bin/bash
cd /path/to/bookstore
source venv/bin/activate
scrapy crawl books -o "data/books_$(date +%Y%m%d_%H%M%S).json"
deactivate

Make it executable:

chmod +x run_scraper.sh

Schedule it with cron to run daily at 2 AM:

crontab -e

Add this line:

0 2 * * * /path/to/run_scraper.sh >> /path/to/logs/scraper.log 2>&1

This redirects output to a log file for debugging. Check logs regularly to catch errors.

For more control, use scrapyd, a daemon for running Scrapy spiders:

pip install scrapyd scrapyd-client

Start the daemon:

scrapyd

Deploy your project:

scrapyd-deploy

Schedule a spider via HTTP API:

curl http://localhost:6800/schedule.json -d project=bookstore -d spider=books

Scrapyd provides a web interface at http://localhost:6800 showing running jobs, completed tasks, and logs. You can schedule spiders programmatically and monitor status remotely.

Add monitoring to catch failures. Send notifications when scrapers break:

class BooksSpider(scrapy.Spider):
    name = 'books'
    custom_settings = {
        'CLOSESPIDER_ERRORCOUNT': 10,  # Stop after 10 errors
        'HTTPERROR_ALLOW_ALL': True,   # Let non-200 responses reach parse()
    }
    
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.error_count = 0
    
    def parse(self, response):
        if response.status != 200:
            self.error_count += 1
            self.logger.warning(f"Failed request: {response.url}")
            
            if self.error_count > 5:
                self.send_alert("Multiple failures detected")
            return
    
    def send_alert(self, message):
        # Send an email, Slack message, etc.
        # smtplib works for email notifications
        pass

Log important metrics:

class StatsMiddleware:
    def process_response(self, request, response, spider):
        spider.crawler.stats.inc_value('pages_scraped')
        if response.status != 200:
            spider.crawler.stats.inc_value('failed_requests')
        return response

Access stats after scraping:

stats = spider.crawler.stats.get_stats()
print(f"Scraped {stats.get('item_scraped_count', 0)} items")
print(f"Failed {stats.get('failed_requests', 0)} requests")

Set up alerts for failed jobs, low item counts, or unusual patterns.
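One way to wire those alerts up is a small Scrapy extension that inspects the stats when the spider closes. A sketch (the thresholds and the AlertExtension name are illustrative, not Scrapy built-ins):

```python
def should_alert(stats, min_items=1, max_failures=0):
    """Flag a crawl as unhealthy from its stats dictionary."""
    items = stats.get('item_scraped_count', 0)
    failures = stats.get('failed_requests', 0)
    return items < min_items or failures > max_failures

class AlertExtension:
    """Checks crawl health on spider_closed and logs (or sends) an alert."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        from scrapy import signals  # imported here so should_alert above
        ext = cls(crawler.stats)    # stays usable without Scrapy installed
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider):
        if should_alert(self.stats.get_stats(), min_items=10):
            spider.logger.error("Crawl looks unhealthy -- send your alert here")
```

Enable it via EXTENSIONS in settings.py (e.g. 'bookstore.extensions.AlertExtension': 500, assuming you save it as extensions.py in the project).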

Common Pitfalls

Web scraping has legal and technical risks. Many beginners hit the same problems. Avoid these issues:

Ignoring robots.txt: This file tells crawlers which pages they can access. Scrapy respects it by default. Check it manually:

curl http://example.com/robots.txt

If you see Disallow: /, don’t scrape that site. Violating robots.txt can lead to IP bans or legal action.
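You can also check rules programmatically with Python's built-in parser. A sketch that evaluates a URL against robots.txt text you have already fetched:

```python
from urllib import robotparser

def allowed(robots_txt, url, user_agent='*'):
    """Check a URL against robots.txt rules using the stdlib parser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Feed it the text of a site's robots.txt and it tells you whether a given path is fair game for your crawler's user agent.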

Hammering servers: Sending 100 requests per second hurts target sites and gets you blocked. Set delays in Scrapy:

# In settings.py
DOWNLOAD_DELAY = 2  # 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True  # Add randomness
CONCURRENT_REQUESTS = 8  # Limit parallel requests

For Selenium, add randomized pauses between actions (note this is a plain sleep, not one of Selenium's explicit waits):

import random
import time

time.sleep(random.uniform(1, 3))  # Random delay between 1 and 3 seconds

Not handling pagination properly: Many sites use infinite scroll or “Load More” buttons. Missing these means incomplete data. For infinite scroll:

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

Skipping error handling: Networks fail. Sites change structure. Timeouts happen. Add try-except blocks:

def parse(self, response):
    try:
        title = response.css('h1.title::text').get()
        if not title:
            self.logger.warning(f"No title found: {response.url}")
            return
        yield {'title': title}
    except Exception as e:
        self.logger.error(f"Parse error at {response.url}: {e}")

Hardcoding selectors: CSS selectors break when sites redesign. Use multiple fallback selectors:

title = (
    response.css('h1.product-title::text').get() or
    response.css('h1.item-name::text').get() or
    response.xpath('//h1[@class="title"]/text()').get() or
    "Unknown"
)

Not using user agents: Requests that announce a default Python client are often blocked. Set a realistic user agent:

# In settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'

Rotate user agents to reduce the chance of detection:

import random
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)...',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)...',
    'Mozilla/5.0 (X11; Linux x86_64)...',
]

# In spider
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(
            url,
            headers={'User-Agent': random.choice(USER_AGENTS)}
        )

Storing credentials in code: Never hardcode API keys or passwords. Use environment variables:

import os
API_KEY = os.getenv('SCRAPING_API_KEY')

Not validating data: Scraped data has errors. Missing values, wrong formats, duplicates. Validate before storing:

from scrapy.exceptions import DropItem

class ValidationPipeline:
    def process_item(self, item, spider):
        if not item.get('title'):
            raise DropItem("Missing title")
        price = (item.get('price') or '').replace('£', '').replace('.', '')
        if not price.isdigit():
            raise DropItem("Invalid price format")
        return item

Remember: just because you can scrape something doesn’t mean you should. Check terms of service. Some sites explicitly forbid automated access. Others require attribution. Respect these rules.

Summary

You now have tools for both static and dynamic web scraping. Scrapy handles large-scale crawling efficiently. Selenium manages JavaScript-heavy sites that need browser interaction. Combining them gives you flexibility to handle any scraping scenario.

Start with Scrapy for speed. Add Selenium only when JavaScript rendering is necessary. Store data in structured formats. Schedule scrapers to run automatically. Monitor for failures and adapt to site changes.

The examples here used public test sites and ethical scraping practices. Apply these patterns to your projects, but always check legal requirements and terms of service first. Web scraping is a tool, and like any tool, it requires responsible use.

Your next steps: build a spider for a site you need data from, test it thoroughly with small batches, then scale up with proper scheduling and monitoring. The code patterns here work for most common scenarios, but you’ll adapt them as you encounter new challenges.
