How To Develop a New Scraper

Under Construction

This section is being updated. Some information may be outdated or inaccurate.

Find a website

First, check if the website is already supported:

from recipe_scrapers import SCRAPERS
# Check if site is supported
print(SCRAPERS.get("bbcgoodfood.com"))

Track Your Progress

Create an issue to track your work.

Setup Repository

Fork the recipe-scrapers repository on GitHub and follow these steps:

Quick Setup

# Clone your fork
git clone https://github.com/YOUR-USERNAME/recipe-scrapers.git
cd recipe-scrapers

# Set up Python environment
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
python -m pip install --upgrade pip
pip install -e ".[all]"

Create a new branch:

git checkout -b site/website-name

Run Tests

python -m unittest

# Optional: Parallel testing
pip install unittest-parallel
unittest-parallel --level test

Generate Scraper Files

1. Select Recipe URL

Recipe Selection

Choose a recipe with multiple instructions when possible. Single-instruction recipes may indicate parsing errors, unless explicitly handled.
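As a quick sanity check before settling on a recipe, you can count the parsed instruction steps. This is a minimal sketch using only the standard library; it assumes you already have the scraper's instructions() text, where steps are separated by newlines:

```python
def count_steps(instructions: str) -> int:
    """Count non-empty instruction lines (steps are newline-separated)."""
    return len([line for line in instructions.splitlines() if line.strip()])

# A recipe that collapses to a single step may signal a parsing error.
print(count_steps("Preheat oven.\nMix ingredients.\nBake 30 minutes."))  # 3
```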

2. Check Schema Support

Test if the site uses Recipe Schema:

from urllib.request import urlopen
from recipe_scrapers import scrape_html

url = "https://example.com/your-recipe"
html = urlopen(url).read().decode("utf-8")

scraper = scrape_html(html, url, wild_mode=True)
print(scraper.schema.data)  # Empty dict if schema not supported
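Even when schema.data comes back non-empty, it can still be incomplete. As a rough check, you can diff it against common schema.org Recipe properties; the helper below is a hypothetical sketch that treats schema.data as a plain dict:

```python
# Common schema.org Recipe properties (not an exhaustive or official checklist).
RECIPE_FIELDS = (
    "name", "recipeIngredient", "recipeInstructions",
    "totalTime", "recipeYield", "image",
)

def missing_fields(schema_data: dict) -> list:
    """List common Recipe properties absent from a parsed schema dict."""
    return [field for field in RECIPE_FIELDS if field not in schema_data]

print(missing_fields({"name": "Carrot Cake", "recipeIngredient": ["carrots"]}))
# ['recipeInstructions', 'totalTime', 'recipeYield', 'image']
```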

3. Generate Files

python generate.py <ClassName> <URL>

<URL> should be the recipe page you selected in the first step. The script downloads this recipe and uses it to create the initial test data.

This creates:

  • Scraper file in recipe_scrapers/
  • Test files in tests/test_data/<host>/
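The generated scraper file is typically a small class extending the library's AbstractScraper. The sketch below is illustrative only: a stub base class stands in for recipe_scrapers' AbstractScraper so the snippet is self-contained, and the real generated template may differ:

```python
class AbstractScraper:
    """Stub standing in for recipe_scrapers' AbstractScraper, for illustration only."""
    def __init__(self, html: str, url: str):
        self.page_data = html
        self.url = url

class WebsiteName(AbstractScraper):
    """Sketch of a generated scraper class; methods get filled in during implementation."""
    @classmethod
    def host(cls):
        # Should match the domain used in tests/test_data/<host>/
        return "website-name.com"

print(WebsiteName.host())  # website-name.com
```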

Implementation

If the site provides Recipe Schema data, the generated scraper usually works with little or no modification. Check its output interactively:

from recipe_scrapers import scrape_html

scraper = scrape_html(html, url)
print(scraper.title())
print(scraper.ingredients())

If a field is missing or wrong, override the corresponding method in your scraper class and parse the HTML directly, for example:

def title(self):
    return self.soup.find("h1").get_text()

Testing

1. Update Test Data

Edit tests/test_data/<host>/test.json:

{
    "host": "<host>",
    "canonical_url": "...",
    "site_name": "...",
    "author": "...",
    "language": "...",
    "title": "...",
    "ingredients": "...",
    "instructions_list": "...",
    "total_time": "...",
    "yields": "...",
    "image": "...",
    "description": "..."
}
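Before running the tests, you can sanity-check the file with a short standard-library script. The required key list below simply mirrors the fields shown above; it is an assumption for illustration, not an official list:

```python
import json

# Assumed required fields, mirroring the test.json template above.
REQUIRED_KEYS = {
    "host", "canonical_url", "site_name", "author", "language", "title",
    "ingredients", "instructions_list", "total_time", "yields", "image", "description",
}

def missing_keys(test_json_text: str) -> set:
    """Return any required fields absent from a test.json document."""
    return REQUIRED_KEYS - set(json.loads(test_json_text))

# Example: a document missing "author" and "image"
sample = (
    '{"host": "example.com", "canonical_url": "", "site_name": "", '
    '"language": "", "title": "", "ingredients": [], "instructions_list": [], '
    '"total_time": 0, "yields": "", "description": ""}'
)
print(sorted(missing_keys(sample)))  # ['author', 'image']
```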

Test Data Population Help

The HTML file generated by generate.py can help you fill in the required fields in the test JSON file:

from pathlib import Path
from recipe_scrapers import scrape_html
import json

html = Path("tests/test_data/<host>/<TestFileName>.testhtml").read_text(encoding="utf-8")
scraper = scrape_html(html, "<URL>")
print(json.dumps(scraper.to_json(), indent=2, ensure_ascii=False))

This prints the scraper's output to your terminal for reference.

2. Run Tests

python -m unittest -k <classname>

Replace <classname> with your scraper's class name in lowercase.

Edge Cases

Test with multiple recipes to catch potential edge cases.

Submit Changes

  1. Commit your work:

    git add -p  # Review changes
    git commit -m "Add scraper for example.com"
    git push origin site/website-name
    

  2. Create a pull request at recipe-scrapers