In Depth Guide: HTML Scraping
Under Construction
This section is being updated. Some information may be outdated or inaccurate.
The preferred method of scraping recipe information from a web page is to use the schema.org Recipe data. This is a machine readable, structured format intended to provide a standardised method of extracting information. However, whilst most recipe websites use the schema.org Recipe format, not all do, and for those websites that do, it does not always include all the information we are looking for. In these cases, we can use HTML scraping to extract the information from the HTML markup.
soup
Each scraper has a BeautifulSoup
object that can be accessed using the self.soup
attribute.
The BeautifulSoup
object is a representation of the web page HTML that has been parsed into a
format that we can query and extract information from.
The Beautiful Soup documentation is the best resource for learning how to use BeautifulSoup
objects to interact with HTML documents.
This guide covers a number of common patterns that are used in this library.
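To make the idea concrete, here is a minimal, self-contained sketch of a class that holds a parsed page in self.soup. MiniScraper and the sample HTML are hypothetical stand-ins invented for this illustration; in this library the soup is set up for you by the scraper base class.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<html>
  <head><title>Jackfruit Curry</title></head>
  <body><h1 class="title">Jackfruit Curry</h1></body>
</html>
"""


class MiniScraper:
    """Hypothetical stand-in for a scraper class; not part of this library."""

    def __init__(self, html: str):
        # Parse the raw HTML once; methods then query the parsed tree.
        self.soup = BeautifulSoup(html, "html.parser")

    def title(self) -> str:
        # Query the parsed document through self.soup.
        return self.soup.find("h1").get_text()


scraper = MiniScraper(SAMPLE_HTML)
page_title = scraper.title()
```

Every method on the class works against the same parsed tree, so the HTML is only parsed once per page.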
_schema_cls and _opengraph_cls
It should rarely be necessary to override the default behaviour of schema.org and OpenGraph
metadata retrieval; recipe websites should generally adhere to their respective standard formats
when including metadata on their webpages. However, mistakes do happen. If you need to
override the implementations provided by the SchemaOrg and OpenGraph classes, you can subclass
them and add a _schema_cls or _opengraph_cls attribute to your scraper class to instruct
the library to use your subclass instead.
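The override pattern can be sketched as follows. This is a self-contained illustration only: the SchemaOrg and BaseScraper classes below are simplified stand-ins written for this example, and the real classes in this library have different constructors and many more methods. The " | Example Site" suffix is an assumed metadata mistake.

```python
class SchemaOrg:
    """Simplified stand-in for the library's schema.org reader."""

    def __init__(self, data: dict):
        self.data = data

    def title(self) -> str:
        return self.data.get("name", "")


class FixedSchemaOrg(SchemaOrg):
    """Subclass that works around a site-specific metadata mistake."""

    def title(self) -> str:
        # Assumed bug for illustration: the site appends " | Example Site".
        return super().title().split(" | ")[0]


class BaseScraper:
    """Stand-in base class that instantiates whatever _schema_cls names."""

    _schema_cls = SchemaOrg

    def __init__(self, data: dict):
        self.schema = self._schema_cls(data)


class ExampleSiteScraper(BaseScraper):
    # Point the scraper at the subclass instead of the default.
    _schema_cls = FixedSchemaOrg


scraper = ExampleSiteScraper({"name": "Jackfruit Curry | Example Site"})
fixed_title = scraper.schema.title()
```

Because the base class looks the attribute up on self, setting it on the scraper subclass is enough to swap the implementation for that one site.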
Finding a single element
The self.soup.find()
function returns the first element matching the arguments. This is useful if
you are trying to extract some information that should only occur once, for example the prep time
or total time.
# To find a particular element
self.soup.find("h1") # Returns the first h1 element
# To find an element with particular class (note the underscore at the end of class_)
self.soup.find(class_="total-time") # Returns the first element with the total-time class
# To find an element with a particular ID
self.soup.find(id="total-time")
# You can include multiple arguments to be more specific
# To find the first h1 element with "title" class
self.soup.find("h1", class_="title")
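One case the calls above do not show is what happens when nothing matches: find() returns None, so calling a method on the result raises an AttributeError. A minimal sketch of guarding against that, using a small hypothetical HTML snippet:

```python
from bs4 import BeautifulSoup

html = (
    '<div><h1 class="title">Jackfruit Curry</h1>'
    '<span id="total-time">45 minutes</span></div>'
)
soup = BeautifulSoup(html, "html.parser")

total_time_tag = soup.find(id="total-time")
cook_time_tag = soup.find(id="cook-time")  # no such element in this HTML

# Guard against None before calling methods on the result.
total_time_text = (
    total_time_tag.get_text() if total_time_tag is not None else None
)
cook_time_text = (
    cook_time_tag.get_text() if cook_time_tag is not None else None
)
```

Here total_time_text holds the element text, while cook_time_text is None because the element does not exist.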
self.soup.find() returns a bs4.element.Tag object, or None if nothing matches. Usually we just
want the text from the selected element, and the best way to get it is .get_text().
.get_text() returns the text from all child elements, as it would appear in your browser, so
there is no need to iterate through all the children, call .get_text() on each one, then join
the results afterwards.
As an example, consider one of the ingredients in this recipe. The markup looks like this:
<li class="wprm-recipe-ingredient" style="list-style-type: none;" data-uid="0">
<span class="wprm-checkbox-container">
<input type="checkbox" id="wprm-checkbox-1" class="wprm-checkbox" aria-label=" 1 tablespoon coconut oil (or oil of choice)">
<label for="wprm-checkbox-1" class="wprm-checkbox-label">
<span class="sr-only screen-reader-text wprm-screen-reader-text">▢ </span>
</label>
</span>
<span class="wprm-recipe-ingredient-amount">1</span>
<span class="wprm-recipe-ingredient-unit">tablespoon</span>
<span class="wprm-recipe-ingredient-name">coconut oil</span>
<span class="wprm-recipe-ingredient-notes wprm-recipe-ingredient-notes-normal">(or oil of choice)</span>
</li>
We can select this element using its tag and class (we're pretending this recipe only has this one ingredient), and extract the text like so:
ingredient_tag = self.soup.find("li", class_="wprm-recipe-ingredient")
ingredient_text = ingredient_tag.get_text()
# '1 tablespoon coconut oil (or oil of choice)'
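When the markup contains line breaks or indentation between child elements, get_text() keeps that whitespace in its result. The sketch below uses simplified, hypothetical markup (not the page above) to show the default behaviour next to the separator and strip arguments that Beautiful Soup provides.

```python
from bs4 import BeautifulSoup

html = """
<li class="ingredient">
  <span class="amount">1</span>
  <span class="unit">tablespoon</span>
  <span class="name">coconut oil</span>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("li", class_="ingredient")

# Default: strings are concatenated as-is, including newlines and indentation.
raw_text = tag.get_text()

# Strip each string, drop the empty ones, and join the rest with a space.
clean_text = tag.get_text(" ", strip=True)
```

In this sketch clean_text is '1 tablespoon coconut oil', while raw_text still contains the newlines from the markup.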
The Beautiful Soup documentation for find
is here.
Normalizing strings
A convenience function called normalize_string() is provided in the _utils package. This
function converts any characters escaped for HTML to their actual character (e.g. &amp; to &)
and removes unnecessary white space. It is best practice to always use this when extracting
text from the HTML.
from ._utils import normalize_string
# ...
ingredient_tag = self.soup.find("li", class_="wprm-recipe-ingredient")
ingredient_text = normalize_string(ingredient_tag.get_text())
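To illustrate the kind of clean-up described here, the following standalone sketch unescapes HTML entities and collapses white space using only the standard library. simple_normalize is a hypothetical stand-in written for this example; it is not the library's normalize_string(), which may handle more cases.

```python
import html
import re


def simple_normalize(text: str) -> str:
    """Illustrative stand-in; the library's normalize_string() may do more."""
    # Convert HTML entities to characters, e.g. "&amp;" becomes "&".
    unescaped = html.unescape(text)
    # Replace every run of white space (including newlines) with one space.
    collapsed = re.sub(r"\s+", " ", unescaped)
    return collapsed.strip()


normalized = simple_normalize("  Salt &amp; pepper,\n   to taste ")
# 'Salt & pepper, to taste'
```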
Getting yields
A convenience function called get_yields() is provided in the _utils package. This function
accepts a str or bs4.element.Tag and returns the yield, handling many of the common formats
yields can appear in and normalizing them to a standard format.
from ._utils import get_yields
# ...
yield_tag = self.soup.find(class_="wprm-recipe-servings")
yield_text = get_yields(yield_tag)
# or
yield_text = get_yields(yield_tag.get_text())
# both return '4 servings'
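As a rough idea of what normalizing a yield involves, the sketch below pulls the first number out of a string and formats it as a serving count. simple_yields is a hypothetical, heavily simplified stand-in for illustration; get_yields() in the library handles more formats and should be used in real scrapers.

```python
import re


def simple_yields(text: str) -> str:
    """Illustrative stand-in; get_yields() in the library handles more formats."""
    match = re.search(r"\d+", text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    count = int(match.group())
    # Use the singular form for exactly one serving.
    unit = "serving" if count == 1 else "servings"
    return f"{count} {unit}"


from_label = simple_yields("Serves 4")
from_single = simple_yields("Yield: 1 serving")
```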
Getting times
A convenience function called get_minutes() is provided in the _utils package. This function
accepts a str or bs4.element.Tag and returns the number of minutes as an int. This
function handles a number of common formats that times can be expressed in.
from ._utils import get_minutes
# ...
prep_time_tag = self.soup.find(class_="wprm-recipe-prep_time-minutes")
prep_time_value = get_minutes(prep_time_tag)
# or
prep_time_value = get_minutes(prep_time_tag.get_text())
# both return 25
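The sketch below shows one way a time string could be converted to minutes, by looking for an hours part and a minutes part separately. simple_minutes is a hypothetical stand-in written for this example and covers only a few formats; it is not how get_minutes() is implemented.

```python
import re


def simple_minutes(text: str) -> int:
    """Illustrative stand-in; get_minutes() in the library handles more formats."""
    hours = re.search(r"(\d+)\s*(?:hours?|hrs?|h)\b", text, re.IGNORECASE)
    minutes = re.search(r"(\d+)\s*(?:minutes?|mins?|m)\b", text, re.IGNORECASE)
    if hours is None and minutes is None:
        raise ValueError(f"no time found in {text!r}")
    total = 0
    if hours is not None:
        total += int(hours.group(1)) * 60
    if minutes is not None:
        total += int(minutes.group(1))
    return total


short_time = simple_minutes("25 minutes")
long_time = simple_minutes("1 hour 25 mins")
```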
Finding multiple elements
Some information in a recipe, like the ingredients or instructions, comes in the form of lists where
we need to find multiple elements with the same attributes. We can use self.soup.find_all() for
this. find_all() takes the same arguments as find(), but returns a list of bs4.element.Tag
objects containing all the matching elements.
Using the same site as above, we can find all the ingredients like so:
ingredient_tags = self.soup.find_all("li", class_="wprm-recipe-ingredient")
ingredients_text = [normalize_string(tag.get_text()) for tag in ingredient_tags]
"""
[
'2 (20-ounce // 565g) cans of jackfruit (in water or brine)*',
'1 tablespoon coconut oil (or oil of choice)',
'1 1/2 teaspoons cumin seeds',
'1 1/2 teaspoons black mustard seeds (can substitute brown mustard seeds)',
'1 large yellow onion, diced',
...
]
"""
The Beautiful Soup documentation for find_all
is here.
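A self-contained sketch of find_all() on a small hypothetical list, including the limit argument and the empty result returned when nothing matches:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="ingredient">1 tablespoon coconut oil</li>
  <li class="ingredient">1 large yellow onion, diced</li>
  <li class="note">Serve with rice</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# All li elements with the "ingredient" class, in document order.
ingredients = [
    tag.get_text() for tag in soup.find_all("li", class_="ingredient")
]

# limit stops the search after the given number of matches.
first_only = [tag.get_text() for tag in soup.find_all("li", limit=1)]

# No match gives an empty result rather than None.
missing = soup.find_all("li", class_="instruction")
```

Unlike find(), an unmatched find_all() is safe to iterate over, so a list comprehension over it simply produces an empty list.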
Using CSS selectors
If you are already familiar with CSS selectors, you can use select() to achieve the same
result as find_all(), or select_one() to achieve the same result as find().
# Match all li elements with wprm-recipe-ingredient class
ingredient_tags = self.soup.select("li.wprm-recipe-ingredient")
The Beautiful Soup documentation for select
is here. MDN has a guide on
CSS selectors here.
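The equivalence between the two styles can be checked with a short sketch. The HTML and class names below are hypothetical and chosen only for this illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="ingredient">1 tablespoon coconut oil</li>
  <li class="ingredient">1 large yellow onion, diced</li>
  <li class="note">Serve with rice</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# select() and find_all() return the same elements for equivalent queries.
selected = [tag.get_text() for tag in soup.select("li.ingredient")]
found = [tag.get_text() for tag in soup.find_all("li", class_="ingredient")]

# select_one() returns the first match, or None if nothing matches.
first_selected = soup.select_one("li.ingredient")
no_match = soup.select_one("li.instruction")

# Selectors can also express structure, such as a direct child relationship.
notes = [tag.get_text() for tag in soup.select("ul > li.note")]
```

Selectors are often the shorter option when the query depends on nesting, while find() and find_all() read well for simple tag and class lookups.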
Finding elements using a partial attribute
Sometimes you might want to find elements using a part of an attribute. This is particularly helpful for websites that automatically generate CSS in a way that appends a random string to the end of class names.
An example of this is cooking.nytimes.com. If we wanted to select the yield element from
this page, we could use the class ingredients_recipeYield__DN65p. However, when the website is
updated in the future, the DN65p at the end of the class name is likely to change, so we only
want to match on the stable part of the class name.
There are two ways we can do this:
Using find
Instead of passing a string as an argument to find(), we can pass a compiled regular
expression.
import re
# ...
yield_tag = self.soup.find(class_=re.compile("ingredients_recipeYield"))
yield_text = yield_tag.get_text()
# Yield:4 servings
Using select
CSS also supports partial attribute matching. MDN has a useful guide here.
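A minimal sketch of the CSS approach, using the [attr*="value"] substring selector with select_one(). The HTML below is a simplified, hypothetical version of the markup, assumed only for this example; the real page structure may differ.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="ingredients_recipeYield__DN65p">'
    "<span>Yield:</span><span>4 servings</span>"
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# [class*="..."] matches any element whose class attribute contains the text,
# so the generated suffix (DN65p here) does not need to be known.
yield_tag = soup.select_one('[class*="ingredients_recipeYield"]')
yield_text = yield_tag.get_text()
# Yield:4 servings
```

The related ^= and $= operators match the start and end of the attribute value instead, which can be stricter when the stable part of the name is always a prefix.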