Soup is Beautiful, OR, How to Scrape the LCBO Website

Webscraping is an awkward and fragile process: you rely on the website maintainers not to change the structure of their HTML (which they always will), and you need to know not only the language your scraping library is written in (in this case Python) but also have a good understanding of both HTML and CSS.

Step 1 in scraping the LCBO website is to make some sense of their search URL. Go to their site ( https://www.lcbo.com/ ) and enter a search for a product that interests you. I searched for "dillon's rose gin" and was sent to this URL:

https://www.lcbo.com/webapp/wcs/stores/servlet/SearchDisplay?categoryId=&storeId=10203&catalogId=10051&langId=-1&sType=SimpleSearch&resultCatEntryType=2&showResultsPage=true&searchSource=Q&pageView=&beginIndex=0&pageSize=12&searchTerm=dillon%27s+rose+gin

I spent a lot of time tinkering with the various settings implied by the parts of the URL. The most obvious and necessary is &searchTerm=dillon%27s+rose+gin. Note that the single quote has been percent-encoded as %27 - that's URL encoding, not an HTML entity. You can (not saying you should, but it works and makes the URL easier to read and work with) remove a number of the parameters:

https://www.lcbo.com/webapp/wcs/stores/servlet/SearchDisplay?storeId=10203&showResultsPage=true&beginIndex=0&pageSize=20&searchTerm=dillon%27s+rose+gin

Notice I've also changed pageSize to 20 (it was originally 12): this parameter sets the number of search results per page.
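
If you'd rather not encode search terms by hand, urllib.parse will do the %27-style escaping for you. A minimal sketch - search_url is my own helper name, and the parameters simply mirror the trimmed URL above:

from urllib.parse import urlencode

BASE = "https://www.lcbo.com/webapp/wcs/stores/servlet/SearchDisplay"

def search_url(term, begin_index=0, page_size=20):
    # urlencode uses quote_plus under the hood, so "dillon's rose gin"
    # becomes "dillon%27s+rose+gin" - exactly the form the site expects.
    params = {
        "storeId": 10203,
        "showResultsPage": "true",
        "beginIndex": begin_index,
        "pageSize": page_size,
        "searchTerm": term,
    }
    return BASE + "?" + urlencode(params)

print(search_url("dillon's rose gin"))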

Step 2 is to install a scraping library. I'm a Python user, and have heard many good things about BeautifulSoup. So: dnf install python3-beautifulsoup4 (you'll also need python3-requests if it's not already installed). Both can of course be installed with pip or pip3, but I favour OS-level packages whenever possible as I'm far better at keeping them up-to-date.

Step 3 wasn't a step I expected, though what happened didn't surprise me much. Write out the basic structure:

import requests
from bs4 import BeautifulSoup as Soup

# The search term, already percent-encoded as in the URL above:
searchTerm = "dillon%27s+rose+gin"
URL = "https://www.lcbo.com/webapp/wcs/stores/servlet/SearchDisplay?storeId=10203&showResultsPage=true&beginIndex=0&pageSize=20&searchTerm=" + searchTerm
page = requests.get(URL)

# Parse the response with Python's built-in HTML parser:
soup = Soup(page.content, 'html.parser')
print(soup)

The result I received back was this:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">

<html><head>
<title>403 Forbidden</title>
</head><body>
<h1>Forbidden</h1>
<p>You don't have permission to access /webapp/wcs/stores/servlet/SearchDisplay
on this server.</p>
<hr/>
<address>IBM_HTTP_Server at www.lcbo.com Port 443</address>
<script async="async" src="/__zenedge/assets/f.js?v=1545403345" type="text/javascript"></script><script>(function () { var v = 1552763472 * 3.1415926535898; v = Math.floor(v); document.cookie = "__zjc4444="+v+"; expires=Sat, 16 Mar 2019 19:12:12 UTC; path=/"; })()</script></body></html>

I didn't have to do a search to know why their site was throwing a 403 on a legitimate request: years of working on and with websites gave me an immediate and, as it turned out, accurate guess. They're filtering on user agent. I hadn't checked what UA Python's requests library sends, but it's either an empty string or some indication that it's a scraper rather than a browser, and a lot of websites try to block scrapers. That's tricky at the best of times, and pretty much impossible to do based on UA, as anyone - and any program - can change their UA at will:

import requests
from bs4 import BeautifulSoup as Soup

searchTerm = "dillon%27s+rose+gin"
# Borrow a common browser User-Agent so we look like Chrome on Windows:
UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36'
URL = "https://www.lcbo.com/webapp/wcs/stores/servlet/SearchDisplay?storeId=10203&showResultsPage=true&beginIndex=0&pageSize=20&searchTerm=" + searchTerm
page = requests.get(URL, headers={'User-Agent': UA})

soup = Soup(page.content, 'html.parser')
print(soup)

With that change made, I get the search results back. I went to https://techblog.willshouse.com/2012/01/03/most-common-user-agents/ and chose what they claim is currently the world's most common user agent, on the basis that the LCBO won't be blocking that UA any time soon ...
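
As an aside, if you're curious what gave us away in the first place, requests records the headers it actually sent on the response object. A quick check (the version string will vary with your installed requests):

import requests

URL = ("https://www.lcbo.com/webapp/wcs/stores/servlet/SearchDisplay"
       "?storeId=10203&showResultsPage=true&beginIndex=0&pageSize=20"
       "&searchTerm=dillon%27s+rose+gin")

r = requests.get(URL)
print(r.status_code)                    # 403, before the UA fix
print(r.request.headers["User-Agent"])  # e.g. python-requests/2.21.0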

Now we can start parsing the output structure for the information we want.

Step 4: start looking at the structure of the search results page. For this we use our browser: in Firefox I right-click on an item of interest and select "Inspect Element". Chrome has essentially identical functionality. This is where you need to understand HTML and CSS, as we're going to be hunting for the parent element of each result, and then finding a set of child elements we're interested in. For me, those elements are:

  • the exact product name (at least according to the LCBO - they're not entirely reliable, but we make do)
  • the price
  • the special "not available" tag
  • the "deliver to store" field

This process is complicated by the fact that a search sometimes jumps directly to a product page (rather than a search results page) when you manage to find a search unicorn - i.e. you've entered a phrase for which they have only one result. This entry is mostly concerned with dealing with search results pages, but you have to be aware that the other result is a possibility.
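
One way to guard against it - a sketch that assumes results pages keep the ul.list_mode list we'll meet below, and that product pages don't have one (verify both against the live site):

# Continues from the soup object above. If the results list is missing,
# we've probably been bounced straight to a product page.
results = soup.select("ul.list_mode li")
if results:
    print("search results page with", len(results), "items")
elif soup.title:
    print("no result list - likely a product page:", soup.title.get_text(strip=True))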

Results come back as a list, with each item in a tag like this: <li class="ui-block-a">...</li>. Remember what I said about the fragility of webscraping? If you're reading this more than a week after I wrote it, you should re-research every element I mention, because they're going to have changed one (and possibly several) of them. Worse yet, your browser and your scraper may not even be served the same page layout. In fact, I had BeautifulSoup search on ui-block-a and got nothing: in the end, I had to resort to printing the entire page to a file as BeautifulSoup sees it (shown below) and examining that for the layout.
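
For the record, that dump is a one-liner (the filename is my own choice, and soup is the object from the earlier examples):

# Write the page out as BeautifulSoup parsed it, indented for readability:
with open("lcbo-search.html", "w") as f:
    f.write(soup.prettify())

With that file in hand, what I ended up with is this: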

# Each search result is an <li> inside the <ul class="list_mode"> list:
for item in soup.select("ul.list_mode li"):
    # Now we further dissect the <li> blocks that BS has found.
    #
    # There should only be one name link per item, but .select returns a list:
    for name in item.select("div.productChart div a"):
        print(name.get_text())
    for price in item.select("div.product_price"):
        print(price.get_text().strip())
    print("     -----")

I have a lot of work to do to get all the results I want, but this is a great start:

Dillon's Rose Gin
$24.95
     -----
Dillon's Gin 22 Unfiltered
$39.35
     -----
Dillon's Dry Gin
$39.30
     -----
Dillon's Cherry Gin
$24.70
...
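
The "not available" tag and the "deliver to store" field from my Step 4 list still need selectors. A hedged sketch of the shape that code will take - productAvailable and storeDelivery below are placeholder class names I made up, not the LCBO's real ones; Inspect Element will give you the actual targets:

for item in soup.select("ul.list_mode li"):
    name = item.select_one("div.productChart div a")
    price = item.select_one("div.product_price")
    # Hypothetical selectors - substitute whatever classes the LCBO
    # is actually using this week:
    unavailable = item.select_one("div.productAvailable")
    delivery = item.select_one("div.storeDelivery")
    print(name.get_text(strip=True) if name else "(no name)")
    print(price.get_text(strip=True) if price else "(no price)")
    if unavailable:
        print(unavailable.get_text(strip=True))
    if delivery:
        print(delivery.get_text(strip=True))
    print("     -----")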

Step 5: Spend some time looking up the multiple other, probably better, LCBO scrapers written by other people. I'll probably continue with my own project even though a friend has pointed out that these other projects exist, because none of them does exactly what I want. Maybe you'll fare better with the other projects.