BeautifulSoup Example


I use Google Calendar to organize my life, so I was disappointed (nay, horrified) when I learned that Purdue didn’t provide the calendar in any reasonable machine-readable format, only as a webpage or a PDF document (this year’s official calendar is at https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html).

So, I decided to write a little script to grab the information from the website and put it into a CSV file that Google Calendar can understand (CSV format for their calendars taken from https://support.google.com/calendar/answer/37118?hl=en).

I’m writing this up and publishing it for two audiences: first, other Purdue community members who want to import the calendar into their own calendar program (although the easier way is to just subscribe to my public Google calendar), and second, anyone looking for an example of screen scraping with Beautiful Soup.

# First, we import the required libraries
import requests # This fetches the webpage
from bs4 import BeautifulSoup # This is the parsing library
from datetime import datetime, timedelta # Need to convert things to/from date objects
import re # Regular expressions
import pandas as pd # Pandas, just for exporting the final result

The first step is to download the webpage as HTML, and then to parse it into a soup object.

doc = requests.get('https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html')
soup = BeautifulSoup(doc.text, 'html.parser') # Name a parser explicitly to avoid bs4's "no parser specified" warning

At this point, it makes sense to explore the structure of the page. You could look at the raw HTML or at the soup object, but it’s usually much easier to right-click on the page itself and choose “Inspect” (both Chrome and Firefox have this).

The next step is to figure out a good path to the data we want: the dates and the descriptions of the events. If you look at the page, it’s structured something like this (with irrelevant portions omitted):

<html>
    ...
    <div class = "maincontent col-lg-9 col-md-9 col-sm-9 right">
        <h4> August </h4>
        ...
        <table>
            <tr>
                <td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">
                    19
                </td>
                <td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9">
                    <strong>7:30 a.m.</strong><br>
                    FALL SEMESTER CLASSES BEGIN
                </td>
                ...
            </tr>
        </table>
        ...
        <h4> September </h4>
        ...
    </div>
   ...
</html>

What we’re looking for are descriptors unique enough to get us the data we want and nothing else.

All of the calendar data is in the div with the ‘maincontent’ class, so we’ll use that to limit our search. Each month name is in an <h4> tag, followed by a table with a row for each calendar entry for that month.
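It helps to know that find() accepts a class name as its second argument. A toy document (my own, not from the Purdue page) shows the shorthand:

```python
from bs4 import BeautifulSoup

# Toy document (not the real page) to show that find(tag, class_name)
# matches on the class attribute, even when the element has other classes too
toy = BeautifulSoup('<div class="maincontent extra"><h4>August</h4></div>'
                    '<div>other</div>', 'html.parser')
print(toy.find('div', 'maincontent').h4.text)  # August
```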

At this point, it takes some playing around to find the right syntax. It turns out that each month header is followed directly by the table of that month’s events, making the event table the ‘sibling’ of the month header, which we can access with BeautifulSoup’s find_next_sibling() function.

for item in soup.find('div', 'maincontent').find_all('h4')[:2]: # Only looking at the first few for testing
    print(item.text) # Should be the month
    table = item.find_next_sibling()
    print(table)
August
<table class="calendarTable" summary="blah">
<tbody>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">19</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong>7:30 a.m.</strong><br/>FALL SEMESTER CLASSES BEGIN</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">26</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong>5 p.m.</strong><br/>Last Day to Register Without a Late Fee</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
</tbody>
</table>
September
<table class="calendarTable" summary="blah">
<tbody>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">2</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9">Last Day to Cancel a Course Assignment Without It Appearing On Record</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">2</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9">LABOR DAY (No Classes)</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">16</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong><strong>5 p.m.</strong><br/></strong>Last Day to Withdraw a Course With a Grade of W or to Add/Modify a Course With Instructor and Advisor Signature</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
<tr>
<td class="day noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">30</td>
<td class="description col-lg-11 col-md-10 col-sm-10 col-xs-9"><strong><strong><strong>5 p.m.</strong><br/></strong></strong>Last Day For Grade Correction for Spring Semester 2018-19 and 2019 Summer Session</td>
<td class="weekDay noGutterLeft col-lg-1 col-md-2 col-sm-2 col-xs-3">Mon</td>
</tr>
</tbody>
</table>

Parsing the table

This looks really good, so the next step is just to extract the content from the table.

Each event has its own row, so we parse the table to find the <td> tags with the information we’re looking for.

for item in soup.find('div', 'maincontent').find_all('h4')[:3]:
    curr_month = item.text
    table = item.find_next_sibling()
    for event in table.find_all('tr'):
        day_string = event.find('td', 'day').text # Get the text from the table cell with the 'day' class
        description_string = event.find('td', 'description').text
        print(curr_month, day_string, description_string)
August 19 7:30 a.m.FALL SEMESTER CLASSES BEGIN
August 26 5 p.m.Last Day to Register Without a Late Fee
September 2 Last Day to Cancel a Course Assignment Without It Appearing On Record
September 2 LABOR DAY (No Classes)
September 16 5 p.m.Last Day to Withdraw a Course With a Grade of W or to Add/Modify a Course With Instructor and Advisor Signature
September 30 5 p.m.Last Day For Grade Correction for Spring Semester 2018-19 and 2019 Summer Session
October 7 
Schedule of Classes published for Spring 2020 Term

October 7-8 
OCTOBER BREAK

October 16 
7:30 a.m.Second Eight-Week Courses Begin

October 22 
5 p.m.Last Day To Withdraw From a Course With a W or WF Grade

October 22 
5 p.m.Last Day To Add/Modify a Course With Instructor, Advisor, and Department Head Signatures

Data Cleaning

It looks like we have the data that we need, but we need to deal with two problems.

First, some events include a start date and an end date (e.g., October 7-8), so we need to figure out how to deal with that.

Second, the descriptions sometimes include times (which I don’t want), sometimes have linebreaks, etc. We want to clean them up to just show the event name itself.

Let’s start by figuring out the start and end date problem. We basically want to turn the string ‘7-8’ into the list ['7','8']. We do that with the split() function, which takes as an argument the character to split on:

'7-8'.split('-')
['7', '8']

We clean up the event names with the strip() function, which removes leading and trailing whitespace, and with a regular expression that removes the time.

Regular expressions are a super powerful way of manipulating text and are worth learning. Unlike much of Python, they are very difficult to read at a glance, so I’ll take a minute to explain the following code:

re.sub(r'^\d.*?[ap]\.m\.','',description_string.strip())

re.sub is a function from the regular expression library that takes in 3 arguments: the regular expression pattern you want to search for, what you want to replace that pattern with, and the string where you want to search.

The pattern ^\d.*?[ap]\.m\. matches a time at the beginning of a description. Times take various formats, such as ‘5:30p.m.’ or ‘4 p.m.’, and we need to capture all of them.
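We can check that the pattern really matches both formats (the descriptions here are made up for illustration):

```python
import re

pattern = r'^\d.*?[ap]\.m\.'
# Both time formats from the page should match at the start of the string
print(re.search(pattern, '5:30p.m. CLASSES END').group(0))  # 5:30p.m.
print(re.search(pattern, '4 p.m. Deadline').group(0))       # 4 p.m.
```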

^ means to start matching at the beginning of the string.
\d matches a single digit (i.e., a number from 0-9).
. matches any character and * means “of any quantity”, so \d.* matches a digit followed by zero or more characters of any type.
[ap] matches either the ‘a’ or ‘p’ character, and [ap]\.m\. matches ‘a’ or ‘p’, followed by a ‘.’, followed by ‘m’, followed by another ‘.’ (the ‘\’ characters mean to treat ‘.’ as a literal period and not as a representation for any character).
Finally, you may have noticed that I skipped the ?: it makes the preceding * ‘non-greedy’; in other words, it matches as little as possible.

An example may make this more clear:

string = 'Lorem ipsum dolor'

re.search('^L.*m', string).group(0)
'Lorem ipsum'
re.search('^L.*?m', string).group(0)
'Lorem'

In the first case, the pattern matches until the last m that it finds, while in the second it matches until the first m.

So, taken together the original regular expression means, “Look for a digit at the beginning of the string. Then, get all of the text until you come to either ‘a.m.’ or ‘p.m.’.” The next argument to re.sub, the empty string (‘’), means to remove the text that matches.
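Applied to one of the description strings we saw earlier, the whole substitution looks like this:

```python
import re

description_string = '7:30 a.m.FALL SEMESTER CLASSES BEGIN'
# strip() removes surrounding whitespace; re.sub removes the leading time
cleaned = re.sub(r'^\d.*?[ap]\.m\.', '', description_string.strip())
print(cleaned)  # FALL SEMESTER CLASSES BEGIN
```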

Putting it together

Here’s some code that puts all of this together.

After writing the first version of this, I realized that the events span multiple years and the year is not listed anywhere on the page, so we need to keep track of when we move into the next year. We do this by assuming that the start dates are in chronological order: by converting each event date into a datetime, we can detect when the current entry appears earlier than the last entry and conclude that we’ve actually moved into the next year.
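Sketched in isolation, the rollover check looks like this (the dates are made up; the full script below re-parses the day string with the incremented year rather than calling replace()):

```python
from datetime import datetime

# The last event we saw was in December 2019...
last_date = datetime.strptime('December' + '19' + '2019', '%B%d%Y')
# ...so a January event parsed with the same year lands 'before' it
candidate = datetime.strptime('January' + '13' + '2019', '%B%d%Y')
if candidate < last_date:  # out of chronological order: we crossed into the next year
    candidate = candidate.replace(year=candidate.year + 1)
print(candidate.date())  # 2020-01-13
```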

events = []
curr_month = None
curr_year = '2019'
# Initialize to earliest possible date (used to change year)
last_date = datetime.strptime('2019-01-01', '%Y-%m-%d')


def get_start_end_dates(year, month, days):
    '''
    Function to convert strings of year, month, and days into a datetime object
    
    Input:
        year: 4-digit string
        month: Full name of month (e.g., 'January')
        day: String of one date or two dates separated by '-'
        
    Output:
        A tuple of datetime objects representing the start and end dates.
        
    Assumes that both days are within the same month.
    '''
    def get_date(day):
        '''Very simple helper function that converts strings to a datetime'''
        return datetime.strptime(month + day + year, '%B%d%Y')   
    
    days = days.split('-')
    start_date = get_date(days[0])
    end_date = get_date(days[-1])
    return (start_date, end_date)

for item in soup.find('div', 'maincontent').find_all('h4'):
    curr_month = item.text
    table = item.find_next_sibling()
    for event in table.find_all('tr'):
        day_string = event.find('td', 'day').text # Get the text from the table cell with the 'day' class
        start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string)
        if start_date < last_date:
            curr_year = '2020' # update the year if the current event is out of order
            start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string) # And get new start and end dates
        last_date = start_date
        description_string = event.find('td', 'description').text
        description = re.sub(r'^\d.*?[ap]\.m\.','',description_string.strip())
        
        # CSV format is given at https://support.google.com/calendar/answer/37118?hl=en
        curr_event = {
            'Subject': description,
            'Start Date': start_date,
            'End Date': end_date,
            'All Day Event': True,
            'Private': False
        }
        events.append(curr_event)
        

# We then use pandas simply to write the output to a CSV file
pd.DataFrame(events).to_csv('/home/jeremy/Desktop/DeleteMe/purdue_cal.csv', index=False)
        

We’ll take a quick look at the output to make sure it worked:

pd.read_csv('/home/jeremy/Desktop/DeleteMe/purdue_cal.csv').sample(10)
     Subject                                            Start Date  End Date    All Day Event  Private
52   Third 4-Week Summer Module Begins                  2020-07-13  2020-07-13  True           False
17   COMMENCEMENT (First Division)                      2019-12-15  2019-12-15  True           False
54   12-Week Full Summer Module Ends (Grades due by...  2020-08-07  2020-08-07  True           False
34   Deadline For Pending Spring 2019 Incomplete Gr...  2020-05-09  2020-05-09  True           False
6    Schedule of Classes published for Spring 2020 ...  2019-10-07  2019-10-07  True           False
56   Third 4-Week Summer Module Ends (Grades due by...  2020-08-07  2020-08-07  True           False
43   12-Week Full Summer Module Begins                  2020-05-18  2020-05-18  True           False
18   COMMENCEMENT (Second Division)                     2019-12-15  2019-12-15  True           False
1    Last Day to Register Without a Late Fee            2019-08-26  2019-08-26  True           False
23   Last Day to Cancel a Course Assignment Without...  2020-01-27  2020-01-27  True           False

It looks great, and works for events with different start and end dates. This is ready to upload to Google Calendar. But let’s level up.

Looping across pages

If I knew I only wanted to get data this one time from this one page then what I have above would be just fine. However, there are calendars for multiple years and I’d like to get all of them. How could we extend what we’ve done?

What I want to do is put the code into a function that I can call with different pages.

So, what’s missing to let us do that?

First, we have the year hardcoded in; we need to figure out how to either extract it or take it as a parameter.

Second, we need to aggregate the content before writing it to a CSV file.

Getting the year

I noticed that the year is in the URL, so we can take the URL as a parameter and just extract the year, like so:

re.search(r'(\d{4})-', 'https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html').group(1)
'2019'

\d{4} looks for a four-digit number, and the parentheses put it in its own ‘group’. The - means that the number has to come right before a dash.

Then, at the end of the expression, group(1) means to return the first group that was in parentheses.
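The difference between the whole match and the captured group is easy to see side by side:

```python
import re

m = re.search(r'(\d{4})-', 'calendars/2019-20-Academic-Calendar.html')
print(m.group(0))  # '2019-' : the entire match, dash included
print(m.group(1))  # '2019'  : just the parenthesized group
```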

To update the year, we’ll need to take out the hardcoding. To increment a year that’s stored as a string, we turn it into an integer, add 1, and turn it back into a string, like so:

str(int('2020') + 1)
'2021'

Aggregating the data

In order to aggregate the data, we’ll create a function that will get all of the events from one page and return them as a list of event dictionaries. We will have an outer loop that figures out which URLs to look at, passes them to the function, and then appends the events from each page together and saves them.

def get_events(url):
    # Notice that the URL includes the year in it, so we can extract that
    
    doc = requests.get(url)
    soup = BeautifulSoup(doc.text, 'html.parser')

    events = []
    curr_month = None
    curr_year = re.search(r'(\d{4})-', url).group(1)
    # Initialize to earliest possible date (used to change year)
    last_date = datetime.strptime(curr_year + '-01-01', '%Y-%m-%d')


    def get_start_end_dates(year, month, days):
        '''
        Function to convert strings of year, month, and days into a datetime object

        Input:
            year: 4-digit string
            month: Full name of month (e.g., 'January')
            day: String of one date or two dates separated by '-'

        Output:
            A tuple of datetime objects representing the start and end dates.

        Assumes that both days are within the same month.
        '''
        
        def get_date(day, month = month, year = year):
            '''Very simple helper function that converts strings to a datetime'''
            if not re.match(r'\d', day): # If it doesn't start with a digit, it crosses months
                month, day = day.split() # Split on whitespace and grab the month and day
            try:
                return datetime.strptime(month + day + year, '%B%d%Y')
            except ValueError:
                month = month.strip('.')
                return datetime.strptime(month + day + year, '%b%d%Y')
        
        days = days.split('-')
        start_date = get_date(days[0])
        end_date = get_date(days[-1])
        return (start_date, end_date)

    for item in soup.find('div', 'maincontent').find_all('h4'):
        curr_month = item.text
        table = item.find_next_sibling()
        for event in table.find_all('tr'):
            day_string = event.find('td', 'day').text # Get the text from the table cell with the 'day' class
            start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string)
            if start_date < last_date:
                curr_year = str(int(curr_year) + 1) # update the year if the current event is out of order
                start_date, end_date = get_start_end_dates(curr_year, curr_month, day_string)
            last_date = start_date
            description_string = event.find('td', 'description').text
            description = re.sub(r'^\d.*?[ap]\.m\.','',description_string.strip())

            # CSV format is given at https://support.google.com/calendar/answer/37118?hl=en
            curr_event = {
                'Subject': description,
                'Start Date': start_date,
                'End Date': end_date,
                'All Day Event': True,
                'Private': False
            }
            events.append(curr_event)
    return events # Return the list of all events

Now we can write our outer function, which by convention is called ‘main()’.

Currently, Purdue lists calendars until 2024-2025; let’s grab all of them.

We can build a simple loop to create the URLs based on the structure we know they have

for i in range(19, 25):
    url = 'https://www.purdue.edu/registrar/calendars/20{}-{}-Academic-Calendar.html'.format(i, i+1)
    print(url)
https://www.purdue.edu/registrar/calendars/2019-20-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2020-21-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2021-22-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2022-23-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2023-24-Academic-Calendar.html
https://www.purdue.edu/registrar/calendars/2024-25-Academic-Calendar.html

def main(filename):
    events = []
    for i in range(19, 25):
        url = 'https://www.purdue.edu/registrar/calendars/20{}-{}-Academic-Calendar.html'.format(i, i+1)
        events = events + get_events(url)
    
    pd.DataFrame(events).to_csv(filename, index=False)

As is often the case when enlarging the data you’re retrieving, I found a weakness. In some of the later years, events actually span multiple months. In those cases, the month is included in the day td, so the day list ends up being something like ['April 25', 'May 1']. This was easy to fix: I wrote a little regex to check whether a day starts with a digit and to parse the month out if it doesn’t.

I put the code in the get_date helper function above.
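In isolation, the check looks like this (parse_day is a hypothetical stand-alone version of the logic inside get_date):

```python
import re

def parse_day(day, month):
    '''Mirror the check in get_date: if the day cell starts with a letter,
    it carries its own month, so split the month out of it.'''
    if not re.match(r'\d', day):
        month, day = day.split()
    return month, day

print(parse_day('25', 'April'))    # ('April', '25')
print(parse_day('May 1', 'April')) # ('May', '1')
```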

Finally

Ok, now we’re finally ready to run it. We run the main function and then check the output to make sure it looks reasonable.

After it’s done, I’ll use it to update my Google Calendar.

main('/home/jeremy/Documents/purdue_cal.csv')
# Look at some random rows to see how it looks
pd.read_csv('/home/jeremy/Documents/purdue_cal.csv').sample(15)
     Subject                                            Start Date  End Date    All Day Event  Private
37   COMMENCEMENT (Second Division)                     2020-05-15  2020-05-15  True           False
131  SEMESTER ENDS                                      2021-12-18  2021-12-18  True           False
152  COMMENCEMENT (First Division)**                    2022-05-13  2022-05-13  True           False
341  Second 4-Week Summer Module Ends (Grades due b...  2025-07-11  2025-07-11  True           False
282  First 8-Week Summer Module Ends (Grades due by...  2024-07-05  2024-07-05  True           False
26   Second Eight-Week Courses Begin                    2020-03-09  2020-03-09  True           False
40   COMMENCEMENT (Fifth Division)                      2020-05-17  2020-05-17  True           False
25   Last Day For Grade Correction For Fall Semest...   2020-02-24  2020-02-24  True           False
186  CLASSES END                                        2022-12-10  2022-12-10  True           False
234  Last Day to Cancel a Course Assignment Without...  2023-09-04  2023-09-04  True           False
326  COMMENCEMENT (First Division)**                    2025-05-16  2025-05-16  True           False
88   Spring Vacation                                    2021-03-15  2021-03-20  True           False
246  FINAL EXAMS                                        2023-12-11  2023-12-16  True           False
160  First 4-Week Summer Module Begins                  2022-05-16  2022-05-16  True           False
183  Last Day To Withdraw From a Course With a W or...  2022-10-25  2022-10-25  True           False

Conclusion

So, that is a basic introduction to BeautifulSoup and to using it to parse web pages.

As you can see, it’s a really powerful tool. In this case, I probably could have done the work manually in less time than it took to write this script, but imagine having hundreds or thousands of web pages to parse. This kind of scripting can save hundreds of hours. And it’s fun!