
Hi, Hola!, and 你好!

Welcome to my Data Science Journey Blog! 

Hope you enjoy the posts and pick up something new!

 

Basic Web Scraping with Python


Hi there! John DeJesus here! Welcome to my Data Science Blog!

This will be a place to share my journey through data science applications and news. I hope everyone enjoys the posts and is able to take away something from each of them.

For the first post, I will walk you through a basic application of web scraping using Python 3.

This is my first web scraping application, and it follows an approach similar to the one in Kevin Markham's web scraping demonstration videos on YouTube (Twitter: @justmarkham).

**Note. Depending on when you read this, the end results may not be the same since the data on the page may have changed. The end result is current as of 12/3/17**

But wait.....what is web scraping?

Web scraping is the extraction of information from a webpage.

This is a useful skill in case the data you need for an analysis does not exist in a tidy csv file (it most likely will not) or in another ready-made source. In that case, one of the only places to retrieve that data may be the webpage(s) that hold that information.

To illustrate this, we will be scraping the names, positions, and email addresses from the staff page of this Bronx high school.

Step 1: Import the webpage into Python using the requests library

import requests
p=requests.get('https://www.newvisions.org/ams2/pages/our-staff2')

This is the only thing we need the requests library for. We will define p as the response for the webpage using requests.get().
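If you want a small safeguard before parsing, here is a minimal sketch of fetching the page with a status check. The helper name fetch_html, the get parameter, and the timeout value are my own assumptions for illustration and are not part of the original script. The get parameter is only there so the function can be tried with a stand-in instead of a live request.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the HTML text of url, raising an error on a bad status code.

    fetch_html, get, and timeout are hypothetical names chosen for this sketch.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Usage (makes a live request, so the result depends on the page being up):
# html = fetch_html('https://www.newvisions.org/ams2/pages/our-staff2')
```

The returned text plays the same role as p.text in the next step.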

Step 2: Import BeautifulSoup4 and apply to webpage

#Import BeautifulSoup and parse the webpage's HTML
from bs4 import BeautifulSoup
soup=BeautifulSoup(p.text, 'html.parser')

#Collect every div tag whose class holds the desired data
results=soup.find_all('div', attrs={'class':'matrix-content'})
len(results)
#Keep only the teacher profiles, which start at index 27
results=results[27:]
len(results)

We will import BeautifulSoup from bs4 to parse the HTML code from our webpage "p". Next we will investigate the HTML code from the webpage to find the section of code we need. To view a webpage's HTML code, go to the webpage, right click and select "View page source". You can then use ctrl-f to find a staff member's name and see the piece of HTML code where their name and info are embedded.

If you scroll a bit through the code, you should notice that pieces of the code are enclosed by lines of code such as:

<title>......</title>

or

<p>.....</p>

These are known as tags in the HTML code. Between some of these tags is the information we want to scrape.

Since we see the desired information is inside a 'div' tag with class='matrix-content', we can assume that the info for all the teachers is in the tags with that class. That is why we pass the tag and the class as the parameters for the find_all method of soup.

The first instance of "len(results)" is to see how many staff members there are on the webpage. This gives us a total of 74 staff members.  But since we are only collecting data on the teachers, this number is too high. So we need to start at the index where the first teacher profile occurs. The first teacher to appear is "Mr. Brogan". You can use ctrl-f to search for his name in the HTML code. If you count (starting from 0 of course), Mr. Brogan's index is 27. That is why we are redefining results starting from index 27. A check on the length of results and a mental count of the removed staff members confirms we are good to go!
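To see find_all and the slicing in isolation, here is a minimal sketch on a tiny hand-written page. The HTML and the staff names in it are made up for illustration, and looking up the starting index with list.index is my own alternative to counting by hand, not what the original script does.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the staff page: two non-teachers, then two teachers
html = """
<div class="matrix-content"><h5>Principal Example</h5></div>
<div class="matrix-content"><h5>Counselor Example</h5></div>
<div class="matrix-content"><h5>Teacher One</h5></div>
<div class="matrix-content"><h5>Teacher Two</h5></div>
<div class="other"><h5>Not a profile</h5></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Only div tags with class "matrix-content" are returned
profiles = soup.find_all('div', attrs={'class': 'matrix-content'})
print(len(profiles))

# Find the index of the first teacher by name instead of counting manually
names = [profile.find('h5').text for profile in profiles]
first_teacher = names.index('Teacher One')

# Keep everything from the first teacher onward
teachers = profiles[first_teacher:]
print(len(teachers))
```

On the real page the same idea would apply with "Mr. Brogan" as the name to look up, assuming the name appears in the 'h5' tag exactly as written.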

Step 3: Read the HTML code for tag patterns

Let's look at the HTML code for one of the teacher profiles. We will again inspect Mr. Brogan's info:

 <div class="matrix-content">
        <h5>Mr. Brogan</h5>                   
        <div class="matrix-copy"><p>
    Special Education: Geometry, Geometry Modeling, SETSS</p>
<p>
    <em>rbrogan31@charter.newvisions.org</em></p>
</div>
                                            </div>

Again, we need to determine the tags that contain the teacher's name, position(s), and his/her email. Take a second to try and answer this question yourself, then read on to see if you were right. This will provide good practice for investigating what parts of the HTML you need to indicate in your python scraping code. Remember the examples I showed you earlier.

Name tag: The name is between the tags marked 'h5'.

Position(s) tag: The position(s) are between the first pair of 'p' tags inside the 'div' tag with the following class:

<div class="matrix-copy">

Email tag: The email is between the tags 'p' and 'em'. Since the 'em' tag directly encloses the email, that is the tag we will indicate in our scraping code.

Great! Now that we have found the tags we need to indicate, let's write the code for our first teacher to determine how we will loop through the teacher entries to get all of the data!

Step 4: Experiment with scraping a single entry

When I need to create a loop, it is not always straightforward for me. Sometimes I have to do a test run of the code for a single instance before creating the general code structure for the loop. Kevin Markham takes the same approach in his web scraping demonstration. We will do that here to make sure we have all the code for the loop.

#Testing with the first teacher and obtaining the name
test_result=results[0]
test_result.find('h5')
test_result.find('h5').text

#Obtaining the position(s)
test_result.find('p').contents[0].strip('\n\t')

#Obtaining the email
test_result.find('em').get_text()
 

To begin, we will define our first teacher as 'test_result'.

Name: By using the find method on the 'h5' tag, we get the line of code with our teacher's name. But this doesn't give us the name without the tags. We don't want the tags in our data. So to extract just the name text, we will add ".text" to the find method.

Position(s): We will use the same find method as with the name, but this time our parameter will be the tag 'p'. Doing so gets us our position, but again we don't want the tags attached. Using .text again (or .contents[0], the first child of the tag, as in the code above) returns the following.....

'\n\tSpecial Education: Geometry, Geometry Modeling, SETSS'

This gave us more than we wanted. Specifically, we were given the string codes for new line (\n) and tab (\t) at the beginning. Since our info is in a string, we can remove the parts we don't need by adding .strip('\n\t') to our line of code. Note that strip only removes these characters from the beginning and end of the string, not from the middle, which is all we need here.

Email: Obtaining this information was much more straightforward. Again, we use the find method, this time with the 'em' tag as our parameter. Using the .get_text() method helps us with this since some of the emails are embedded in multiple 'em' tags.

Step 5: Create a loop to extract all desired entries

#Data extraction
info=[]
for result in results:
    name=result.find('h5').text
    position=result.find('p').contents[0].strip('\n\t')
    try: 
        email=result.find('em').get_text()
    except AttributeError:
        #find returns None when there is no 'em' tag, so .get_text() fails
        email='NaN'
    info.append((name,position,email))

Now that we have experimented and worked out the code needed to scrape the info, we will combine all the pieces in a loop. First, we will create an empty list named info to store the scraped information.

Looping through each teacher profile ("result"), we will assign the variables "name", "position", and "email" using the respective lines of code that extract each piece of information. We will then append that information to the info list as a single tuple.

When I first ran the loop without the try/except setup, it stopped with an error on the email line. This is where the variable explorer in Spyder is valuable. The name variable showed the teacher whose email could not be scraped, while the email variable still held the email of the previous teacher in the results list. Inspecting the webpage or the HTML code reveals that "Ms. Willie" does not have an email address on her teacher profile.

To address this, we will use a try/except setup so that her email, and any other missing email, is entered as 'NaN'. Meanwhile everyone else will have their email scraped as normal. Running the code again allowed us to obtain all the entries successfully. You can confirm this by checking the length of the info list (it should return 47). If your Python editor has a variable explorer, as Spyder does, it should show the size of the info list as 47 (47 teachers).
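The reason the loop fails on a profile without an email is that find returns None when the tag is missing, and None has no get_text method. Here is a minimal sketch of that situation on two made-up profiles, using an explicit None check as an alternative to try/except. The names, subjects, and addresses are hypothetical.

```python
from bs4 import BeautifulSoup

# Two hypothetical profiles: the second one has no <em> tag, so no email
html = """
<div class="matrix-content"><h5>Teacher One</h5>
  <div class="matrix-copy"><p>Algebra</p><p><em>one@example.org</em></p></div></div>
<div class="matrix-content"><h5>Teacher Two</h5>
  <div class="matrix-copy"><p>Biology</p></div></div>
"""

soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all('div', attrs={'class': 'matrix-content'})

info = []
for result in results:
    name = result.find('h5').text
    position = result.find('p').text.strip()
    em_tag = result.find('em')  # None when the profile has no <em> tag
    email = em_tag.get_text() if em_tag is not None else 'NaN'
    info.append((name, position, email))

print(info)
```

Both versions produce the same list. The None check just makes the missing-email case visible in the code.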

Step 6: Creating the Data Frame and checking/cleaning it

#Convert data into a dataframe
import pandas as pd
df=pd.DataFrame(info, 
columns=['Name','Position(s)','Email'])

#Determining duplicates and the quantity
for column in df.columns:
    print(df.duplicated([column]))
    print(df.duplicated([column]).sum())

#Eliminating duplicates
df.drop_duplicates(['Name'],keep='first', inplace=True)

We will convert the info list into a data frame with the column names 'Name', 'Position(s)', and 'Email' so that we can export it to a csv.

As teachers tell a student to check their work before handing in an exam/essay, we should inspect our data before we try to do anything further. In our case, we should inspect the data frame to confirm all the data was extracted correctly. 

Surveying the webpage beforehand revealed that there are duplicate photos of the same teachers. This implies that those teachers' names and emails were most likely duplicated in our data frame also. To confirm this in the data frame, we loop through the columns to check the duplicates and how many there are.

Duplicates in positions don't matter since there can be multiple teachers teaching the same subject. Looking at the boolean results for 'Name' and 'Email', we see 3 duplicates in each column, and they occur in the same rows.

Thus we will eliminate the duplicates by keeping the first entries for those teachers and return the data frame with the duplicates removed.
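Here is a minimal sketch of the duplicate check and removal on a three-row data frame, so you can see what duplicated and drop_duplicates return. The rows are made up for illustration, and I return a new data frame instead of using inplace=True so both versions can be compared.

```python
import pandas as pd

# Hypothetical scraped records: the first teacher appears twice
info = [
    ('Teacher One', 'Algebra', 'one@example.org'),
    ('Teacher Two', 'Biology', 'two@example.org'),
    ('Teacher One', 'Algebra', 'one@example.org'),
]
df = pd.DataFrame(info, columns=['Name', 'Position(s)', 'Email'])

# duplicated marks every repeat after the first occurrence as True
name_flags = df.duplicated(['Name'])
dup_counts = {column: int(df.duplicated([column]).sum()) for column in df.columns}
print(dup_counts)

# Keep the first entry for each name and drop the later repeats
deduped = df.drop_duplicates(['Name'], keep='first')
print(deduped)
```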

Step 7: Export to a csv file.

#Export to a csv file without numbered indices
df.to_csv('BronxSchoolStaffInfo.csv', index=False)

Finally, we will export the data frame to a csv without indices. A copy of the csv can be found here. You can also go to the full GitHub repository to see the full script.
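To see what index=False changes, here is a minimal sketch that writes a one-row data frame to a string instead of a file (to_csv returns the csv text when no path is given). The row is made up for illustration.

```python
import pandas as pd

# Hypothetical one-row data frame with the same columns as ours
df = pd.DataFrame(
    [('Teacher One', 'Algebra', 'one@example.org')],
    columns=['Name', 'Position(s)', 'Email'],
)

# With no file path, to_csv returns the csv text
csv_text = df.to_csv(index=False)
csv_with_index = df.to_csv()

# Compare the header lines of the two versions
header = csv_text.splitlines()[0]
header_with_index = csv_with_index.splitlines()[0]
print(header)
print(header_with_index)
```

With the index included, an extra unnamed first column holding the row numbers is written, which is usually not wanted in a file meant for sharing.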

Final Thoughts

Web scraping gives you that feeling of magic since you can pull info from a website once you find the tags you need. I hope that this walkthrough was helpful to those considering learning how to web scrape in Python.

Of course there is more clean up that could have been done prior to exporting the data frame to a csv. I am not going to include these since I want to focus on the web scraping aspect.

Options include:

1. Creating a gender column by splitting the names by their titles at the period, then using pandas to map the title to the appropriate gender.

2. Separating the positions (since most of the teachers seem to teach more than one class).

We could then turn to Tableau/matplotlib for visualization and statistics to answer questions regarding those pieces of data compared to the teacher population versus other charter and public Bronx schools.
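For anyone who wants to try the two clean-up options, here is a rough sketch under some assumptions. "Ms. Example" and her position are made up, the title_to_gender dictionary is my own guess at a mapping, and I am assuming the positions are always separated by a comma and a space, as in Mr. Brogan's profile.

```python
import pandas as pd

# One real-looking row from the post and one hypothetical row
df = pd.DataFrame({
    'Name': ['Mr. Brogan', 'Ms. Example'],
    'Position(s)': [
        'Special Education: Geometry, Geometry Modeling, SETSS',
        'Biology',
    ],
})

# Option 1: split the name at the period and keep the title
df['Title'] = df['Name'].str.split('.').str[0]

# Hypothetical mapping from title to gender; titles not listed become NaN
title_to_gender = {'Mr': 'M', 'Ms': 'F', 'Mrs': 'F'}
df['Gender'] = df['Title'].map(title_to_gender)

# Option 2: separate the positions into a list per teacher
df['Position list'] = df['Position(s)'].str.split(', ')

print(df[['Name', 'Title', 'Gender', 'Position list']])
```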

Acknowledgements

Thank you for taking the time to read this post. Again, be sure to check out the videos by Kevin Markham, where I learned to web scrape, here for another example with a video walk-through.

A special thanks to Tony Fischetti for the encouragement, mentorship, and help editing this post! Check out his blog also!

Feel free to comment on the post and the website. Any constructive critiques are welcomed.

Until next time,

-John DeJesus

 

Web Scraping with Python: Round 2
