
Introduction

The following tutorial describes how to scrape a webpage rendered by JavaScript using the Selenium module for Python. If you already know why you need to use a browser to retrieve all of the data from such a webpage, and are just looking to learn how to use Selenium, feel free to skip the first section.

The files containing all of the code that I use in this tutorial can be found here.

Background Information

When a browser wants to access a webpage, it sends a request to the server on which the files that make up the webpage are located. The server then sends a response consisting of the page's source code back to the browser. The browser then interprets the HTML, CSS, etc., in the source code, runs any JavaScript, and displays the page.

Of the parts that make up the source code of a webpage, the JavaScript code is of particular interest to us in this article. When a page is loaded into a browser, it becomes a document object whose HTML elements, or element nodes, JavaScript can access. As it runs, JavaScript can create new HTML elements and append them to the document. A page on which this occurs is called a page rendered by JavaScript. The new elements will not be present in the original source code of the page, the code that you see when you right-click and select "view page source"; but they will be present in the code that you see if you save the page as an HTML document and open it in a text editor, or in the "elements" tab of your browser's developer tools. This HTML code, which can be retrieved by JavaScript using the DOM's innerHTML property, constitutes the code of the completed webpage that the browser displays after the JavaScript has finished running, and it has all of the data that you need for scraping.

Python cannot access this code without the support of a browser. When a Python library such as urllib or requests sends a request to a server, it receives the source code of the webpage, just like a browser does. However, Python cannot run the JavaScript and let it create the elements that hold the content you need to scrape; the most it can do is parse the source code. In a situation like this, where you have to scrape content that is loaded dynamically by JavaScript and is not present in the page's source code, the Python module Selenium comes in handy.
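You can see this for yourself by fetching a JavaScript-rendered page with requests and checking whether the dynamically created content appears in the raw source. This is a minimal sketch; the URL and the search string are placeholders for the page and content you are actually after:

import requests

# a placeholder URL; substitute a JavaScript-rendered page you want to scrape
url = "http://example.com/page.php"

# requests receives only the original source code; no JavaScript runs
source = requests.get(url).text

# text created by JavaScript after the page loads will not appear here
print("dynamic content" in source)  # False for content rendered by JavaScript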

What Is Selenium?

Selenium is a module that lets you control a web browser from Python. With Selenium, you can use Python code to open a web browser, navigate to a page, log in (if needed), and retrieve the page's inner HTML, from which you can then scrape the data you need. Below, I will describe how to do each of these steps.

Installing and Importing Selenium

Step 1: Install Selenium

Do

pip install selenium

in the command line. (Pip is Python's package manager. If you don't have it, be sure to get it, because it allows one-step installation of most packages.)
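To check that the installation worked, you can import the module and print its version number:

import selenium
print(selenium.__version__)  # prints the installed version number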

Step 2: Use Selenium to open a web browser and navigate to a page

Whatever browser you choose to use, make sure that you already have it installed on your computer. Firefox is most commonly used with Selenium, but you can use others if you install the proper web driver and put it in your working directory. For example, if you want to use Chrome, the browser that I will be using in this example, download chromedriver.exe and put it in the folder where your Python script is (or anywhere on your PATH; newer versions of Selenium can also download a matching driver for you automatically). Web drivers for other browsers are available here.

Use the following code to initialize the browser object and go to a URL:

from selenium import webdriver


browser = webdriver.Chrome()  # replace with .Firefox(), or with the browser of your choice
url = "http://example.com/login.php"
browser.get(url)  # navigate to the page

You’ll see a browser window open and go to the page. Now, suppose you have to log in to a site to get to the pages that you need to scrape. (If you don’t, feel free to skip this step.) Thankfully, this is easy to do with Selenium.
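Before moving on, a quick way to confirm from your script that the navigation worked is to print the title of the current page, which Selenium exposes as the title property:

print(browser.title)  # prints the contents of the page's <title> tag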

Logging In to a Page

Step 1: Get the IDs of the form fields used for login

Go to the login page, right-click on the "username" (or email, etc.) form field, and click "inspect" or "inspect element." The developer tools pane will open, with its "elements" tab showing the HTML of the form. If it's not highlighted already, find the element that corresponds to the "username" form field. You'll need the value of its "id" attribute for the next step.

Repeat these steps to find the ID of the password form field, and that of any other form fields you need to fill in to log in. Also find the ID of the submit button.
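As a quick check that you found the right IDs, you can ask Selenium to locate the element and print a couple of its attributes. The ID here is a placeholder; use the one you found in the developer tools:

from selenium.webdriver.common.by import By

field = browser.find_element(By.ID, "username_id")  # placeholder ID
print(field.tag_name, field.get_attribute("type"))  # e.g. "input text" for a typical username field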

Step 2: Send your login information to the website

In your Python code, use the browser object to fill in and submit the login form like so:

from selenium.webdriver.common.by import By

username = browser.find_element(By.ID, "username_id")  # username form field
password = browser.find_element(By.ID, "password_id")  # password form field

username.send_keys("my_username")
password.send_keys("my_password")

submit_button = browser.find_element(By.ID, "submit_button_id")
submit_button.click()

In the browser, you should see the username and password form fields filled in with your username and password, and then the page should redirect to the page that you see when you log in. Now you can access all of the pages behind the login. To do so, just call the .get() method on your browser object again, passing the URL of the page that you want to navigate to.
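For example, assuming a hypothetical members-only page on the same site:

browser.get("http://example.com/members.php")  # placeholder URL for a page behind the login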

Retrieving the Inner HTML

To get the inner HTML of a page and return it as a string to your Python script, use the .execute_script() method. This method takes a string of JavaScript code as a parameter and executes it inside the browser. Call it like so to get the inner HTML of a page:

browser.get("http://example.com/page.php") #navigate to page behind login
innerHTML = browser.execute_script("return document.body.innerHTML") #returns the inner HTML as a string

You can print the variable innerHTML to verify that it has all of the data that you need. If it does, then you’ve successfully retrieved the page’s inner HTML! You can now parse it using your favorite HTML parsing library. I prefer lxml, which I will describe how to use in the next tutorial.
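As a small preview, here is a minimal sketch of parsing that string with lxml; the XPath expression is a placeholder for whatever elements actually hold your data:

from lxml import html

# build an element tree from the inner HTML string
tree = html.fromstring(innerHTML)

# extract data with XPath; this expression is just an example
headings = tree.xpath("//h2/text()")
print(headings)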
Happy scraping!
