The files containing all of the code that I use in this tutorial can be found here.
What Is Selenium?
Selenium is a module that allows you to access a web browser through Python. With Selenium, you can use Python code to open a web browser, navigate to a page, log in (if needed), and return the page's inner HTML, from which you can then scrape the data you need. In the following, I will describe how to do each of these steps.
Installing and Importing Selenium
Step 1: Install Selenium
Run the following in the command line:
pip install selenium
Pip is Python's package manager. If you don't have it, install it first, because it allows one-step installation of most packages.
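To confirm that the installation worked, you can ask Python which version of the package it sees. The following is a minimal sketch using the standard library's importlib.metadata (available in Python 3.8 and later). The helper name installed_version is my own and is not part of Selenium.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("selenium")
if version is None:
    print("Selenium is not installed; run: pip install selenium")
else:
    print("Selenium version:", version)
```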
Step 2: Use Selenium to open a web browser and navigate to a page
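Whatever browser you choose to use, make sure that you have it already installed on your computer. Firefox is most commonly used with Selenium, but you can use others if you install the proper web driver and put it in your working directory. For example, if you want to use Chrome, the browser that I will be using in this example, download chromedriver (chromedriver.exe on Windows) in a version that matches your installed Chrome and put it in the folder where your Python script is. (Selenium 4.6 and later include Selenium Manager, which can usually find or download a matching driver automatically, so this manual step may not be needed.) Webdrivers for other browsers are available here.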
Use the following code to initialize the browser object and go to a URL:
from selenium import webdriver
browser = webdriver.Chrome() #replace with .Firefox(), or with the browser of your choice
url = "http://example.com/login.php"
browser.get(url) #navigate to the page
You’ll see a browser window open and go to the page. Now, suppose you have to log in to a site to get to the pages that you need to scrape. (If you don’t, feel free to skip this step.) Thankfully, this is easy to do with Selenium.
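One thing the code so far does not do is close the browser when the script finishes or crashes. A way to make sure it always closes is to wrap it in a context manager. This is only a sketch under my own assumptions: open_browser and fetch_title are hypothetical helper names, not part of Selenium, and the only Selenium calls used are webdriver.Chrome(), .get(), .title, and .quit().

```python
from contextlib import contextmanager


@contextmanager
def open_browser(make_browser):
    """Create a browser with make_browser() and quit it when the block exits."""
    browser = make_browser()
    try:
        yield browser
    finally:
        browser.quit()  # closes every window and ends the driver session


def fetch_title(url):
    """Open Chrome, load url, and return the page title (needs Selenium and Chrome)."""
    # Imported here so that open_browser itself works without Selenium installed.
    from selenium import webdriver

    with open_browser(webdriver.Chrome) as browser:
        browser.get(url)
        return browser.title
```

Calling fetch_title("http://example.com/login.php") should open Chrome, load the page, and close the window again, even if .get() raises an exception along the way.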
Logging In to a Page
Step 1: Get the IDs of the form fields used for login
Go to the login page, right-click on the “username” (or email, etc.) form field, and click “inspect” or “inspect element.” The developer tools panel, whose “elements” tab shows the HTML of the form, will open at the bottom of the window. If it’s not highlighted already, find the element that corresponds to the “username” form field. You’ll need the value of its “id” attribute for the next step.
Repeat these steps to find the ID of the password form field, and that of any other form fields you need to fill in to log in. Also find the ID of the submit button.
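If you would rather list the candidate IDs from Python than click through the developer tools, something like the following may help. It is a sketch that uses the standard library's html.parser on a page's HTML string. The class and function names are my own, and the sample form is made up to match the placeholder IDs used in this tutorial.

```python
from html.parser import HTMLParser


class FormFieldLister(HTMLParser):
    """Collect (tag, type, id) for every <input> and <button> that has an id."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in ("input", "button"):
            attributes = dict(attrs)
            if attributes.get("id"):
                self.fields.append((tag, attributes.get("type"), attributes["id"]))


def list_form_fields(html):
    """Return the (tag, type, id) triples found in an HTML string."""
    lister = FormFieldLister()
    lister.feed(html)
    return lister.fields


# Made-up login form for illustration only.
sample = """
<form action="/login.php" method="post">
  <input type="text" id="username_id" name="user">
  <input type="password" id="password_id" name="pass">
  <button type="submit" id="submit_button_id">Log in</button>
</form>
"""
for tag, field_type, field_id in list_form_fields(sample):
    print(tag, field_type, field_id)
```

With a live browser object, you could pass browser.page_source instead of the sample string. Fields without an id attribute are skipped, so you would still need the developer tools for those.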
Step 2: Send your login information to the website
In your Python code, use the browser to post your username/password data to the website like so:
from selenium.webdriver.common.by import By #find_element_by_id was removed in Selenium 4.3
username = browser.find_element(By.ID, "username_id") #username form field
password = browser.find_element(By.ID, "password_id") #password form field
username.send_keys("my_username")
password.send_keys("my_password")
submitButton = browser.find_element(By.ID, "submit_button_id")
submitButton.click() #submit the login form
In the browser, you should see the username/password form fields filled in with your username and password, and then the browser should redirect to the page that you normally see after logging in. Now, you can access all of the pages behind the login. To do so, just call the .get() method on your browser object again, passing the URL of the page that you want to navigate to.
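If you log in to several sites, the fill-and-click steps can be folded into a small helper, and it can be useful to check that the redirect actually happened before moving on. The sketch below is one way to do that. The names log_in and wait_for_url_change are my own, the default "id" is the string value of Selenium's By.ID constant, and the polling loop is a plain-Python stand-in for Selenium's own WebDriverWait.

```python
import time


def log_in(browser, fields, submit_id, by="id"):
    """Type text into each form field, then click the submit button.

    fields maps element IDs to the text to type into them.
    by is the locator strategy; "id" is the value of Selenium's By.ID.
    """
    for element_id, text in fields.items():
        element = browser.find_element(by, element_id)
        element.clear()  # remove any pre-filled value first
        element.send_keys(text)
    browser.find_element(by, submit_id).click()


def wait_for_url_change(browser, old_url, timeout=10.0, poll=0.5):
    """Poll browser.current_url until it differs from old_url.

    Returns True if the URL changed within timeout seconds, else False.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if browser.current_url != old_url:
            return True
        time.sleep(poll)
    return browser.current_url != old_url


# Example use with a real browser object (not run here):
#   login_url = browser.current_url
#   log_in(browser,
#          {"username_id": "my_username", "password_id": "my_password"},
#          "submit_button_id")
#   if not wait_for_url_change(browser, login_url):
#       print("Still on the login page; check the credentials and element IDs.")
```

This assumes that a successful login changes the URL, which is common but not universal. If your site stays on the same URL, you would need to check for some element that only appears after logging in.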
Retrieving the inner HTML
browser.get("http://example.com/page.php") #navigate to page behind login
innerHTML = browser.execute_script("return document.body.innerHTML") #returns the inner HTML as a string
You can print the variable innerHTML to verify that it has all of the data that you need. If it does, then you’ve successfully retrieved the page’s inner HTML! You can now parse it using your favorite HTML parsing library. I prefer lxml, which I will describe how to use in the next tutorial.
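Until then, here is a quick way to sanity-check the string without installing anything extra. This sketch uses the standard library's html.parser as a stand-in for lxml to pull out every link on the page. The LinkCollector class, the extract_links function, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML string."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text pieces seen inside that <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) pairs for the links in html."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# Stand-in for the innerHTML string returned by the browser.
sample_inner_html = (
    '<p>See <a href="/a.php">Page A</a> and '
    '<a href="/b.php">Page <b>B</b></a>.</p>'
)
print(extract_links(sample_inner_html))
# prints [('/a.php', 'Page A'), ('/b.php', 'Page B')]
```

In a real script you would pass the innerHTML variable from above to extract_links. Links without an href attribute are ignored by this sketch.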