Build a Web Scraper With Python
Working through this project will teach you the process and the tools you need to scrape any static website on the World Wide Web.
Inspect the Site Using Developer Tools:
Click through the site and interact with it just as a typical job seeker would. For example, you can scroll through the main page of the website.
Scrape HTML Content From a Page:
For this task, you’ll use Python’s requests library.
Create a virtual environment for your project before you install any external package.
Activate your new virtual environment, then type the following command in your terminal to install the external requests library:
pip install requests
import requests

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
This code issues an HTTP GET request to the given URL. It retrieves the HTML data that the server sends back and stores that data in a Python object.
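A request can also fail or come back with an error page, so it may be worth checking the response before parsing it. The following is a minimal sketch rather than part of the original walkthrough: the helper names `html_from_response` and `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def html_from_response(page: requests.Response) -> str:
    """Return the decoded HTML of a response, raising on HTTP error codes."""
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses.
    page.raise_for_status()
    # .text is the body decoded to str; .content holds the raw bytes.
    return page.text


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Issue a GET request and return the page HTML (hypothetical helper)."""
    # The timeout keeps the script from waiting indefinitely on a slow server.
    page = requests.get(url, timeout=timeout)
    return html_from_response(page)
```

Calling `fetch_html("https://realpython.github.io/fake-jobs/")` would then return the page's HTML as a string, or raise an exception if the server answers with an error status.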
Hidden Websites:
Some pages contain information that’s hidden behind a login. That means you’ll need an account to be able to scrape anything from the page. The process to make an HTTP request from your Python script is different from how you access a page from your browser. Just because you can log in to the page through your browser doesn’t mean you’ll be able to scrape it with your Python script.
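For sites that use a plain HTML login form, one possible approach is to send the form data yourself and reuse the resulting cookies. The sketch below assumes such a form; the function name `fetch_behind_login`, the URLs, and the form field names are all hypothetical, and many real sites add tokens or JavaScript steps that this does not handle.

```python
import requests


def fetch_behind_login(
    session: requests.Session,
    login_url: str,
    page_url: str,
    credentials: dict,
) -> str:
    """Log in with a form POST, then request a protected page.

    The URLs and the form field names in `credentials` are hypothetical;
    every site defines its own login form.
    """
    # A Session keeps cookies between requests, so a login cookie set by the
    # first response is sent along with the second request.
    login_response = session.post(login_url, data=credentials, timeout=10)
    login_response.raise_for_status()

    page = session.get(page_url, timeout=10)
    page.raise_for_status()
    return page.text
```

With a real site you would pass `requests.Session()` as the first argument, along with whatever field names that site's login form actually expects.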
Parse HTML Code With Beautiful Soup:
Beautiful Soup is a Python library for parsing structured data. It allows you to interact with HTML in a similar way to how you interact with a web page using developer tools. The library exposes a couple of intuitive functions you can use to explore the HTML you received. To get started, use your terminal to install Beautiful Soup:
pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup

URL = "https://realpython.github.io/fake-jobs/"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
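To see what the parsed object offers without making a network request, you can feed Beautiful Soup a small HTML string. This is a sketch only: the sample markup below is made up for illustration, and its id and class names are assumptions rather than a copy of the real page.

```python
from bs4 import BeautifulSoup

# A small stand-in for the downloaded page; the real markup is much larger.
SAMPLE_HTML = """
<html>
  <body>
    <div id="ResultsContainer">
      <div class="card-content">
        <h2 class="title">Senior Python Developer</h2>
      </div>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching element, or None if nothing matches.
results = soup.find(id="ResultsContainer")
first_title = soup.find("h2")

print(results.name)               # tag name of the matched element
print(first_title.text.strip())   # text inside the first <h2>
```

Because `find()` returns `None` when nothing matches, it is usually worth checking the result before calling methods on it.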
Find Elements by HTML Class Name:
results = soup.find(id="ResultsContainer")
job_elements = results.find_all("div", class_="card-content")
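The class-based lookup can be tried out on a small sample. The sketch below uses made-up HTML (the id, the `card-footer` class, and the job titles are assumptions) to show that `find_all()` returns only the elements carrying the requested class.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; only two of the three inner divs are job cards.
SAMPLE_HTML = """
<div id="results">
  <div class="card-content"><h2 class="title">Python Developer</h2></div>
  <div class="card-content"><h2 class="title">Energy Engineer</h2></div>
  <div class="card-footer">Apply</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# class is a reserved word in Python, so Beautiful Soup uses class_ instead.
job_elements = soup.find_all("div", class_="card-content")

# find_all() returns a list-like ResultSet; it is empty when nothing matches.
print(len(job_elements))
for job_element in job_elements:
    print(job_element.h2.text)
```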
Extract Text From HTML Elements:
for job_element in job_elements:
    title_element = job_element.find("h2", class_="title")
    company_element = job_element.find("h3", class_="company")
    location_element = job_element.find("p", class_="location")
    print(title_element.text.strip())
    print(company_element.text.strip())
    print(location_element.text.strip())
    print()
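If a card is missing one of these elements, `find()` returns `None` and calling `.text` on it raises an `AttributeError`. A slightly more defensive variant could collect the values into dictionaries instead of printing them. This is a sketch on made-up sample HTML; the helper name `text_or_none` and the company names are assumptions.

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the second card deliberately has no location.
SAMPLE_HTML = """
<div class="card-content">
  <h2 class="title"> Python Developer </h2>
  <h3 class="company">Example Corp</h3>
  <p class="location">Remote</p>
</div>
<div class="card-content">
  <h2 class="title">Energy Engineer</h2>
  <h3 class="company">Sample LLC</h3>
</div>
"""


def text_or_none(parent, tag_name, class_name):
    """Return the stripped text of the first matching child, or None."""
    element = parent.find(tag_name, class_=class_name)
    return element.text.strip() if element is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Build one dictionary per job card instead of printing directly.
jobs = [
    {
        "title": text_or_none(card, "h2", "title"),
        "company": text_or_none(card, "h3", "company"),
        "location": text_or_none(card, "p", "location"),
    }
    for card in soup.find_all("div", class_="card-content")
]

for job in jobs:
    print(job)
```

Keeping the data in a list of dictionaries makes it easier to filter or save the results later.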
Conclusion:
Running the script prints a readable list of jobs that includes each job’s title, company name, and location. However, you’re looking for a position as a software developer, and these results contain job postings in many other fields as well.
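One way to narrow the results is to pass a function as the `string` argument of `find_all()`, so that only headings whose text matches are returned. The following is a sketch on made-up sample HTML; the keyword "python" and the helper name `is_python_title` are assumptions, and you would pick whatever keyword fits the jobs you are after.

```python
from bs4 import BeautifulSoup

# Made-up sample markup with a mix of job titles.
SAMPLE_HTML = """
<div class="card-content"><h2 class="title">Senior Python Developer</h2></div>
<div class="card-content"><h2 class="title">Energy Engineer</h2></div>
<div class="card-content"><h2 class="title">Software Developer (Python)</h2></div>
"""


def is_python_title(text):
    """Return True if the heading text mentions Python, ignoring case."""
    # Guard against None, which can occur when a tag has no single string.
    return text is not None and "python" in text.lower()


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Only <h2> elements whose text passes the check are returned.
python_titles = soup.find_all("h2", string=is_python_title)

print([title.text.strip() for title in python_titles])
```

Matching on lowercase text avoids missing titles that differ only in capitalization.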
