Automated Data Extractor

Project Overview & Use Case

Not every website provides a clean, easily accessible API for downloading its data. Often, the data you want (like product prices, news headlines, or real estate listings) is embedded in the visual layout of a webpage. Web scraping is the process of writing a program that downloads a webpage and extracts the exact text you need.

The Use Case: Imagine you are a content creator looking to build a database of inspirational quotes to post daily on social media. Copying and pasting them one by one from a website is incredibly tedious.

The Output: This script acts as an automated bot. It visits a safe sandbox website designed specifically for scraping practice (quotes.toscrape.com), downloads the raw HTML of its first page, uses Beautiful Soup to parse it, isolates the quotes, authors, and tags, and saves everything into a .csv spreadsheet.

System Workflow (How It Works)

The HTTP Request: Beautiful Soup cannot browse the internet itself. We use the requests library to act like a web browser: it sends an HTTP GET request to the server and receives the raw HTML of the page in response.

Making the “Soup”: We feed that raw HTML text into Beautiful Soup, which parses it into a searchable tree of nested tags (similar in spirit to the Document Object Model, or DOM, that a browser builds).

Targeted Extraction (find_all): We instruct the bot to look for specific HTML tags and CSS classes. For example, we tell it to find every <div> tag that has the class “quote”.

Refining the Data (find and .text): Once we have isolated a single quote block, we dig deeper to find the exact <span> holding the text and the <small> tag holding the author’s name, stripping away the HTML markup to leave only plain text.

Storage: The script takes the extracted pure text and writes it row-by-row into a CSV file.
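The five steps above can be tried without touching the network. The following is a minimal sketch, assuming a hand-written HTML sample shaped like the quote blocks described here; SAMPLE_HTML, extract_rows, and rows_to_csv are illustrative names, not part of the project script, and the CSV is written to memory rather than to a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped like the quote blocks described above
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Stay curious.</span>
  <small class="author">Jane Doe</small>
  <a class="tag" href="/tag/life/">life</a>
  <a class="tag" href="/tag/learning/">learning</a>
</div>
<div class="quote">
  <span class="text">Keep going.</span>
  <small class="author">John Roe</small>
  <a class="tag" href="/tag/grit/">grit</a>
</div>
"""


def extract_rows(html):
    """Return one [quote, author, tags] row per <div class="quote"> block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for block in soup.find_all("div", class_="quote"):
        quote = block.find("span", class_="text").text
        author = block.find("small", class_="author").text
        tags = ", ".join(tag.text for tag in block.find_all("a", class_="tag"))
        rows.append([quote, author, tags])
    return rows


def rows_to_csv(rows):
    """Write the rows to an in-memory CSV and return it as a string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Quote", "Author", "Tags"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

Because the tags are joined with a comma, the csv module wraps that field in double quotes so it still counts as a single column.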

Source Code

web_scraper.py


import requests
from bs4 import BeautifulSoup
import csv

def scrape_quotes_website(url="http://quotes.toscrape.com"):
  """
  Scrapes quotes, authors, and tags from the target URL and saves them to a CSV.
  """
  print(f"🌐 Connecting to {url}...")
  
  # 1. Fetch the webpage using the requests library
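  # A timeout keeps the script from hanging if the server never responds
  try:
      response = requests.get(url, timeout=10)
  except requests.RequestException as error:
      print(f"❌ Could not reach {url}: {error}")
      return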
  
  # Check if the website blocked us or is down (200 means OK)
  if response.status_code != 200:
      print(f"❌ Failed to retrieve page. Status code: {response.status_code}")
      return

  print("✅ Connection successful! Parsing HTML...")

  # 2. Feed the raw HTML into Beautiful Soup
  soup = BeautifulSoup(response.text, 'html.parser')

  # 3. Find all the blocks of HTML that contain a quote
  # On this specific website, quotes are stored in <div class="quote">
  quote_blocks = soup.find_all('div', class_='quote')
  
  scraped_data = []

  print(f"🔍 Found {len(quote_blocks)} quotes on the page. Extracting details...")

  # 4. Loop through every block and extract the specific pieces of text
  for block in quote_blocks:
      
      # Find the text of the quote (stored in <span class="text">)
      # .text strips away the <span> tags and leaves only the words
      quote_text = block.find('span', class_='text').text
      
      # Find the author (stored in <small class="author">)
      author = block.find('small', class_='author').text
      
      # Find all the tags for this quote (stored in <a class="tag">)
      # Because there are multiple tags per quote, we use find_all and loop through them
      tag_elements = block.find_all('a', class_='tag')
      tags = [tag.text for tag in tag_elements]
      tags_string = ", ".join(tags) # Combine list into a single comma-separated string
      
      # Add the clean data to our master list
      scraped_data.append([quote_text, author, tags_string])

  # 5. Save the extracted data to a CSV spreadsheet
  csv_filename = "scraped_quotes.csv"
  print(f"💾 Saving data to {csv_filename}...")
  
  with open(csv_filename, mode='w', newline='', encoding='utf-8') as file:
      writer = csv.writer(file)
      # Write the headers
      writer.writerow(['Quote', 'Author', 'Tags'])
      # Write the data
      writer.writerows(scraped_data)
      
  print("🎉 Scraping complete! Open the CSV file to view your data.")

if __name__ == "__main__":
  print("=== Beautiful Soup Web Scraper ===")
  scrape_quotes_website()

Code Explanation (Beautiful Soup Concepts)

requests.get(url): Beautiful Soup only parses data; it doesn’t download it. The requests library is a widely used choice for making HTTP calls in Python. response.text gives us the raw HTML of the webpage as one long string.
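A download step can also be wrapped in a small helper so that failures surface as exceptions instead of status-code checks. This is a sketch only: fetch_html is a hypothetical name and the 10-second timeout is an assumed value, not something the project script defines.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    # timeout stops the call from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

A caller would typically wrap fetch_html(...) in try/except requests.RequestException, which covers connection errors, timeouts, and the HTTP errors raised above.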

BeautifulSoup(html, 'html.parser'): This is the core of the library. It takes the raw HTML text and parses it into a structured, easily searchable Python object (the “Soup”).
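To make the “searchable object” idea concrete, here is a small sketch that parses a made-up HTML string and walks the resulting tree. The markup and variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to show how the parsed tree can be navigated
html = "<html><head><title>Demo</title></head><body><p>Hi <b>there</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Tags can be reached as attributes of the soup object
title_text = soup.title.text

# Every tag knows its parent and its children
bold = soup.find("b")
parent_name = bold.parent.name
child_names = [child.name for child in soup.body.children]

print(title_text, parent_name, child_names)
```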

.find_all('tag', class_='name'): This searches everything inside the object it is called on (the whole page for soup, or a single block for block) and returns a list-like collection of every matching element. Note the underscore in class_: because class is a reserved keyword in Python, Beautiful Soup uses class_ to search for CSS classes.
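The sketch below, built on a made-up HTML snippet, shows three ways of asking for the same elements: the class_ keyword, an attrs dictionary, and a CSS selector via select(). It also shows that a search with no matches returns an empty result rather than raising an error.

```python
from bs4 import BeautifulSoup

# Made-up markup: two quote blocks (one with an extra class) and one unrelated div
html = """
<div class="quote featured">first</div>
<div class="quote">second</div>
<div class="sidebar">not a quote</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all("div", class_="quote")
by_attrs = soup.find_all("div", attrs={"class": "quote"})
by_selector = soup.select("div.quote")

# An element matches if "quote" is any one of its classes
print([div.text for div in by_keyword])

# No match gives an empty result, so a for-loop over it simply does nothing
no_tables = soup.find_all("table")
print(len(no_tables))
```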

.find('tag', class_='name'): Similar to find_all, but it returns only the first matching element, or None if nothing matches. Once we isolate a single quote block, we use .find() to grab its specific author.
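Since .find() gives back None when nothing matches, calling .text on the result directly fails with an AttributeError on pages where an element is missing. One possible guard, sketched with a hypothetical helper named text_or_default:

```python
from bs4 import BeautifulSoup

# Made-up block that has a quote but no author element
html = '<div class="quote"><span class="text">Hello</span></div>'
block = BeautifulSoup(html, "html.parser").find("div", class_="quote")


def text_or_default(parent, tag_name, css_class, default=""):
    """Return the first match's text, or `default` when nothing matches."""
    element = parent.find(tag_name, class_=css_class)
    return element.text if element is not None else default


quote = text_or_default(block, "span", "text")
author = text_or_default(block, "small", "author", default="Unknown")
print(quote, author)
```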

.text: This is the most satisfying part of scraping. An element might look like this: <span class="text">“The world is beautiful.”</span>. Adding .text to the end of your Beautiful Soup search returns only the text inside the element, without the surrounding HTML tags: “The world is beautiful.”
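One detail worth knowing: .text keeps whatever whitespace the HTML contained, and it also includes the text of nested tags. A small sketch with made-up markup, using the ordinary string method strip() to tidy the result:

```python
from bs4 import BeautifulSoup

# Made-up markup with line breaks, indentation, and a nested <em> tag
html = '<span class="text">\n  The world is <em>beautiful</em>.\n</span>'
span = BeautifulSoup(html, "html.parser").find("span", class_="text")

raw = span.text            # includes the newlines and indentation from the HTML
clean = span.text.strip()  # leading and trailing whitespace removed

print(repr(raw))
print(repr(clean))
```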

Execution Guide

Install Requirements: Open your terminal or command prompt. You need to install both Beautiful Soup and Requests: pip install beautifulsoup4 requests

Save the file: Create a new Python file named web_scraper.py and paste the provided code.

Run the script: Navigate to the folder in your terminal and execute: python web_scraper.py

Review Output: You will see the bot connect to the website, report how many quotes it found, and save them. The script writes scraped_quotes.csv to the folder you ran it from, which is the script's own folder if you followed the step above. You can open this file in Excel, Google Sheets, or Apple Numbers to view your scraped data!
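The output can also be checked from Python instead of a spreadsheet application. This is a minimal sketch; preview_csv is a hypothetical helper, and it assumes the file has the Quote, Author, and Tags header row that the script writes.

```python
import csv
from itertools import islice


def preview_csv(path, limit=3):
    """Return up to `limit` rows from the CSV as dictionaries keyed by header."""
    with open(path, newline="", encoding="utf-8") as file:
        reader = csv.DictReader(file)
        return list(islice(reader, limit))
```

For example, preview_csv("scraped_quotes.csv") would return the first three rows, each as a dictionary with the keys Quote, Author, and Tags.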