Smart Article Summarizer

Project Overview & Use Case

The Use Case: We live in the information age, but nobody has the time to read massive 10-page articles, research papers, or news reports. You need a tool that can read a wall of text and automatically extract a short, punchy summary of the most important points.

The Output: This script uses an Extractive Summarization technique. It reads a long piece of text, breaks it down into individual words and sentences, removes useless filler words (like “the” or “is”), calculates which remaining keywords are the most statistically significant, and then builds a summary using the sentences that contain those high-value keywords.

System Workflow (How It Works)

Data Ingestion: The script takes a long, multi-paragraph string of text.

Tokenization: It uses NLTK to chop the massive text block into two lists: a list of individual sentences, and a list of individual words.

Data Cleaning: It filters out punctuation and stopwords (common filler words like “a”, “an”, “the”, and “in” that carry little meaning on their own).

Frequency Distribution: It counts how many times the meaningful words appear. If the word “Algorithm” appears 10 times, it gets a high mathematical weight.

Sentence Scoring: The script looks at every sentence. If a sentence contains a lot of high-weight words, that sentence gets a high score.

Extraction: It grabs the top 3 highest-scoring sentences and prints them out as the final summary.
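
The scoring and extraction steps above can be sketched on toy data using only the standard library. This is a simplified illustration rather than the script itself: the word weights and sentences are made up, and a crude whitespace split stands in for NLTK's tokenizers.

```python
# Hypothetical normalized word weights (1.0 = most frequent word).
weights = {"algorithm": 1.0, "data": 0.5, "sorting": 0.25}

sentences = [
    "The algorithm sorts data quickly.",
    "Lunch was late today.",
    "Sorting data with this algorithm saves time.",
]

def score_sentence(sentence, weights):
    """Add up the weights of every known word in the sentence."""
    # Crude tokenization for illustration: lowercase, split on spaces, strip punctuation.
    tokens = [token.strip(".,!?") for token in sentence.lower().split()]
    return sum(weights.get(token, 0.0) for token in tokens)

scores = {sentence: score_sentence(sentence, weights) for sentence in sentences}

# Rank sentences from highest to lowest score and keep the best two.
ranked = sorted(scores, key=scores.get, reverse=True)
top_two = ranked[:2]
print(top_two)
```

The sentence about lunch contains no weighted words, so it scores zero and never reaches the summary.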

Source Code

article_summarizer.py


import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
import string
import heapq

# --- Initial Setup: Download required NLTK datasets ---
# NLTK needs its tokenizer models and stopword lists downloaded once before first use
def download_nltk_data():
  # Recent NLTK releases load the sentence tokenizer from 'punkt_tab';
  # older releases use 'punkt'. Fetching both is an assumption made here
  # so the script runs regardless of the installed version.
  resources = [
      ('tokenizers/punkt', 'punkt'),
      ('tokenizers/punkt_tab', 'punkt_tab'),
      ('corpora/stopwords', 'stopwords'),
  ]
  for path, name in resources:
      try:
          nltk.data.find(path)
      except LookupError:
          print(f"📥 Downloading NLTK data: {name}...")
          nltk.download(name, quiet=True)

def summarize_text(text, num_sentences=3):
  """
  Summarizes a block of text by extracting the most statistically significant sentences.
  """
  print("\n⚙️ Processing text with NLTK...")

  # 1. Tokenize the text into sentences and words
  sentences = sent_tokenize(text)
  words = word_tokenize(text.lower()) # Convert to lowercase for accurate counting

  # 2. Get the list of English "stopwords" and punctuation
  stop_words = set(stopwords.words("english"))
  punctuation = set(string.punctuation)

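  # 3. Clean the words (Remove stopwords and punctuation)
  # word_tokenize can emit multi-character punctuation tokens (e.g. `` and '' for quotes),
  # so also require at least one letter or digit
  cleaned_words = []
  for word in words:
      if word not in stop_words and word not in punctuation and any(ch.isalnum() for ch in word):
          cleaned_words.append(word)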

  # 4. Calculate the frequency of each remaining word
  word_frequencies = FreqDist(cleaned_words)
  
  # Find the maximum frequency to normalize the scores (scale them between 0 and 1)
  if not word_frequencies:
      return ""  # Nothing left to score (empty input or only stopwords/punctuation)
  maximum_frequency = max(word_frequencies.values())
  for word in word_frequencies:
      word_frequencies[word] = word_frequencies[word] / maximum_frequency

  # 5. Score the sentences based on the words they contain
  sentence_scores = {}
  for sentence in sentences:
      sentence_words = word_tokenize(sentence.lower())
      # Skip sentences of 30 or more tokens so we don't just pick the longest sentences
      if len(sentence_words) >= 30:
          continue
      for word in sentence_words:
          if word in word_frequencies:
              sentence_scores[sentence] = sentence_scores.get(sentence, 0) + word_frequencies[word]

  # 6. Extract the top N highest-scoring sentences
  print(f"📊 Extracting the top {num_sentences} most important sentences...")
  summary_sentences = heapq.nlargest(num_sentences, sentence_scores, key=sentence_scores.get)
  
  # Join them back into a single paragraph
  summary = " ".join(summary_sentences)
  return summary

if __name__ == "__main__":
  # Ensure NLTK data is ready
  download_nltk_data()

  # A sample long article to summarize
  SAMPLE_ARTICLE = """
  Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans. 
  Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals. 
  Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving", however this definition is rejected by major AI researchers.
  AI applications include advanced web search engines, recommendation systems, understanding human speech, self-driving cars, automated decision-making and competing at the highest level in strategic game systems. 
  As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. 
  For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology.
  The field was founded on the assumption that human intelligence can be so precisely described that a machine can be made to simulate it.
  """

  print("=== NLTK Article Summarizer ===")
  print("\n📜 ORIGINAL TEXT LENGTH:", len(SAMPLE_ARTICLE), "characters")
  
  # Run the summarizer
  final_summary = summarize_text(SAMPLE_ARTICLE, num_sentences=3)
  
  print("-" * 50)
  print("✨ GENERATED SUMMARY:")
  print("-" * 50)
  print(final_summary)
  print("-" * 50)
  print(f"📉 NEW TEXT LENGTH: {len(final_summary)} characters")
  print("\n✅ Summary successfully generated!")

Code Explanation (NLTK Concepts)

Tokenization (word_tokenize and sent_tokenize): To a computer, text is one long, unbroken string of characters. Tokenization splits that string into meaningful units. sent_tokenize relies on NLTK's pre-trained Punkt model, so it can usually tell a period that ends a sentence (“The car is red.”) from a period used in an abbreviation (“Dr. Smith went to Washington, D.C.”), although unusual abbreviations can still trip it up.
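
To see why a trained sentence tokenizer matters, here is a minimal sketch of the naive alternative using only the standard library. The helper name naive_sentence_split is hypothetical and not part of NLTK. It splits after every period, question mark, or exclamation mark that is followed by whitespace, so it has no way to recognize abbreviations.

```python
import re

def naive_sentence_split(text):
    """Split after ., ! or ? when followed by whitespace (no abbreviation handling)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

pieces = naive_sentence_split("Dr. Smith went to Washington. He arrived late.")
print(pieces)
# The abbreviation "Dr." is wrongly treated as the end of a sentence,
# so the two real sentences come back as three pieces.
```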

Stopwords (stopwords.words("english")): Words like “is”, “the”, “of”, and “and” account for a large share of any English text but say little about its topic. NLTK ships a built-in list of these words so we can filter them out of our data before counting.
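
The filtering idea can be shown in a few lines without downloading anything. The sketch below assumes a tiny hand-written stopword set as a stand-in for NLTK's much longer English list, and remove_filler is a hypothetical helper name.

```python
import string

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "of", "and", "a", "in"}

def remove_filler(tokens):
    """Drop stopwords and single-character punctuation tokens."""
    punctuation = set(string.punctuation)
    return [t for t in tokens if t not in STOP_WORDS and t not in punctuation]

tokens = ["the", "algorithm", "is", "a", "model", "of", "learning", "."]
kept = remove_filler(tokens)
print(kept)
```

Storing the stopwords in a set keeps each membership check fast, which is why the script also wraps NLTK's list in set().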

Frequency Distribution (FreqDist): This is NLTK's counting dictionary (a subclass of Python's collections.Counter). Pass it a list of words and it records how many times each word appears. In our code, we divide every count by maximum_frequency to normalize the scores, so the most frequent word gets exactly 1.0 and every other word gets a fraction of that.
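
As a small worked example of the normalization, the sketch below uses collections.Counter from the standard library as a stand-in for FreqDist. The word list is made up for illustration.

```python
from collections import Counter

words = ["algorithm", "data", "algorithm", "model", "algorithm", "data"]

# Counter stands in for FreqDist here; both map each word to its count.
counts = Counter(words)

# Divide every count by the largest one so the top word scores exactly 1.0.
max_count = max(counts.values())
weights = {word: count / max_count for word, count in counts.items()}
print(weights)
```

Here “algorithm” appears 3 times and becomes 1.0, “data” appears twice and becomes 2/3, and “model” appears once and becomes 1/3.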

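heapq.nlargest: heapq is a module in Python's standard library (not part of NLTK, but very useful here), and nlargest is one of its functions. Instead of fully sorting every scored sentence from top to bottom, it keeps track of only the top N values as it scans, which is cheaper than a full sort when N is small compared with the number of sentences. Note that it returns the sentences in score order, not in the order they appear in the article.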
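
A short sketch of the selection step, using made-up sentence scores. The final re-sorting into article order is an optional extra shown for illustration; the script itself joins the sentences in the order nlargest returns them.

```python
import heapq

# Hypothetical sentence scores, listed in the order the sentences appear in the article.
sentence_scores = {
    "First sentence.": 1.2,
    "Second sentence.": 2.9,
    "Third sentence.": 0.7,
    "Fourth sentence.": 3.4,
}

# Pick the two highest-scoring sentences (returned best-first).
top = heapq.nlargest(2, sentence_scores, key=sentence_scores.get)

# Optional: restore the original article order for a more readable summary.
order = list(sentence_scores)
in_article_order = sorted(top, key=order.index)
print(top)
print(in_article_order)
```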

Execution Guide

Install Requirements: Open your terminal or command prompt and run: pip install nltk

Save the file: Create a new Python file named article_summarizer.py and paste the provided code.

Run the script: Navigate to the folder in your terminal and execute: python article_summarizer.py

Review Output: The first time you run this, you will see a short message while the script downloads the NLTK data files. It then processes the Wikipedia excerpt about AI, discards the filler words, scores the sentences, and prints a 3-sentence summary built from the highest-scoring sentences.