NLTK Exercises
Problem 1: Most text-processing code works on lists of strings rather than on a raw paragraph. Take a short paragraph of text. First, split the paragraph into individual sentences. Then, split the very first sentence into individual words (tokens).
nltk-exercises.py
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
# Download the Punkt tokenizer data (only needed once per environment)
nltk.download('punkt_tab', quiet=True)
# 1. The raw text
text = "NLTK is amazing! It makes processing text easy. Are you having fun?"
# 2. Split the text into sentences
sentences = sent_tokenize(text)
# 3. Split the first sentence into words
first_sentence_words = word_tokenize(sentences[0])
print("Sentences:\n", sentences)
print("\nWords from the first sentence:\n", first_sentence_words)
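To make the shape of the result concrete, here is a rough standard-library stand-in for the two tokenizers. It is a minimal sketch, not how `sent_tokenize` or `word_tokenize` work internally: the regular expressions and the helper names `naive_sentences` and `naive_words` are assumptions made for illustration, and they will mishandle abbreviations such as "Dr." that NLTK's tokenizers are built to cope with.

```python
import re


def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def naive_words(sentence):
    """Return runs of word characters, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "NLTK is amazing! It makes processing text easy. Are you having fun?"

sentences = naive_sentences(text)
first_sentence_words = naive_words(sentences[0])

print(sentences)
# ['NLTK is amazing!', 'It makes processing text easy.', 'Are you having fun?']
print(first_sentence_words)
# ['NLTK', 'is', 'amazing', '!']
```

Note that the punctuation mark becomes its own token, which is why Problem 2 has to filter punctuation out before counting.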
Problem 2: You have a sentence: “The data scientist loves data because data is the best.” Find out what the most common meaningful words are. To do this, you must convert the text to lowercase, remove punctuation, remove “stop words” (common filler words like “the”, “is”, “because”), and then count the remaining words.
nltk-exercises.py
import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
# 1. The raw text
text = "The data scientist loves data because data is the best."
# 2. Tokenize and convert to lowercase
words = word_tokenize(text.lower())
# 3. Load English stop words
stop_words = set(stopwords.words("english"))
# 4. Filter out stop words and punctuation (using .isalnum() to check for letters/numbers)
cleaned_words = [word for word in words if word.isalnum() and word not in stop_words]
# 5. Calculate frequency
frequency = FreqDist(cleaned_words)
print("Cleaned Words:", cleaned_words)
print("\nMost Common Words:", frequency.most_common(2))
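`FreqDist` is a subclass of Python's `collections.Counter`, so the counting step can be tried without NLTK installed. The sketch below assumes a tiny hand-written stop-word set (`STOP_WORDS`, hypothetical and far shorter than NLTK's English list) and a simple regular expression in place of `word_tokenize`; it only illustrates the lowercase, filter, count pipeline.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list for illustration only;
# NLTK's stopwords.words("english") is much longer.
STOP_WORDS = {"the", "is", "because"}

text = "The data scientist loves data because data is the best."

# Lowercase, then keep only runs of letters and digits (this drops punctuation).
tokens = re.findall(r"[a-z0-9]+", text.lower())

# Remove the filler words.
cleaned_words = [word for word in tokens if word not in STOP_WORDS]

# Count what is left; Counter offers the same most_common() method as FreqDist.
frequency = Counter(cleaned_words)

print(cleaned_words)
# ['data', 'scientist', 'loves', 'data', 'data', 'best']
print(frequency.most_common(1))
# [('data', 3)]
```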
Problem 3: Take the sentence: “The quick brown fox jumps over the lazy dog.” Use NLTK to automatically tag each word with its grammatical Part of Speech (e.g., Noun, Verb, Adjective).
nltk-exercises.py
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)
# 1. The raw text
sentence = "The quick brown fox jumps over the lazy dog."
# 2. Tokenize the text into words
words = word_tokenize(sentence)
# 3. Apply Part-of-Speech tagging
tagged_words = nltk.pos_tag(words)
print("Tagged Words:")
for word, tag in tagged_words:
    print(f"{word}: {tag}")
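By default `nltk.pos_tag` returns Penn Treebank tags such as `DT` or `NN`, not the words "Noun" or "Verb". A small lookup table can make the output easier to read. The sketch below is an assumption-laden illustration: `TAG_NAMES` covers only a handful of tags, `describe` is a hypothetical helper, and `sample_tagged` is written by hand rather than produced by the tagger, so real output may differ.

```python
# Descriptions for a small subset of Penn Treebank tags.
TAG_NAMES = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
}

# Hand-written (word, tag) pairs, assumed for illustration;
# they are not captured output from nltk.pos_tag.
sample_tagged = [
    ("The", "DT"),
    ("quick", "JJ"),
    ("fox", "NN"),
    ("jumps", "VBZ"),
    ("over", "IN"),
]


def describe(tagged):
    """Replace each tag with its description, keeping unknown tags unchanged."""
    return [(word, TAG_NAMES.get(tag, tag)) for word, tag in tagged]


for word, name in describe(sample_tagged):
    print(f"{word}: {name}")
```

In the exercise itself, passing `tagged_words` to `describe` in place of `sample_tagged` would give the same kind of listing for the full sentence.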