Movie Review Sentiment Analyzer (NLP Project)
Project Overview & Use Case
Understanding human language is one of the hardest tasks for a computer. Keras makes it incredibly accessible by providing high-level, easy-to-use building blocks for Deep Learning.
The Use Case: Imagine you work for a film studio or a product company. You receive thousands of user reviews daily. Reading them manually to figure out if people are happy or angry is impossible. You need an AI that can read text and automatically score it as “Positive” or “Negative” (a process called Sentiment Analysis).
The Output: This script downloads the well-known IMDB dataset of 50,000 movie reviews (25,000 for training and 25,000 for testing). The reviews arrive with their English words already encoded as numbers. The script trains a Keras neural network to associate those words with positive or negative sentiment, then tests the AI by having it read a review and predict its sentiment.
System Workflow (How It Works)
Data Loading & Tokenization: Computers cannot do math on words like “terrible” or “amazing.” Keras loads a version of the IMDB dataset in which every word has already been replaced by an integer ID based on how often it appears, so the most frequent words get the smallest numbers (in the raw word index, “the” = 1). The script keeps only the 10,000 most frequent words; anything rarer is replaced by a single “unknown” ID.
Padding Sequences: Neural networks expect input data to be uniform in size. Since reviews are all different lengths, we use Keras to “pad” short reviews with zeros at the end and to cut longer reviews down, so every review is exactly 250 word IDs long.
Building the Network:
- Embedding Layer: This is the magic of NLP. It plots every word on a multi-dimensional digital map. During training, words that carry similar sentiment (like “bad” and “awful”) tend to end up closer together.
- Dense Layers: After the word vectors of a review are averaged into a single vector (the GlobalAveragePooling1D layer in the code), these layers look at the combined meaning of all the words and try to make a decision.
- Output Layer: A single neuron using a sigmoid activation function outputs a score between 0 (confidently Negative) and 1 (confidently Positive).
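To get a feel for how small this network is, the number of learnable weights can be worked out by hand. The sketch below is plain arithmetic using the layer sizes from the script (10,000-word vocabulary, 16-dimensional embeddings, a 16-neuron hidden layer, one output neuron); the variable names are only for illustration.

```python
# Layer sizes taken from the script in this article.
vocab_size = 10000    # rows in the embedding table
embed_dim = 16        # length of each word vector
hidden_units = 16     # neurons in the hidden Dense layer

# Embedding: one 16-number vector per vocabulary entry, no bias.
embedding_params = vocab_size * embed_dim
# GlobalAveragePooling1D only averages, so it has nothing to learn.
pooling_params = 0
# Dense layers: one weight per input-output pair, plus one bias per neuron.
hidden_params = embed_dim * hidden_units + hidden_units
output_params = hidden_units * 1 + 1

total_params = embedding_params + pooling_params + hidden_params + output_params

print(f"Embedding layer: {embedding_params:,}")
print(f"Hidden layer:    {hidden_params:,}")
print(f"Output layer:    {output_params:,}")
print(f"Total:           {total_params:,}")
```

Almost all of the weights sit in the embedding table, which is why the choice of vocabulary size matters far more for model size than the Dense layers do.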
Decoding and Testing: The script grabs a random review from the test set, translates the numbers back into English text so you can read it, and then shows you the AI’s emotional analysis.
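The round trip from words to numbers and back can be tried on a toy scale without downloading anything. The sketch below is an illustration only: the three-sentence corpus and the encode and decode helpers are made up for this example, but they follow the same conventions as the IMDB data (IDs ranked by word frequency, shifted by 3, with 0, 1 and 2 reserved for padding, start and unknown).

```python
from collections import Counter

# Hypothetical three-review corpus, for illustration only.
corpus = [
    "the movie was great",
    "the plot was terrible",
    "the acting was great fun",
]

# Rank words by frequency: the most common word gets rank 1.
counts = Counter(word for review in corpus for word in review.split())
word_index = {
    word: rank
    for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

OFFSET = 3  # IDs 0, 1, 2 are reserved for <PAD>, <START>, <UNKNOWN>


def encode(text):
    """Turn a sentence into a list of IDs, beginning with the <START> marker."""
    ids = [1]
    for word in text.lower().split():
        if word in word_index:
            ids.append(word_index[word] + OFFSET)
        else:
            ids.append(2)  # word never seen in the corpus
    return ids


# Invert the dictionary so IDs can be turned back into words.
reverse_word_index = {rank + OFFSET: word for word, rank in word_index.items()}
reverse_word_index.update({0: "<PAD>", 1: "<START>", 2: "<UNKNOWN>"})


def decode(ids):
    """Turn a list of IDs back into readable text."""
    return " ".join(reverse_word_index.get(i, "?") for i in ids)


encoded = encode("the movie was boring")
print(encoded)
print(decode(encoded))
```

The word “boring” never appears in the toy corpus, so it comes back as <UNKNOWN>. The same thing happens in the real script to any word outside the 10,000 most frequent.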
Source Code
import tensorflow as tf
from tensorflow import keras
import numpy as np
import random

def build_nlp_model():
    """Loads text data, preprocesses it, and trains a Sentiment Analysis AI."""
    print("📥 Downloading IMDB Movie Reviews dataset...")
    # Keep only the top 10,000 most frequently occurring words in the training data
    vocab_size = 10000
    # Load the data. It comes pre-tokenized (words are already numbers)
    (x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=vocab_size)

    # 1. Preprocessing: Pad sequences
    # Some reviews are 50 words, some are 500. We force them all to be 250 words:
    # shorter reviews get zeros appended, longer ones are cut down to 250.
    print("⚙️ Padding sequences to uniform lengths...")
    max_length = 250
    x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_length, padding='post')
    x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_length, padding='post')

    # 2. Build the Keras Model
    print("🧠 Building the Keras NLP Neural Network...")
    model = keras.Sequential([
        # Embedding layer learns the 'meaning' of words based on their context.
        # (input_length is left out: recent Keras versions deprecate it and
        # infer the sequence length from the data.)
        keras.layers.Embedding(input_dim=vocab_size, output_dim=16),
        # Averages the word vectors of a review into a single 16-number vector
        keras.layers.GlobalAveragePooling1D(),
        # A standard hidden layer to find patterns
        keras.layers.Dense(16, activation='relu'),
        # Output layer: 1 neuron.
        # Sigmoid function outputs a probability between 0.0 (Negative) and 1.0 (Positive)
        keras.layers.Dense(1, activation='sigmoid')
    ])

    # 3. Compile the Model
    # Binary crossentropy is the standard loss function for Yes/No, True/False outputs
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # 4. Train the Model
    print("\n🏋️ Training the AI on the 25,000-review training set (This takes a moment)...")
    # validation_split=0.2 holds out 20% of the training reviews (5,000) so we can
    # check that the AI isn't just memorizing the answers
    history = model.fit(x_train, y_train, epochs=3, batch_size=512, validation_split=0.2, verbose=1)

    print("\n📊 Evaluating on unseen test data...")
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
    print(f"🔹 Final AI Accuracy on brand new reviews: {test_acc * 100:.2f}%")
    return model, x_test, y_test

def decode_and_test(model, x_test, y_test):
    """Picks a random test review, decodes it to English, and prints the AI's prediction."""
    print("\n" + "=" * 50)
    print("🎬 LIVE AI SENTIMENT TEST")
    print("=" * 50)

    # 1. Get the dictionary that maps words to numbers, then invert it
    word_index = keras.datasets.imdb.get_word_index()
    # The IMDB dataset offsets word indices by 3 to reserve IDs for special tokens
    reverse_word_index = {value + 3: key for key, value in word_index.items()}
    reverse_word_index[0] = "<PAD>"  # Padding
    reverse_word_index[1] = "<START>"
    reverse_word_index[2] = "<UNKNOWN>"

    def decode_review(encoded_text):
        return ' '.join([reverse_word_index.get(i, '?') for i in encoded_text])

    # 2. Pick a random review from the test set
    idx = random.randint(0, len(x_test) - 1)
    raw_review = x_test[idx]
    actual_sentiment = "Positive" if y_test[idx] == 1 else "Negative"

    # 3. Print the readable text
    english_review = decode_review(raw_review)
    # Clean up the start and padding tags so it is easier to read
    english_review = english_review.replace("<START> ", "").replace("<PAD>", "").strip()
    print(f'\n📖 RANDOM REVIEW TEXT:\n"{english_review[:400]}... [truncated]"')

    # 4. Ask the AI to predict the sentiment
    # The model expects a batch, so we add a batch dimension: shape (250,) -> (1, 250)
    prediction_data = np.expand_dims(raw_review, axis=0)
    ai_confidence = model.predict(prediction_data, verbose=0)[0][0]
    # If the score is >= 0.5, it leans positive. If < 0.5, it leans negative.
    ai_prediction = "Positive" if ai_confidence >= 0.5 else "Negative"

    print("\n" + "-" * 50)
    print(f"🎯 ACTUAL SENTIMENT: {actual_sentiment}")
    print(f"🤖 AI PREDICTION: {ai_prediction} (Score: {ai_confidence:.4f})")
    if actual_sentiment == ai_prediction:
        print("✅ The AI understood the emotion correctly!")
    else:
        print("❌ The AI was confused by the wording.")
    print("=" * 50 + "\n")

if __name__ == "__main__":
    trained_model, test_reviews, test_labels = build_nlp_model()
    decode_and_test(trained_model, test_reviews, test_labels)
Code Explanation (Keras Concepts)
keras.datasets: Keras includes several built-in datasets to practice with. imdb.load_data() returns the reviews already encoded as sequences of integers, which makes it convenient for NLP experiments. It handles downloading and splitting the data into training and testing sets automatically.
keras.preprocessing.sequence.pad_sequences: Neural networks are essentially massive math equations. A math equation expects a consistent number of variables. If one review has 10 words and another has 100, they cannot be stacked into a single rectangular batch, and training fails. pad_sequences standardizes the input size by padding short reviews and truncating long ones.
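What pad_sequences does to each review can be imitated in a few lines of plain Python. The helper below, pad_post, is a hypothetical name used only for this sketch and is not part of Keras. It mimics the settings the script relies on: zeros are appended at the end (padding='post'), and reviews that are too long lose their beginning, which is the Keras default truncation behavior. It assumes maxlen is at least 1.

```python
def pad_post(sequence, maxlen, value=0):
    """Return a copy of `sequence` that is exactly `maxlen` items long."""
    # Too long: keep only the last `maxlen` items (truncating from the front).
    trimmed = sequence[-maxlen:]
    # Too short: append `value` until the length matches.
    return trimmed + [value] * (maxlen - len(trimmed))


short_review = [1, 14, 22, 16]
long_review = [1, 14, 22, 16, 43, 530, 973, 1622]

batch = [pad_post(review, maxlen=6) for review in (short_review, long_review)]
for row in batch:
    print(row)
```

After this step every row has the same length, so the reviews can be stacked into one rectangular array and fed to the network in batches.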
keras.layers.Embedding: This is one of Keras’ most useful NLP layers. Instead of treating words as completely isolated items, it assigns each word a point in a multi-dimensional space (16 dimensions in this script) and adjusts those points during training. As a result, words that push a review in the same direction, such as “brilliant” and “fantastic,” tend to end up geometrically close to each other.
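The idea of words being “geometrically close” can be made concrete with a tiny hand-written table. Everything in the sketch below is an assumption for illustration: the word IDs, the two-dimensional vectors and the helper names are invented, whereas the real layer learns 16-dimensional vectors during training. The average_pool helper shows the kind of averaging that GlobalAveragePooling1D performs on a review's word vectors.

```python
import math

# Hypothetical 2-dimensional embedding table with made-up values.
embedding_table = {
    4: [0.9, 0.8],     # pretend ID 4 = "brilliant"
    5: [0.8, 0.9],     # pretend ID 5 = "fantastic"
    6: [-0.9, -0.7],   # pretend ID 6 = "awful"
}


def cosine_similarity(a, b):
    """Close to 1.0 = same direction, close to -1.0 = opposite direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def average_pool(ids):
    """Look up each word vector, then average them into one review vector."""
    vectors = [embedding_table[i] for i in ids]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


similar = cosine_similarity(embedding_table[4], embedding_table[5])
opposite = cosine_similarity(embedding_table[4], embedding_table[6])
print(f"brilliant vs fantastic: {similar:.3f}")
print(f"brilliant vs awful:     {opposite:.3f}")
print("review vector:", average_pool([4, 5]))
```

In the trained model nobody writes these numbers by hand. The optimizer nudges them so that the averaged review vector makes the final prediction as accurate as possible.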
Binary Cross-Entropy: In our previous project (Digit Recognizer), the AI had to choose between 10 different numbers, so we used Sparse Categorical cross-entropy. Here, there are only two possible answers (Positive or Negative) and a single sigmoid output, so Binary cross-entropy is the natural loss function. It penalizes the AI most heavily when it is confident and wrong.
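The loss itself is a short formula: for each review with label y (1 or 0) and predicted score p, the penalty is -[y*log(p) + (1-y)*log(1-p)], averaged over all reviews. The sketch below computes it in plain Python. The function name, the clipping constant eps and the sample scores are assumptions chosen for this example, not values taken from Keras or from a real training run.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Average of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) can never occur
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1]                      # Positive, Negative, Positive
confident_and_right = [0.9, 0.1, 0.8]   # made-up scores
confident_and_wrong = [0.1, 0.9, 0.2]   # made-up scores

loss_right = binary_cross_entropy(labels, confident_and_right)
loss_wrong = binary_cross_entropy(labels, confident_and_wrong)
print(f"Loss when mostly right: {loss_right:.4f}")
print(f"Loss when mostly wrong: {loss_wrong:.4f}")
```

The second loss comes out far larger than the first, which is the signal the optimizer uses to push the scores toward the correct labels.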
Sigmoid Activation: The final layer uses a sigmoid function. This creates an “S-shaped” curve that takes any number the network calculates and squashes it into a value strictly between 0.0 and 1.0, which we read as the probability that the review is Positive.
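The squashing behavior is easy to see by evaluating the function at a few points. This is a minimal sketch: the sigmoid helper is written by hand rather than taken from Keras, the input values are arbitrary, and the 0.5 cut-off mirrors the rule used in the script above. It is meant for moderate inputs like these, since math.exp can overflow for very large negative numbers.

```python
import math


def sigmoid(z):
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    score = sigmoid(z)
    label = "Positive" if score >= 0.5 else "Negative"
    print(f"z = {z:+.1f} -> score {score:.4f} -> {label}")
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of exactly 0 gives 0.5, the point where the script switches from Negative to Positive.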
Execution Guide
Ensure TensorFlow is installed: Open your terminal and install the required packages (pip skips anything that is already installed): pip install tensorflow numpy
Save the file: Create a new Python file named sentiment_analyzer.py and paste the code above.
Run the script: Execute the script from your terminal: python sentiment_analyzer.py
Observe the Output: Watch the terminal as Keras compiles the architecture and trains the model. Once it finishes its 3 training epochs, it will grab a random movie review, translate the numbers back into readable English, and print out exactly what emotion the AI detected in the text!