Customer Churn Predictor (Machine Learning Project)

Project Overview & Use Case

Machine Learning is all about teaching a computer to recognize patterns in data so it can make decisions on its own. Scikit-learn provides the tools to build, train, and evaluate these predictive models.

The Use Case: Imagine you work for a subscription-based company (like Netflix or a telecom provider). Losing customers (“churning”) is expensive. If you can predict which customers are likely to cancel their subscriptions before they actually do, the marketing team can offer them targeted discounts to stay.

The Output: This script generates a dataset of mock customer behavior, trains a Machine Learning algorithm (a Random Forest) to learn the signs of a churning customer, evaluates how accurate the AI is, and then makes predictions on brand-new customers.

System Workflow (How It Works)

Data Generation: Since we don’t have a real company database, we use Scikit-learn’s built-in make_classification function to generate 1,000 mock customer profiles. Each profile has four numeric features (stand-ins for things like monthly charges or support tickets) and a label (0 = Stayed, 1 = Churned).

Train/Test Split: A golden rule of Machine Learning is that you never test your AI on the data it used to study. The script splits the data: 80% is used for training, and 20% is hidden away for the final test.

Model Training: We initialize a RandomForestClassifier and feed it the training data using the .fit() command. The AI mathematically figures out the relationship between customer behavior and cancellation.

Evaluation: We force the AI to make predictions on the hidden 20% of the data using .predict(), and grade it by comparing its guesses to the actual reality using an accuracy score and a detailed report.

Live Prediction: Finally, we feed the AI a completely new, made-up customer to see what it predicts.
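The first two steps above can be checked in isolation before any model is trained. The sketch below reuses the generator and split settings from the script; the printed summaries are illustrative additions and not part of the original program.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same generator settings as the script: 1,000 rows, 4 features, roughly 70/30 classes.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)

# Hold out 20% of the rows for the final test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

print("Feature matrix shape:", X.shape)
print("Training rows:", len(X_train), "| Test rows:", len(X_test))
# np.bincount counts how many 0s (stayed) and 1s (churned) the labels contain.
print("Label counts [stayed, churned]:", np.bincount(y))
```

If the class ratio should be identical in both halves, train_test_split also accepts stratify=y; the script does not use it, so treat that as an optional variation.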

Source Code

churn_predictor.py

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import numpy as np

def generate_mock_data():
  """Generates a synthetic dataset of 1,000 customers."""
  print("⚙️ Generating synthetic customer data...")
  # Features might represent: [MonthlyCharge, SupportTickets, MonthsActive, AddOns]
  X, y = make_classification(
      n_samples=1000,       # 1,000 customers
      n_features=4,         # 4 pieces of data per customer
      n_informative=3,      # 3 features actually matter for predicting churn
      n_redundant=1,        # 1 feature is a linear combination of the informative ones
      random_state=42,      # Ensures we get the same data every time
      weights=[0.7, 0.3]    # 70% of customers stay (0), 30% churn (1)
  )
  return X, y

def build_and_evaluate_model():
  """Trains an AI model to predict customer churn."""
  
  # 1. Get the data (X = Features/Data, y = Labels/Answers)
  X, y = generate_mock_data()

  # 2. Split the data into Training (80%) and Testing (20%) sets
  print("🪓 Splitting data into training and testing sets...")
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

  # 3. Initialize the Machine Learning Algorithm
  print("🧠 Training the Random Forest AI model...\n")
  model = RandomForestClassifier(n_estimators=100, random_state=42)
  
  # 4. Train the model (The AI learns the patterns here)
  model.fit(X_train, y_train)

  # 5. Make predictions on the hidden test set
  predictions = model.predict(X_test)

  # 6. Evaluate the results
  print("-" * 40)
  print("📊 AI PERFORMANCE REPORT")
  print("-" * 40)
  
  # Calculate simple accuracy
  acc = accuracy_score(y_test, predictions)
  print(f"🔹 Overall Accuracy: {acc * 100:.2f}%\n")
  
  # Print a detailed report (Precision, Recall, F1-Score)
  print("Detailed Metrics:")
  # target_names makes the output much easier for humans to read
  print(classification_report(y_test, predictions, target_names=['Stayed (0)', 'Churned (1)']))
  print("-" * 40)
  
  return model

def predict_new_customer(model):
  """Feeds brand new data into the trained model."""
  print("\n🔮 PREDICTING A NEW CUSTOMER'S BEHAVIOR")
  
  # Let's invent a new customer with 4 random feature values
  new_customer_data = np.array([[1.5, -0.2, 3.1, -1.0]])
  
  # Ask the AI for its prediction
  prediction = model.predict(new_customer_data)
  
  if prediction[0] == 1:
      print("🚨 WARNING: The AI predicts this customer will CHURN! Send them a discount code.")
  else:
      print("✅ SAFE: The AI predicts this customer will STAY. No action needed.")

if __name__ == "__main__":
  print("=== Welcome to the Machine Learning Churn Predictor ===")
  
  # Train the model and get the trained AI back
  trained_model = build_and_evaluate_model()
  
  # Use the trained AI to predict the future of a new customer
  predict_new_customer(trained_model)
  

Code Explanation (Scikit-Learn Concepts)

X and y: In Machine Learning, capital X conventionally represents your matrix of features (the data you know, like age, price, clicks), and lowercase y represents the target vector (the answer you are trying to predict, like True/False, Dog/Cat).

train_test_split: This is one of the most frequently used functions in data science. It shuffles your dataset and divides it into a training set and a test set. Without it, you would be grading the AI on the exact data it trained on: a model that simply memorized the answers (a problem called overfitting) would look perfect on paper and then fail in the real world.
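One way to see why the held-out set matters is to score the same model on both halves. This is a minimal sketch using the script's settings; the variable names train_acc and test_acc are illustrative, and the exact numbers will depend on your scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Score on data the model has already seen versus data it has never seen.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

print(f"Accuracy on training data: {train_acc * 100:.2f}%")
print(f"Accuracy on held-out data: {test_acc * 100:.2f}%")
```

The training score is typically close to perfect because the trees have effectively memorized those rows, which is why only the held-out score is a fair estimate of real-world performance.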

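RandomForestClassifier: This is an algorithm that builds many “decision trees” (like flowcharts), each trained on a random sample of your data; this script uses 100 of them (n_estimators=100). The trees all “vote” on whether a customer will churn. In scikit-learn the vote is an average of each tree's predicted probabilities, and the class with the highest average wins. Combining many trees usually makes the model more accurate and less prone to overfitting than any single tree.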
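The fitted forest can be inspected to see the individual trees and the averaged vote. The sketch below trains on the full synthetic dataset purely for illustration and reuses the made-up customer values from the script; estimators_, predict_proba, and feature_importances_ are standard attributes of a fitted RandomForestClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Each entry in estimators_ is one fitted decision tree.
n_trees = len(model.estimators_)
print("Number of trees in the forest:", n_trees)

# predict_proba returns the averaged vote: [P(stayed), P(churned)] per row.
new_customer = np.array([[1.5, -0.2, 3.1, -1.0]])
proba = model.predict_proba(new_customer)
print("Averaged vote [stayed, churned]:", proba[0])

# feature_importances_ shows how much each feature contributed (sums to 1).
importances = model.feature_importances_
print("Feature importances:", importances)
```

Reading the probability instead of the hard 0/1 label can be useful in practice, for example to contact only customers whose churn probability exceeds a threshold the marketing team chooses.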

.fit() and .predict(): This is the standard API of almost all Scikit-learn models. .fit(X, y) teaches the model. .predict(X) asks the trained model to guess the answers for new data.
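Because the API is shared, swapping in a different algorithm changes only the line that creates the model. The sketch below compares the script's Random Forest with LogisticRegression, which is not used in the original script and is included here only as an assumed second example; the scores dictionary is likewise illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Two very different algorithms behind the same .fit() / .predict() calls.
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # learn from the training set
    predictions = model.predict(X_test)    # guess labels for unseen rows
    scores[name] = accuracy_score(y_test, predictions)
    print(f"{name}: {scores[name] * 100:.2f}% accuracy")
```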

classification_report: Accuracy alone can be misleading. This function breaks down how good the AI is at specifically finding the “Churners” versus how good it is at finding the “Stayers.”
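To make that concrete, consider a "model" that never predicts churn at all. The sketch below uses scikit-learn's DummyClassifier as a baseline; it is not part of the original script and is shown only to illustrate how accuracy can hide a useless model on imbalanced data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# A baseline that always predicts the most common training label (0 = Stayed).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

baseline_acc = accuracy_score(y_test, baseline_pred)
# Recall for class 1: the share of real churners the model actually caught.
churn_recall = recall_score(y_test, baseline_pred)

print(f"Baseline accuracy: {baseline_acc * 100:.2f}%")
print(f"Churners caught (recall): {churn_recall * 100:.2f}%")
```

The baseline scores well above 50% accuracy simply because most customers stay, yet it catches no churners. The per-class recall in classification_report is what exposes this.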

Execution Guide

Install Requirements: Open your terminal and run: pip install scikit-learn numpy

Save the file: Create a new Python file named churn_predictor.py and paste the provided code.

Run the script: Execute the script from your terminal: python churn_predictor.py

Review Output: Watch the terminal as the script generates data, trains the model, prints the accuracy metrics (it should be roughly 80-90% accurate based on the synthetic data), and prints its prediction for the brand-new hypothetical customer.