Scikit-learn Exercises
Problem 1: Many machine learning models perform poorly when the features are on very different scales (e.g., Age ranging from 20 to 60, but Salary ranging from $40,000 to $120,000). Create a small dataset and use StandardScaler to standardize it, so that each feature has a mean of 0 and a variance of 1 and the features are on a level playing field.
from sklearn.preprocessing import StandardScaler
import numpy as np
# 1. Create dummy data: [Age, Salary]
# Notice how Salary is much larger than Age
data = np.array([
    [25, 50000],
    [30, 80000],
    [45, 120000],
    [22, 45000],
])
# 2. Initialize the scaler
scaler = StandardScaler()
# 3. Fit the scaler to the data and transform it
scaled_data = scaler.fit_transform(data)
print("Original Data:\n", data)
print("\nScaled Data (Mean ~0, Variance 1):\n", scaled_data)
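To see what the scaler is doing, the transformation can be reproduced by hand. The following is a minimal sketch, not part of the original solution: it assumes the same Age/Salary data and compares StandardScaler's output against (x - mean) / std computed per column with NumPy. The variable names `manual`, `matches`, and `restored` are introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same dummy data as in the solution: [Age, Salary]
data = np.array([
    [25, 50000],
    [30, 80000],
    [45, 120000],
    [22, 45000],
], dtype=float)

scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

# Manual standardization, column by column.
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

matches = np.allclose(scaled_data, manual)
column_means = scaled_data.mean(axis=0)
column_stds = scaled_data.std(axis=0)

# inverse_transform undoes the scaling and recovers the original values
restored = scaler.inverse_transform(scaled_data)

print("Scaler matches manual formula:", matches)
print("Column means after scaling:", column_means)
print("Column standard deviations after scaling:", column_stds)
```

The learned statistics are stored on the fitted scaler as `scaler.mean_` and `scaler.scale_`, so new rows can later be scaled with `scaler.transform` using the same mean and standard deviation.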
Problem 2: Create a simple 2D array representing “Years of Experience” and a 1D array representing “Salary”. Train a LinearRegression model to learn the relationship between the two, and then predict the expected salary for a new employee with exactly 5 years of experience.
from sklearn.linear_model import LinearRegression
import numpy as np
# 1. Create the data
# Scikit-learn expects X (features) to be a 2D array, and y (target) to be 1D
experience = np.array([[1], [2], [3], [4], [6]])
salary = np.array([40000, 50000, 60000, 70000, 90000])
# 2. Initialize the model
model = LinearRegression()
# 3. Train (fit) the model
model.fit(experience, salary)
# 4. Predict for a new value (5 years)
new_employee = np.array([[5]])
predicted_salary = model.predict(new_employee)
print(f"Predicted Salary for 5 years of experience: ${predicted_salary[0]:,.2f}")
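It can help to look at what the model actually learned. The sketch below, which is an illustration rather than part of the original solution, refits the same data and reads the fitted slope and intercept from `coef_` and `intercept_`, then rebuilds the 5-year prediction by hand. The names `slope`, `intercept`, and `manual_prediction` are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same data as in the solution
experience = np.array([[1], [2], [3], [4], [6]])
salary = np.array([40000, 50000, 60000, 70000, 90000])

model = LinearRegression()
model.fit(experience, salary)

# A fitted linear model is just: salary = intercept + slope * years
slope = model.coef_[0]
intercept = model.intercept_

# Rebuild the prediction for 5 years by hand and compare with predict()
manual_prediction = intercept + slope * 5
library_prediction = model.predict(np.array([[5]]))[0]

print(f"Slope (salary increase per year): {slope:,.2f}")
print(f"Intercept (salary at 0 years): {intercept:,.2f}")
print(f"Manual prediction: {manual_prediction:,.2f}")
print(f"predict() result: {library_prediction:,.2f}")
```

Because the sample data lies exactly on a straight line (each extra year adds 10,000 to a base of 30,000), the fitted slope and intercept recover those values up to floating-point rounding, and the prediction for 5 years is 80,000.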
Problem 3: Load scikit-learn’s built-in Iris flower dataset. You need to split this data so that 80% is used to train a K-Nearest Neighbors (KNN) classification model, and 20% is held back to test it. Predict the species of the test flowers and calculate the model’s accuracy.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# 1. Load the dataset (X = measurements, y = flower species)
X, y = load_iris(return_X_y=True)
# 2. Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Initialize and train the model
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
# 4. Make predictions on the unseen test data
predictions = model.predict(X_test)
# 5. Evaluate how many it got right
accuracy = accuracy_score(y_test, predictions)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
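The accuracy figure is simply the fraction of test flowers whose predicted species matches the true species. As a minimal sketch under the same settings as the solution (80/20 split, `random_state=42`, `n_neighbors=3`), the code below checks the split sizes and recomputes the accuracy by hand; the names `n_train`, `n_test`, and `manual_accuracy` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Iris has 150 samples, so an 80/20 split gives 120 training and 30 test rows
n_train, n_test = len(X_train), len(X_test)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy by hand: the share of predictions equal to the true labels
manual_accuracy = np.mean(predictions == y_test)
library_accuracy = accuracy_score(y_test, predictions)

print(f"Training samples: {n_train}, test samples: {n_test}")
print(f"Manual accuracy: {manual_accuracy * 100:.2f}%")
print(f"accuracy_score: {library_accuracy * 100:.2f}%")
```

With only 30 test flowers, each misclassified flower changes the accuracy by about 3.3 percentage points, so the result from a single split should be read as a rough estimate.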