Notebook 1
Exercise 1¶
Exploring Class Separability in 2D¶
Understanding how data is distributed is the first step before designing a network architecture. In this exercise, you will generate and visualize a two-dimensional dataset to explore how data distribution affects the complexity of the decision boundaries a neural network would need to learn.
Instructions¶
- Generate the Data: Create a synthetic dataset with a total of 400 samples, divided equally among 4 classes (100 samples each). Use a Gaussian distribution to generate the points for each class based on the following parameters:
- Class 0: Mean = [2, 3], Standard Deviation = [0.8, 2.5]
- Class 1: Mean = [5, 6], Standard Deviation = [1.2, 1.9]
- Class 2: Mean = [8, 1], Standard Deviation = [0.9, 0.9]
- Class 3: Mean = [15, 4], Standard Deviation = [0.5, 2.0]
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
class0 = {
"x": np.random.normal(2, 0.8, size=100),
"y": np.random.normal(3, 2.5, size=100)
}
class1 = {
"x": np.random.normal(5, 1.2, size=100),
"y": np.random.normal(6, 1.9, size=100)
}
class2 = {
"x": np.random.normal(8, 0.9, size=100),
"y": np.random.normal(1, 0.9, size=100)
}
class3 = {
"x": np.random.normal(15, 0.5, size=100),
"y": np.random.normal(4, 2.0, size=100)
}
plt.figure(figsize=(8,6))
plt.scatter(class0["x"], class0["y"], label="Class 0", alpha=0.6)
plt.scatter(class1["x"], class1["y"], label="Class 1", alpha=0.6)
plt.scatter(class2["x"], class2["y"], label="Class 2", alpha=0.6)
plt.scatter(class3["x"], class3["y"], label="Class 3", alpha=0.6)
plt.legend()
plt.xlabel("X1")
plt.ylabel("X2")
plt.title("Dados Sinteticos Aleatórios")
plt.show()
Plot the Data: Create a 2D scatter plot showing all the data points. Use a different color for each class to make them distinguishable.
Analyze and Draw Boundaries:
- Examine the scatter plot carefully. Describe the distribution and overlap of the four classes.
- Classes 0 and 1 overlap the most with each other, while remaining fairly distinguishable from the other two classes (2 and 3). The most segregated class is class 3, which sits at the far end of the X1 axis.
- Based on your visual inspection, could a simple, linear boundary separate all classes?
- I would argue that a single straight line can separate some classes from each other, but any one line would still leave different classes on the same side. We would need at minimum a second line to properly separate all four classes, so no single linear boundary suffices. A rough illustration with a linear classifier follows this list.
- On your plot, sketch the decision boundaries that you think a trained neural network might learn to separate these classes.
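As one way to visualize such boundaries, the sketch below fits a multinomial logistic regression (a purely linear model, used here only for illustration, not as the required solution) and shades its piecewise-linear decision regions. It assumes the class0 through class3 dicts from the cell above are in scope; X1_data and y1 are names introduced here to avoid clashing with later cells.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
# Stack the four classes into a single (400, 2) design matrix with labels
X1_data = np.vstack([np.column_stack((c["x"], c["y"]))
                     for c in (class0, class1, class2, class3)])
y1 = np.repeat([0, 1, 2, 3], 100)
clf = LogisticRegression(max_iter=1000).fit(X1_data, y1)
# Predict over a dense grid to shade the linear decision regions
xx, yy = np.meshgrid(
    np.linspace(X1_data[:, 0].min() - 1, X1_data[:, 0].max() + 1, 300),
    np.linspace(X1_data[:, 1].min() - 1, X1_data[:, 1].max() + 1, 300))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8,6))
plt.contourf(xx, yy, zz, alpha=0.2)
plt.scatter(X1_data[:, 0], X1_data[:, 1], c=y1, alpha=0.6)
plt.xlabel("X1")
plt.ylabel("X2")
plt.title("Approximate linear decision regions")
plt.show()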
Exercise 2¶
Non-Linearity in Higher Dimensions¶
Simple neural networks (like a Perceptron) can only learn linear boundaries. Deep networks excel when data is not linearly separable. This exercise challenges you to create and visualize such a dataset.
Instructions¶
Generate the Data: Create a dataset with 500 samples for Class A and 500 samples for Class B. Use a multivariate normal distribution with the following parameters:
Class A:
Mean vector:
$$\mu_A = [0, 0, 0, 0, 0]$$
Covariance matrix:
$$ \Sigma_A = \begin{pmatrix} 1.0 & 0.8 & 0.1 & 0.0 & 0.0 \\ 0.8 & 1.0 & 0.3 & 0.0 & 0.0 \\ 0.1 & 0.3 & 1.0 & 0.5 & 0.0 \\ 0.0 & 0.0 & 0.5 & 1.0 & 0.2 \\ 0.0 & 0.0 & 0.0 & 0.2 & 1.0 \end{pmatrix} $$
Class B:
Mean vector:
$$\mu_B = [1.5, 1.5, 1.5, 1.5, 1.5]$$
Covariance matrix:
$$ \Sigma_B = \begin{pmatrix} 1.5 & -0.7 & 0.2 & 0.0 & 0.0 \\ -0.7 & 1.5 & 0.4 & 0.0 & 0.0 \\ 0.2 & 0.4 & 1.5 & 0.6 & 0.0 \\ 0.0 & 0.0 & 0.6 & 1.5 & 0.3 \\ 0.0 & 0.0 & 0.0 & 0.3 & 1.5 \end{pmatrix} $$
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
np.random.seed(42)
mu_A = [0, 0, 0, 0, 0]
Sigma_A = np.array([
[1.0, 0.8, 0.1, 0.0, 0.0],
[0.8, 1.0, 0.3, 0.0, 0.0],
[0.1, 0.3, 1.0, 0.5, 0.0],
[0.0, 0.0, 0.5, 1.0, 0.2],
[0.0, 0.0, 0.0, 0.2, 1.0]
])
mu_B = [1.5, 1.5, 1.5, 1.5, 1.5]
Sigma_B = np.array([
[1.5, -0.7, 0.2, 0.0, 0.0],
[-0.7, 1.5, 0.4, 0.0, 0.0],
[0.2, 0.4, 1.5, 0.6, 0.0],
[0.0, 0.0, 0.6, 1.5, 0.3],
[0.0, 0.0, 0.0, 0.3, 1.5]
])
class_A = np.random.multivariate_normal(mu_A, Sigma_A, size=500)
class_B = np.random.multivariate_normal(mu_B, Sigma_B, size=500)
X = np.vstack((class_A, class_B))
y = np.array([0]*500 + [1]*500)
print("Dataset shape:", X.shape)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
plt.figure(figsize=(8,6))
plt.scatter(X_pca[y==0, 0], X_pca[y==0, 1], alpha=0.6, label="Class A")
plt.scatter(X_pca[y==1, 0], X_pca[y==1, 1], alpha=0.6, label="Class B")
plt.title("PCA de dados Sintéticos 5D - Redução para 2D")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.legend()
plt.show()
- Visualize the Data: Since you cannot directly plot a 5D graph, you must reduce its dimensionality.
- Use a technique like Principal Component Analysis (PCA) to project the 5D data down to 2 dimensions.
- Create a scatter plot of this 2D representation, coloring the points by their class (A or B).
- Analyze the Plots:
Based on your 2D projection, describe the relationship between the two classes.
- The two classes are heavily intertwined, making it extremely difficult to separate them cleanly. They do tend toward opposite sides of the plot along PC1, but there is noticeable overlap between them.
Discuss the linear separability of the data. Explain why this type of data structure poses a challenge for simple linear models and would likely require a multi-layer neural network with non-linear activation functions to be classified accurately.
- No single line (or, in the original 5D space, hyperplane) can segregate these two classes, because their distributions overlap. Any linear boundary traced between them would inevitably classify some points of class A as class B and vice versa. A multi-layer network with non-linear activations can instead learn a curved boundary that bends around the overlap region, which is why this data structure calls for a deeper model. A quick numeric check follows.
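As a quick sanity check of that claim (an optional sketch, not part of the required solution), the cell below fits a purely linear classifier on the raw 5D data and reports its training accuracy; a score well below 100% is consistent with the classes not being linearly separable. It assumes X and y from the PCA cell are still in scope.
from sklearn.linear_model import LogisticRegression
# Fit a linear model on the 5D points; training accuracy below 100%
# indicates that no single hyperplane separates the two classes perfectly
linear_clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Linear model training accuracy:", linear_clf.score(X, y))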
Exercise 3¶
Preparing Real-World Data for a Neural Network¶
This exercise uses a real dataset from Kaggle. Your task is to perform the necessary preprocessing to make it suitable for a neural network that uses the hyperbolic tangent (tanh) activation function in its hidden layers.
Instructions¶
Get the Data: Download the Spaceship Titanic dataset from Kaggle.
Describe the Data:
Briefly describe the dataset's objective (i.e., what does the `Transported` column represent?).
- We're trying to predict whether a passenger was transported to another dimension during the Spaceship Titanic's crash; the value indicating whether a passenger was transported is in the `Transported` column.
List the features and identify which are numerical (e.g., `Age`, `RoomService`) and which are categorical (e.g., `HomePlanet`, `Destination`). Investigate the dataset for missing values. Which columns have them, and how many?
Categorical Columns -
- HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.
- Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination - The planet the passenger will be debarking to.
- Name - The first and last names of the passenger.
- PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
Boolean Columns -
- Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
- VIP - Whether the passenger has paid for special VIP service during the voyage.
- CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
Numerical Columns -
- Age - The age of the passenger.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
import pandas as pd
df = pd.read_csv("../../../data/SpaceshipTitanic/train.csv")
df.head()
| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
null_counts = df.isnull().sum()
print("Null values per column:")
print(null_counts)
Null values per column:
PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
For these missing values, we must treat numerical and categorical features separately.
For numerical features, we use a SimpleImputer to fill null values. In this instance, I chose the median as the fill value, since it is robust to the outliers and right skew visible in the spending columns. After that, we apply a StandardScaler so all values are centered at 0 with unit variance.
For categorical features, we fill missing values with the most frequent class and pass the result to a OneHotEncoder so our categorical values become boolean features. In this case, this gave us 26 features, compared to the 14 columns we started with.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
X = df.drop(["PassengerId", "Name"], axis=1)
print("Original Data Without PassengerId and Name")
print(X.shape)
X[["Deck", "Num", "Side"]] = X["Cabin"].str.split("/", expand=True)
X.drop("Cabin", axis=1, inplace=True)
print("\nAfter splitting Cabin into Deck, Number and Side of ship (3 new columns)")
print(X.shape)
num_features = ["Age", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
cat_features = ["HomePlanet", "CryoSleep", "Destination", "VIP", "Deck", "Side"]
num_pipeline = Pipeline([
("imputer", SimpleImputer(strategy="median")),
("scaler", StandardScaler())
])
cat_pipeline = Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")),
("onehot", OneHotEncoder(handle_unknown="ignore"))
])
preprocessor = ColumnTransformer([
("num", num_pipeline, num_features),
("cat", cat_pipeline, cat_features)
])
X_processed = preprocessor.fit_transform(X)
print("\nAfter Pipelines")
print(X_processed.shape)
Original Data Without PassengerId and Name
(8693, 12)

After splitting Cabin into Deck, Number and Side of ship (3 new columns)
(8693, 14)

After Pipelines
(8693, 26)
- Preprocess the Data: Your goal is to clean and transform the data so it can be fed into a neural network. The `tanh` activation function produces outputs in the range `[-1, 1]`, so your input data should be scaled appropriately for stable training.
  - Handle Missing Data: Devise and implement a strategy to handle the missing values in all the affected columns. Justify your choices.
  - Encode Categorical Features: Convert categorical columns like `HomePlanet`, `CryoSleep`, and `Destination` into a numerical format. One-hot encoding is a good choice.
  - Normalize/Standardize Numerical Features: Scale the numerical columns (e.g., `Age`, `RoomService`, etc.). Since the `tanh` activation function is centered at zero and outputs values in `[-1, 1]`, Standardization (to mean 0, std 1) or Normalization to a `[-1, 1]` range are excellent choices. Implement one and explain why it is a good practice for training neural networks with this activation function. (A short sketch of the `[-1, 1]` alternative follows this list.)
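The pipeline above implements standardization. For comparison, here is a minimal sketch of the `[-1, 1]` alternative using MinMaxScaler; it assumes `df` from the earlier cells and is not wired into the preprocessing pipeline.
from sklearn.preprocessing import MinMaxScaler
# Map a numerical feature into tanh's output range [-1, 1];
# median-fill first so the scaler sees no NaNs
minmax = MinMaxScaler(feature_range=(-1, 1))
age_scaled = minmax.fit_transform(df[["Age"]].fillna(df["Age"].median()))
print("Age range after scaling:", age_scaled.min(), "to", age_scaled.max())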
- Visualize the Results:
  - Create histograms for one or two numerical features (like `FoodCourt` or `Age`) before and after scaling to show the effect of your transformation.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
fig, axes = plt.subplots(len(num_features), 2, figsize=(25, 4*len(num_features)))
for i in range(len(num_features)):
feature = num_features[i]
df[feature].hist(ax=axes[i, 0], bins=30, color="skyblue")
axes[i, 0].set_title(f"{feature} before scaling")
scaled = StandardScaler().fit_transform(df[[feature]].fillna(df[feature].median()))
pd.Series(scaled.ravel()).hist(ax=axes[i, 1], bins=30, color="salmon")
axes[i, 1].set_title(f"{feature} after standardization")
plt.tight_layout()
plt.show()
Notice that the shape of each distribution is preserved; standardization only shifts and rescales the x-axis so the values are centered at 0 with unit variance. A quick numeric check of this follows.
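As a numeric confirmation (an optional sketch, assuming `df` from the cells above): skewness is a shape statistic and is invariant under shifting and rescaling, so it should be unchanged by standardization, while the mean and standard deviation move to roughly 0 and 1.
from scipy.stats import skew
# Skewness measures distribution shape, so it survives standardization
age = df["Age"].fillna(df["Age"].median())
age_std = (age - age.mean()) / age.std()
print("Skewness before:", skew(age), "| after:", skew(age_std))
print("Mean after:", round(age_std.mean(), 6), "| Std after:", round(age_std.std(), 6))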