Hi everyone! 👋 I'm currently working through the Titanic dataset as part of the CampusX YouTube course, and I ran into an interesting issue involving OneHotEncoder and SimpleImputer that I finally understood after digging into the problem.
This blog is all about that journey — what caused the shape mismatch between training and testing data, and how I fixed it. If you're also working on preprocessing categorical variables in machine learning, this might save you a few hours of debugging!
🧠 The Setup
We’re using the Titanic dataset for classification (predicting survival), and like most people, I’m preprocessing the Sex and Embarked columns using:
- SimpleImputer to handle missing values
- OneHotEncoder to convert categorical variables into numerical format
Here’s a snippet of what I had:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
si_embarked = SimpleImputer(strategy='most_frequent')
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')
# Imputation
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = si_embarked.transform(x_test[['Embarked']])
# Encoding
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])
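(A quick version note: scikit-learn 1.2 renamed the sparse parameter to sparse_output, and 1.4 removed the old name entirely. If sparse=False throws a TypeError on your setup, define the encoder like this instead:)
ohe_embarked = OneHotEncoder(sparse_output=False, handle_unknown='ignore')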
⚠️ The Issue
After running this, I checked the shapes:
x_train_embarked.shape → (712, 4)
x_test_embarked.shape → (179, 3)
Wait — what?!
Why does the train set have 4 columns and the test set only 3, even though I used handle_unknown='ignore'? Wasn't that supposed to handle unknown categories safely?
🕵️♂️ Investigating the Root Cause
I ran a few more checks and realized something sneaky:
x_train['Embarked'].isnull().sum() # Output: 2
Hmm… that’s weird. I thought I had already imputed missing values. But then I remembered this part of my code:
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])
Aha! 💡 I had imputed missing values into a new variable, x_train_embarked, but I never updated the original x_train DataFrame!
That means the original x_train['Embarked'] still had NaN values when I called .fit() on the encoder:
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
This caused the encoder to treat NaN as a valid category, so 4 categories were learned:
['C', 'Q', 'S', NaN]
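If you want to confirm this yourself, a quick look at the encoder's learned categories (my own sanity check, not part of the original code) makes it obvious:
print(ohe_embarked.categories_)
# Roughly: [array(['C', 'Q', 'S', nan], dtype=object)]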
But in the test data, there were no NaN values, only:
['C', 'Q', 'S']
So the mismatch came from that extra NaN category learned during fit(). handle_unknown='ignore' only covers new categories that show up at transform time; it can't do anything about an extra category the encoder picked up while fitting. The result:
x_test_embarked.shape = (179, 3)
✅ The Fix
The correct way was to assign the imputed values back to the original DataFrame:
x_train['Embarked'] = si_embarked.fit_transform(x_train[['Embarked']])
x_test['Embarked'] = si_embarked.transform(x_test[['Embarked']])
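And just to be sure the leak is actually plugged, a quick check (again, my own addition) should now report zero missing values:
print(x_train['Embarked'].isnull().sum())  # 0, the NaNs are gone this time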
Now, when I fit the encoder:
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])
✅ The shapes finally matched:
x_train_embarked.shape → (712, 3)
x_test_embarked.shape → (179, 3)
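Side note: one way to avoid this whole class of bug is to chain the imputer and encoder in a Pipeline and apply it per column with a ColumnTransformer, so there is no intermediate result to forget to write back. Here's a minimal sketch of that idea (my own refactor, not the course code, and it assumes the usual Titanic column names):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Impute and encode 'Embarked' as one chained step
embarked_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore')),  # sparse_output=False on newer scikit-learn
])

preprocess = ColumnTransformer(transformers=[
    ('embarked', embarked_pipe, ['Embarked']),
    ('sex', OneHotEncoder(sparse=False, handle_unknown='ignore'), ['Sex']),
], remainder='drop')  # or 'passthrough' to keep the remaining columns as-is

x_train_trf = preprocess.fit_transform(x_train)
x_test_trf = preprocess.transform(x_test)  # always the same number of columns as the training output
Because fit_transform and transform are called on the same fitted objects in the right order, the train and test column counts can't drift apart.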
📌 Bonus: What handle_unknown='ignore' Really Means
Here’s a quick visual explanation:
Imagine your training data had these categories:
['Red', 'Blue', 'Green']
And you encode them like this:
| Red | Blue | Green |
|-----|------|-------|
| 1   | 0    | 0     |
| 0   | 1    | 0     |
| 0   | 0    | 1     |
Now your test data contains a new category: 'Yellow'. If you use OneHotEncoder(handle_unknown='ignore'), the encoder will just assign all 0s for 'Yellow':
| Red | Blue | Green |
|-----|------|-------|
| 0   | 0    | 0     |
✅ No crash. But also — you now have a row of all zeros!
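If you want to see that behaviour in isolation, here's a tiny standalone example (toy colour data I made up, nothing to do with Titanic; swap sparse_output for sparse on older scikit-learn):
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors_train = np.array([['Red'], ['Blue'], ['Green']])
colors_test = np.array([['Yellow']])  # a category the encoder never saw during fit

ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe.fit(colors_train)
print(ohe.transform(colors_test))
# [[0. 0. 0.]]  -> 'Yellow' becomes a row of all zeros instead of raising an error
Nothing crashes, but the model gets no information about that row's category, so it's worth checking how often this actually happens in your test data.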
🎓 Final Takeaways
- Always handle missing values before encoding
- If you're using SimpleImputer, assign the output back to your original DataFrame
- handle_unknown='ignore' prevents errors, but doesn't fix shape mismatches caused by extra categories learned during .fit()
This was a great learning moment for me while working through the Titanic dataset with CampusX. Hope this helps anyone else facing the same mystery! 🧩
Let me know if you've run into similar preprocessing surprises!