Hi everyone! 👋 I'm currently working through the Titanic dataset as part of the CampusX YouTube course, and I ran into an interesting issue involving OneHotEncoder and SimpleImputer that I finally understood after digging into the problem.
This blog is all about that journey — what caused the shape mismatch between training and testing data, and how I fixed it. If you're also working on preprocessing categorical variables in machine learning, this might save you a few hours of debugging!
🧠 The Setup
We’re using the Titanic dataset for classification (predicting survival), and like most people, I’m preprocessing the Sex and Embarked columns using:
- SimpleImputer to handle missing values
- OneHotEncoder to convert categorical variables into numerical format
Here’s a snippet of what I had:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
si_embarked = SimpleImputer(strategy='most_frequent')
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')  # note: scikit-learn 1.2+ renames this parameter to sparse_output=False
# Imputation
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = si_embarked.transform(x_test[['Embarked']])
# Encoding
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])

⚠️ The Issue
After running this, I checked the shapes:
x_train_embarked.shape → (712, 4)
x_test_embarked.shape → (179, 3)

Wait, what?!
Why does the train set have 4 columns, and the test set has only 3, even though I used handle_unknown='ignore'?
Wasn’t that supposed to handle unknown categories safely?
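One debugging step that would have saved me time (it wasn't in my original notebook) is to print what the encoder actually learned, since OneHotEncoder stores its fitted categories in the categories_ attribute:

```python
# Sanity check (not part of my original notebook): inspect the fitted categories.
# OneHotEncoder keeps one array per encoded column in .categories_,
# and the length of each array is the number of output columns it produces.
print(ohe_embarked.categories_)
```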
🕵️‍♂️ Investigating the Root Cause
I ran a few more checks and realized something sneaky:
x_train['Embarked'].isnull().sum()   # Output: 2

Hmm… that’s weird. I thought I had already imputed the missing values. But then I remembered this part of my code:
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])

Aha! 💡 I had imputed the missing values into a new variable, x_train_embarked, but I never updated the original x_train DataFrame!
That means the original x_train['Embarked'] still had NaN values when I called .fit() on the encoder:
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])

This caused the encoder to treat NaN as a valid category, resulting in 4 categories being learned:
['C', 'Q', 'S', NaN]

But in the test data there were no NaN values, only:
['C', 'Q', 'S']

So the NaN category never appeared on the test side, and the train and test encodings ended up with different numbers of columns:

x_test_embarked.shape → (179, 3)

✅ The Fix
The correct way was to assign the imputed values back to the original DataFrame:
x_train['Embarked'] = si_embarked.fit_transform(x_train[['Embarked']])
x_test['Embarked'] = si_embarked.transform(x_test[['Embarked']])

Now, when I fit the encoder:
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])

✅ The shapes finally matched:
x_train_embarked.shape → (712, 3)
x_test_embarked.shape → (179, 3)
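Looking back, an even safer pattern is to chain the imputer and the encoder so the encoder can only ever see imputed values. Here's a minimal sketch using scikit-learn's Pipeline (my own refactor, not the exact course code):

```python
# A minimal sketch (my own refactor, not the exact course code):
# putting SimpleImputer and OneHotEncoder in one Pipeline means the encoder's
# fit() always runs on already-imputed data, so NaN can never sneak in.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

embarked_pipe = Pipeline(steps=[
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

x_train_embarked = embarked_pipe.fit_transform(x_train[['Embarked']])  # fit on train only
x_test_embarked = embarked_pipe.transform(x_test[['Embarked']])        # same columns, guaranteed

print(x_train_embarked.shape, x_test_embarked.shape)  # column counts now always match
```

The same idea scales to several columns at once with ColumnTransformer, and it removes the temptation to fit anything on the test set.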
📌 Bonus: What handle_unknown='ignore' Really Means
Here’s a quick visual explanation:
Imagine your training data had these categories:
['Red', 'Blue', 'Green']

And you encode them like this:
| Red | Blue | Green |
|---|---|---|
| 1 | 0 | 0 |
| 0 | 1 | 0 |
| 0 | 0 | 1 |
Now your test data contains a new category: 'Yellow'.
If you use:
OneHotEncoder(handle_unknown='ignore')

Then the encoder will just assign all 0s for 'Yellow':
| Red | Blue | Green |
|---|---|---|
| 0 | 0 | 0 |
✅ No crash. But also — you now have a row of all zeros!
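If you want to see this for yourself, here's a tiny self-contained demo (toy colour data, nothing to do with the Titanic set):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors_train = np.array([['Red'], ['Blue'], ['Green']])
colors_test = np.array([['Yellow']])  # category never seen during fit

# use sparse=False instead of sparse_output=False on scikit-learn < 1.2
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
ohe.fit(colors_train)

print(ohe.categories_)             # [array(['Blue', 'Green', 'Red'], dtype='<U5')]
print(ohe.transform(colors_test))  # [[0. 0. 0.]]  -> all zeros, no crash
```

Note that the columns come out in alphabetical order (Blue, Green, Red), because OneHotEncoder sorts the categories it finds during fit().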
🎓 Final Takeaways
- Always handle missing values before encoding
- If you’re using SimpleImputer, assign the output back to your original DataFrame
- handle_unknown='ignore' prevents errors, but it doesn’t fix shape mismatches caused by unseen categories during .fit()
This was a great learning moment for me while working through the Titanic dataset with CampusX. Hope this helps anyone else facing the same mystery! 🧩
Let me know if you've run into similar preprocessing surprises!