Hi everyone! 👋 I'm currently working through the Titanic dataset as part of the CampusX YouTube course, and I ran into an interesting issue involving OneHotEncoder and SimpleImputer that I finally understood after digging into the problem.

This blog is all about that journey — what caused the shape mismatch between training and testing data, and how I fixed it. If you're also working on preprocessing categorical variables in machine learning, this might save you a few hours of debugging!


🧠 The Setup

We’re using the Titanic dataset for classification (predicting survival), and like most people, I’m preprocessing the Sex and Embarked columns using:

  • SimpleImputer to handle missing values
  • OneHotEncoder to convert categorical variables into numerical format

Here’s a snippet of what I had:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

si_embarked = SimpleImputer(strategy='most_frequent')
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')
# (note: in scikit-learn >= 1.2 this argument is sparse_output, not sparse)

# Imputation
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = si_embarked.transform(x_test[['Embarked']])

# Encoding
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])

⚠️ The Issue

After running this, I checked the shapes:

x_train_embarked.shape  → (712, 4)
x_test_embarked.shape   → (179, 3)

Wait — what?!

Why does the train set have 4 columns, and the test set has only 3, even though I used handle_unknown='ignore'?

Wasn’t that supposed to handle unknown categories safely?


🕵️‍♂️ Investigating the Root Cause

I ran a few more checks and realized something sneaky:

x_train['Embarked'].isnull().sum()  # Output: 2

Hmm… that’s weird. I thought I had already imputed missing values. But then I remembered this part of my code:

x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])

Aha! 💡 I had imputed missing values into a new variable, x_train_embarked, but I never updated the original x_train DataFrame!

That means the original x_train['Embarked'] still had NaN values when I called .fit() on the encoder:

x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])

This caused the encoder to treat NaN as a valid category, resulting in 4 categories being learned:

['C', 'Q', 'S', NaN]

But the test data had no NaN values, only:

['C', 'Q', 'S']

So the test side ended up encoded against just these 3 categories, resulting in:

x_test_embarked.shape = (179, 3)

(To be precise: calling transform on an already-fitted encoder always returns one column per category learned during .fit(), and handle_unknown='ignore' only zero-fills unseen values. A 3-column test output like this means the test set was encoded with its own fit, e.g. an accidental fit_transform on x_test. Either way, the leftover NaNs in x_train are what made the two sides disagree.)
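To convince myself, I wrote a tiny standalone check. This is just a sketch on toy data (not the real Titanic split), and it assumes a recent scikit-learn (>= 1.0), where OneHotEncoder treats NaN as its own category; older versions simply raise an error on NaN instead:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for x_train['Embarked'] with a leftover NaN
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', np.nan]})

ohe = OneHotEncoder(handle_unknown='ignore')
encoded = ohe.fit_transform(train[['Embarked']])

print(ohe.categories_[0])  # 'C', 'Q', 'S' and nan -- 4 learned categories
print(encoded.shape)       # (5, 4)
```

One NaN in the fit data is enough to add a whole extra output column.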

✅ The Fix

The correct way was to assign the imputed values back to the original DataFrame:

x_train['Embarked'] = si_embarked.fit_transform(x_train[['Embarked']])
x_test['Embarked'] = si_embarked.transform(x_test[['Embarked']])

Now, when I fit the encoder:

ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])

✅ The shapes finally matched:

x_train_embarked.shape → (712, 3)
x_test_embarked.shape → (179, 3)
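By the way, a cleaner way to avoid this whole class of bug (a sketch I've since adopted, not from the course code) is to chain the imputer and encoder in a scikit-learn Pipeline, so there's no intermediate variable to forget to assign back:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

embarked_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# Toy stand-ins for the real Titanic split
x_train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', np.nan, 'S']})
x_test_toy = pd.DataFrame({'Embarked': ['Q', 'S', 'C']})

# fit_transform on train, transform on test -- imputation happens first,
# so the encoder never sees NaN
x_train_embarked = embarked_pipe.fit_transform(x_train_toy[['Embarked']])
x_test_embarked = embarked_pipe.transform(x_test_toy[['Embarked']])

print(x_train_embarked.shape)  # (5, 3)
print(x_test_embarked.shape)   # (3, 3)
```

The pipeline guarantees the two steps always run in the right order, on both train and test.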

📌 Bonus: What handle_unknown='ignore' Really Means

Here’s a quick visual explanation:

Imagine your training data had these categories:

['Red', 'Blue', 'Green']

And you encode them like this:

Red  Blue  Green
 1    0     0
 0    1     0
 0    0     1

Now your test data contains a new category: 'Yellow'.

If you use:

OneHotEncoder(handle_unknown='ignore')

Then the encoder will just assign all 0s for 'Yellow':

Red  Blue  Green
 0    0     0

✅ No crash. But also — you now have a row of all zeros!
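Here's a tiny runnable sketch of that behaviour (hypothetical color data, not the Titanic set; note that scikit-learn actually sorts learned categories alphabetically, so the real column order is Blue, Green, Red):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(pd.DataFrame({'Color': ['Red', 'Blue', 'Green']}))

# 'Yellow' was never seen during fit
row = ohe.transform(pd.DataFrame({'Color': ['Yellow']})).toarray()

print(ohe.categories_[0])  # learned categories (sorted): 'Blue', 'Green', 'Red'
print(row)                 # [[0. 0. 0.]] -- still 3 columns, all zeros
```

So the column count never changes at transform time; unknown values just vanish into a zero row.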


🎓 Final Takeaways

  • Always handle missing values before encoding
  • If you’re using SimpleImputer, assign the output back to your original DataFrame
  • handle_unknown='ignore' prevents crashes by zero-filling unseen categories at transform time, but the number of output columns is always fixed by what the encoder saw during .fit()

This was a great learning moment for me while working through the Titanic dataset with CampusX. Hope this helps anyone else facing the same mystery! 🧩

Let me know if you've run into similar preprocessing surprises!