Raw data rarely arrives in a form ready for analysis or machine learning. Datasets contain missing values, inconsistent formatting, categorical variables that need encoding, and features that require scaling. Data preprocessing turns messy real-world data into a clean, consistent format ready for analysis.
This guide covers essential data preprocessing techniques with Python. You will learn practical methods for handling common data quality problems, preparing features for models, and building reproducible preprocessing pipelines.
Why Data Preprocessing Matters
Machine learning models learn patterns from data. When data contains errors, inconsistencies, or missing values, model performance suffers. The quality of your output depends entirely on the quality of your input.
Common data quality problems:
Missing values occur when data collection fails or fields are not applicable. Incorrect data types prevent proper analysis and modeling. Inconsistent formatting creates noise that obscures real patterns. Categorical variables require encoding before most algorithms can use them. Features with different scales bias algorithms toward larger values.
Impact on model performance:
A dataset with 10% missing values, if handled incorrectly, can reduce model accuracy by 5-15%. Unscaled features can make gradient descent converge slowly or fail entirely. Rare categories can cause overfitting or underfitting depending on how they are encoded.
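The scale problem is easy to see with a toy calculation (hypothetical numbers): when features have very different ranges, Euclidean distance is dominated by the large-range feature, so distance-based methods like KNN or k-means effectively ignore everything else.

```python
import numpy as np

# Two customers: ages differ a lot, incomes differ only slightly
a = np.array([25, 50_000])   # [age, income]
b = np.array([60, 50_100])

raw_dist = np.linalg.norm(a - b)

# After min-max scaling both features to [0, 1] (assumed ranges:
# age 18-80, income 20k-200k), the age gap is no longer drowned out
mins = np.array([18, 20_000])
maxs = np.array([80, 200_000])
a_s = (a - mins) / (maxs - mins)
b_s = (b - mins) / (maxs - mins)
scaled_dist = np.linalg.norm(a_s - b_s)

print(f"raw distance:    {raw_dist:.1f}")    # dominated by the income term
print(f"scaled distance: {scaled_dist:.3f}")  # driven by the age gap
```

Before scaling, the 100-unit income difference contributes almost the entire distance; after scaling, the 35-year age difference does.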
Handling Missing Values
Missing values are the most common data quality problem. Understanding why data is missing helps you choose the right strategy.
Types of Missing Data
Missing Completely at Random (MCAR): The probability of missingness is the same for all observations. Nothing systematic causes the missing values. Example: a sensor randomly fails and records no data.
Missing at Random (MAR): The probability of missingness depends on observed data but not unobserved data. Example: younger users are less likely to provide their birth date.
Missing Not at Random (MNAR): The probability of missingness depends on unobserved data. Example: users with low income are more likely to skip the income field.
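The MAR case can be made concrete with a small simulation (entirely synthetic data): missingness in `birth_date` depends on an observed column (`age_group`), not on the missing value itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
df = pd.DataFrame({'age_group': rng.choice(['young', 'old'], size=n)})

# Young users skip the field 60% of the time, old users only 10%
p_missing = df['age_group'].map({'young': 0.6, 'old': 0.1})
df['birth_date'] = np.where(rng.random(n) < p_missing, np.nan, 1990.0)

# Missingness rates differ sharply by an observed variable -> MAR
rates = df.groupby('age_group')['birth_date'].apply(lambda s: s.isna().mean())
print(rates)
```

Because the mechanism depends only on observed data, group-aware imputation (e.g. filling within each `age_group`) is a reasonable strategy here.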
Detection and Analysis
import pandas as pd
import numpy as np
# Load dataset
df = pd.read_csv("customer_data.csv")
# Find missing values
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'count': missing,
    'percentage': missing_pct
})
missing_df = missing_df[missing_df['count'] > 0].sort_values('percentage', ascending=False)
print("Columns with missing values:")
print(missing_df)
# Analyze patterns
print("\nMissing by category:")
print(df.groupby('region')['income'].apply(lambda x: x.isnull().mean()))
Strategies for Different Columns
Drop columns with too many missing values:
# Drop columns with more than 50% missing
threshold = len(df) * 0.5
df_cleaned = df.dropna(axis=1, thresh=threshold)
# Drop rows with any missing values (use carefully)
df_complete = df.dropna()
Fill with statistical measures:
# Numeric columns - fill with median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())
# Categorical columns - fill with mode
df['city'] = df['city'].fillna(df['city'].mode()[0])
# Time series - forward fill (fillna(method='ffill') is deprecated)
df['temperature'] = df['temperature'].ffill()
# Group-specific imputation
df['income'] = df.groupby('education')['income'].transform(
    lambda x: x.fillna(x.median())
)
Advanced imputation with sklearn:
from sklearn.impute import SimpleImputer, KNNImputer
# Simple imputation (median/mode)
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')
# KNN imputation (considers similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df.select_dtypes(include=['number'])),
    columns=df.select_dtypes(include=['number']).columns
)
# Iterative imputation (model-based)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10
)
df_imputed = pd.DataFrame(
    imputer.fit_transform(df),
    columns=df.columns
)
Encoding Categorical Variables
Most machine learning algorithms require numerical input, so categorical variables need encoding before modeling.
Label Encoding
from sklearn.preprocessing import LabelEncoder
# Simple label encoding (sklearn intends LabelEncoder for target labels;
# for feature columns, OrdinalEncoder is the pipeline-friendly equivalent)
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# Preserve mappings for later use
category_mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(category_mapping)  # e.g. {'A': 0, 'B': 1, 'C': 2}
One-Hot Encoding
# Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')
# sklearn OneHotEncoder (better for pipelines)
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['category']])
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(['category'])
)
df = pd.concat([df, encoded_df], axis=1)
Target Encoding
# Mean encoding (use with caution to prevent leakage)
target_means = df.groupby('category')['target'].mean()
df['category_target_encoded'] = df['category'].map(target_means)
# Smoothed target encoding: shrink each category mean toward the global mean
global_mean = df['target'].mean()
smooth_factor = 10
stats = df.groupby('category')['target'].agg(['sum', 'count'])
smoothed = (stats['sum'] + smooth_factor * global_mean) / (stats['count'] + smooth_factor)
df['category_smoothed'] = df['category'].map(smoothed)
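The leakage warning above is worth making concrete: if a row's own target contributes to the mean it is encoded with, the feature smuggles label information into training. A common remedy is out-of-fold encoding, where each row is encoded using means computed from the other folds only. A self-contained sketch (the helper name and toy data are illustrative, not from the original):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_target_encode(df, cat_col, target_col, n_splits=5, seed=42):
    """Encode each row using target means computed without that row's fold."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target_col].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        fold_means = df.iloc[fit_idx].groupby(cat_col)[target_col].mean()
        # Categories unseen in the fit folds fall back to the global mean
        encoded.iloc[enc_idx] = (
            df.iloc[enc_idx][cat_col].map(fold_means).fillna(global_mean).values
        )
    return encoded

# Toy example: category 'A' always has target 1, 'B' is mixed
toy = pd.DataFrame({'category': list('ABAB' * 25),
                    'target': np.tile([1, 0, 1, 1], 25)})
toy['cat_te'] = kfold_target_encode(toy, 'category', 'target')
print(toy.head())
```

Smoothing and out-of-fold encoding combine naturally; libraries such as category_encoders package both.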
Ordinal Encoding
# When categories have natural order
size_mapping = {'small': 0, 'medium': 1, 'large': 2, 'extra_large': 3}
df['size_ordinal'] = df['size'].map(size_mapping)
Feature Scaling
Many algorithms perform better when features have similar scales. Choose a scaling method based on your data's distribution.
StandardScaler (Z-score normalization)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['value_scaled'] = scaler.fit_transform(df[['value']])
# Access parameters
mean = scaler.mean_[0] # Mean used
std = scaler.scale_[0] # Standard deviation used
MinMaxScaler (0-1 range)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df['value_normalized'] = scaler.fit_transform(df[['value']])
# Custom range
scaler = MinMaxScaler(feature_range=(0, 10))
df['value_custom'] = scaler.fit_transform(df[['value']])
RobustScaler (outlier-resistant)
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
df['value_robust'] = scaler.fit_transform(df[['value']])
When to Use Each Scaler
| Scaler | Best For | Sensitive to Outliers |
|---|---|---|
| StandardScaler | Normal distributions | Yes |
| MinMaxScaler | Bounded distributions | Yes |
| RobustScaler | Presence of outliers | No |
| PowerTransformer | Skewed distributions | Moderate |
| QuantileTransformer | Uniform distributions | No |
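The outlier row of the table is easy to verify on synthetic data: a single extreme value inflates the mean and standard deviation that StandardScaler uses, while RobustScaler's median and IQR barely move.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

values = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])  # 1000 is an outlier

std_scaled = StandardScaler().fit_transform(values)
rob_scaled = RobustScaler().fit_transform(values)

# StandardScaler squashes the four normal points into a tiny range;
# RobustScaler preserves their spacing and isolates the outlier.
print("standard:", std_scaled.ravel().round(2))
print("robust:  ", rob_scaled.ravel().round(2))
```

After StandardScaler the four normal points all land near -0.5; after RobustScaler they span -1 to 0.5 and only the outlier is extreme.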
Handling Outliers
Outliers can skew statistics and model performance. Handle them based on domain knowledge and analysis goals.
Detection Methods
import numpy as np
# Z-score method
def detect_zscore_outliers(series, threshold=3):
    z_scores = (series - series.mean()) / series.std()
    return np.abs(z_scores) > threshold

# IQR method
def detect_iqr_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return (series < lower) | (series > upper)
# Isolation Forest
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_labels = iso_forest.fit_predict(df[['value']])
df['is_outlier'] = outlier_labels == -1
Treatment Options
# Option 1: Remove outliers
df_cleaned = df[~detect_zscore_outliers(df['value'])]
# Option 2: Cap values
upper_cap = df['value'].quantile(0.95)
lower_cap = df['value'].quantile(0.05)
df['value_capped'] = df['value'].clip(lower_cap, upper_cap)
# Option 3: Transform (log for right-skewed data)
df['value_log'] = np.log1p(df['value']) # log(1+x) to handle zeros
# Option 4: Bin into categories
df['value_bin'] = pd.qcut(df['value'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])
Building Preprocessing Pipelines
Combine multiple preprocessing steps into reproducible pipelines.
sklearn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define preprocessing for different column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['city', 'education', 'occupation']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Complete pipeline with model
from sklearn.ensemble import RandomForestClassifier
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
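Fitting such a pipeline is one call: every preprocessing step is fitted on the training split only and replayed on the test split, which is exactly what prevents leakage. A self-contained sketch on synthetic data (the column values and generated target are illustrative; the pipeline is rebuilt here so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same structure as the pipeline above
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['city', 'education', 'occupation']
preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     categorical_features),
])
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

# Synthetic stand-in data with the expected columns
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'age': rng.integers(18, 80, n).astype(float),
    'income': rng.normal(50_000, 15_000, n),
    'credit_score': rng.normal(650, 80, n),
    'city': rng.choice(['NYC', 'LA', 'Chicago'], n),
    'education': rng.choice(['HS', 'BSc', 'MSc'], n),
    'occupation': rng.choice(['eng', 'sales', 'ops'], n),
})
y = (df['credit_score'] > 650).astype(int)

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42)
# Imputation, scaling, and encoding all happen inside fit/score,
# using statistics learned from the training split only
model_pipeline.fit(X_train, y_train)
accuracy = model_pipeline.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

The same fitted pipeline object can be pickled and reused at inference time, so training and serving share identical preprocessing.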
Custom Transformers
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DateTimeFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract year, month, day-of-week, and hour from a datetime column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        dt = pd.to_datetime(pd.DataFrame(X)[self.column])
        # Stack the components into an (n_samples, 4) numeric array
        return np.column_stack([
            dt.dt.year, dt.dt.month, dt.dt.dayofweek, dt.dt.hour
        ])
# Usage
custom_transformer = DateTimeFeatureExtractor(column='timestamp')
ColumnTransformer for Mixed Data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd
# Define columns for each transformation
numeric_cols = ['age', 'income', 'score']
categorical_cols = ['category', 'region']
# Build preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])
# Fit and transform
X_processed = preprocessor.fit_transform(X)
# Get feature names
num_feature_names = numeric_cols
cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
all_features = list(num_feature_names) + list(cat_feature_names)
Text Data Preprocessing
Text requires special preprocessing before analysis or modeling.
Basic Text Cleaning
import re
import string
def clean_text(text):
    if pd.isna(text):
        return ""
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text
df['cleaned_text'] = df['raw_text'].apply(clean_text)
Tokenization and Stopword Removal
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
stop_words = set(stopwords.words('english'))
def tokenize_and_filter(text):
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in stop_words and t.isalpha()]
df['tokens'] = df['cleaned_text'].apply(tokenize_and_filter)
TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(
    max_features=1000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,
    max_df=0.95
)
tfidf_matrix = tfidf.fit_transform(df['cleaned_text'])
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf.get_feature_names_out()
)
Data Validation After Preprocessing
Always validate your preprocessed data before modeling.
def validate_preprocessed_data(df):
    """Validate that preprocessed data matches the expected schema."""
    checks = []
    # Check for remaining missing values
    missing = df.isnull().sum().sum()
    checks.append(('no_missing_values', missing == 0))
    # Check data types
    expected_types = {
        'age': 'float64',
        'income': 'float64',
        'category_encoded': 'int64'
    }
    for col, expected in expected_types.items():
        actual = str(df[col].dtype)
        checks.append((f'{col}_correct_type', actual == expected))
    # Check value ranges
    checks.append(('age_in_range', df['age'].between(0, 120).all()))
    checks.append(('income_non_negative', (df['income'] >= 0).all()))
    # Report results
    for check_name, result in checks:
        status = "PASS" if result else "FAIL"
        print(f"{check_name}: {status}")
    return all(result for _, result in checks)

is_valid = validate_preprocessed_data(df_processed)
Summary
Data preprocessing turns raw data into analysis-ready format. Key techniques covered in this guide:
Missing value handling requires understanding why data is missing. Choose strategies based on the missing data type and business context. Imputation with median, mode, or advanced methods like KNN can fill gaps reasonably.
Categorical encoding converts non-numeric data into model-compatible formats. One-hot encoding works for nominal categories. Target encoding captures category-target relationships but requires careful handling to prevent leakage.
Feature scaling normalizes input ranges for better algorithm performance. StandardScaler works for normal distributions. RobustScaler handles outliers better. Choose based on your data characteristics.
Building pipelines ensures reproducibility. sklearn Pipeline and ColumnTransformer combine preprocessing steps. Custom transformers extend functionality for specific needs.
Outlier treatment depends on your goals. Sometimes outliers contain useful information. Sometimes they obscure patterns. Analyze before deciding.
Text preprocessing requires specialized techniques. Tokenization, stopword removal, and vectorization prepare text for analysis.
For more data science content, check our guides on Pandas memory optimization and machine learning fundamentals.