Python Data Preprocessing: Complete Guide to Clean Data

Master data preprocessing techniques with Python. Learn to handle missing values, encode categories, scale features, and prepare datasets for machine learning with practical examples.

Raw data rarely arrives in a form ready for analysis or machine learning. Datasets contain missing values, inconsistent formatting, categorical variables that need encoding, and features that require scaling. Data preprocessing turns messy real-world data into a clean, consistent format ready for analysis.

This guide covers essential data preprocessing techniques with Python. You will learn practical methods for handling common data quality problems, preparing features for models, and building reproducible preprocessing pipelines.

Why Data Preprocessing Matters

Machine learning models learn patterns from data. When data contains errors, inconsistencies, or missing values, model performance suffers. The quality of your output depends entirely on the quality of your input.

Common data quality problems:

Missing values occur when data collection fails or fields are not applicable. Incorrect data types prevent proper analysis and modeling. Inconsistent formatting creates noise that obscures real patterns. Categorical variables require encoding before most algorithms can use them. Features with different scales bias algorithms toward larger values.
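As a concrete sketch, a few of these problems can be surfaced directly with pandas. The frame and values below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

# A tiny frame exhibiting several of the problems above (values are made up).
df = pd.DataFrame({
    'age': ['34', '29', 'unknown', '41'],        # numbers stored as strings
    'city': ['NYC', 'nyc', 'New York', 'LA'],    # inconsistent formatting
    'income': [52_000, np.nan, 61_000, 58_000],  # missing value
})

# dtype inspection reveals the mis-typed column...
print(df.dtypes['age'])  # object

# ...and coercing it to numeric surfaces the unparseable entry as NaN,
# folding it into the normal missing-value workflow.
df['age'] = pd.to_numeric(df['age'], errors='coerce')
print(df['age'].isna().sum())  # 1
```

A quick pass like this early on tells you which of the techniques below you will actually need.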

Impact on model performance:

Mishandled missing values can meaningfully reduce model accuracy, with the damage depending on how much data is missing and why. Unscaled features can make gradient descent converge slowly or fail entirely. Rare categories can cause overfitting or underfitting depending on how they are encoded.

Handling Missing Values

Missing values are the most common data quality problem. Understanding why data is missing helps you choose the right strategy.

Types of Missing Data

Missing Completely at Random (MCAR): The probability of missingness is the same for all observations. Nothing systematic causes the missing values. Example: a sensor randomly fails and records no data.

Missing at Random (MAR): The probability of missingness depends on observed data but not unobserved data. Example: younger users are less likely to provide their birth date.

Missing Not at Random (MNAR): The probability of missingness depends on unobserved data. Example: users with low income are more likely to skip the income field.
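To make the MCAR/MAR distinction concrete, here is a small simulation sketch. The data, rates, and thresholds are entirely synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000
df = pd.DataFrame({
    'age': rng.integers(18, 80, n),
    'income': rng.normal(50_000, 15_000, n),
})

# MCAR: every row has the same 10% chance of a missing income.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, 'income'] = np.nan

# MAR: missingness depends on an observed column -- younger respondents
# skip the income field far more often than older ones.
mar = df.copy()
p_missing = np.where(mar['age'] < 30, 0.30, 0.05)
mar.loc[rng.random(n) < p_missing, 'income'] = np.nan

# The missing rate is roughly flat under MCAR but varies with age under MAR.
young = mar['age'] < 30
print(mar.loc[young, 'income'].isna().mean())   # close to 0.30
print(mar.loc[~young, 'income'].isna().mean())  # close to 0.05
```

Comparing missing rates across observed groups, as in the last two lines, is a practical first test for MAR in real data; MNAR generally cannot be confirmed from the data alone.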

Detection and Analysis

import pandas as pd
import numpy as np

# Load dataset
df = pd.read_csv("customer_data.csv")

# Find missing values
missing = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'count': missing,
    'percentage': missing_pct
})
missing_df = missing_df[missing_df['count'] > 0].sort_values('percentage', ascending=False)

print("Columns with missing values:")
print(missing_df)

# Analyze patterns
print("\nMissing by category:")
print(df.groupby('region')['income'].apply(lambda x: x.isnull().mean()))

Strategies for Different Columns

Drop columns with too many missing values:

# Drop columns with more than 50% missing
threshold = int(len(df) * 0.5)  # thresh expects an integer count of non-NA values
df_cleaned = df.dropna(axis=1, thresh=threshold)

# Drop rows with any missing values (use carefully)
df_complete = df.dropna()

Fill with statistical measures:

# Numeric columns - fill with median (robust to outliers)
df['age'] = df['age'].fillna(df['age'].median())

# Categorical columns - fill with mode
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Time series - forward fill
df['temperature'] = df['temperature'].ffill()  # fillna(method=...) is deprecated

# Group-specific imputation
df['income'] = df.groupby('education')['income'].transform(
    lambda x: x.fillna(x.median())
)

Advanced imputation with sklearn:

from sklearn.impute import SimpleImputer, KNNImputer

# Simple imputation (median/mode)
numeric_imputer = SimpleImputer(strategy='median')
categorical_imputer = SimpleImputer(strategy='most_frequent')

# KNN imputation (considers similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(
    knn_imputer.fit_transform(df.select_dtypes(include=['number'])),
    columns=df.select_dtypes(include=['number']).columns
)

# Iterative imputation (model-based)
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10
)
num_cols = df.select_dtypes(include=['number']).columns
df_imputed = pd.DataFrame(
    imputer.fit_transform(df[num_cols]),  # iterative imputation needs numeric input
    columns=num_cols
)

Encoding Categorical Variables

Machine learning algorithms require numerical input. Categorical variables need encoding before modeling.

Label Encoding

from sklearn.preprocessing import LabelEncoder

# Simple label encoding
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])

# Preserve mappings for later use
category_mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(category_mapping)  # {'A': 0, 'B': 1, 'C': 2}
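One caveat: scikit-learn documents LabelEncoder as intended for target labels. For input features, OrdinalEncoder is the feature-oriented counterpart, since it accepts 2-D input and can map categories unseen at fit time to a sentinel value. A minimal sketch:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'category': ['B', 'A', 'C', 'A']})

# Fit on 2-D input; unseen categories map to -1 instead of raising an error.
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
df['category_encoded'] = enc.fit_transform(df[['category']])

print(df['category_encoded'].tolist())                   # [1.0, 0.0, 2.0, 0.0]
print(enc.transform(pd.DataFrame({'category': ['Z']})))  # [[-1.]]
```

The sentinel behavior matters in production, where new categories routinely appear after training.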

One-Hot Encoding

# Pandas get_dummies
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')

# sklearn OneHotEncoder (better for pipelines)
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['category']])
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(['category'])
)
df = pd.concat([df, encoded_df], axis=1)

Target Encoding

# Mean encoding (use with caution to prevent leakage)
target_means = df.groupby('category')['target'].mean()
df['category_target_encoded'] = df['category'].map(target_means)

# Smoothed target encoding
global_mean = df['target'].mean()
smooth_factor = 10

category_stats = df.groupby('category')['target'].agg(['sum', 'count'])
smoothed = (
    (category_stats['sum'] + smooth_factor * global_mean)
    / (category_stats['count'] + smooth_factor)
)
df['category_smoothed'] = df['category'].map(smoothed)
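One common way to reduce the leakage mentioned above is out-of-fold target encoding: compute category means on the training folds only and apply them to the held-out fold, so no row ever sees its own target. A sketch on synthetic data (the categories, targets, and fold count are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic data; the categories and targets are purely illustrative.
df = pd.DataFrame({
    'category': list('ABCAB') * 20,
    'target': np.tile([1, 0, 1, 0, 1], 20),
})
global_mean = df['target'].mean()

df['category_oof'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Means come only from the training folds...
    fold_means = df.iloc[train_idx].groupby('category')['target'].mean()
    # ...and are applied to the held-out fold; categories unseen in the
    # training folds fall back to the global mean.
    df.loc[val_idx, 'category_oof'] = (
        df.loc[val_idx, 'category'].map(fold_means).fillna(global_mean)
    )

print(df['category_oof'].isna().sum())  # 0
```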

Ordinal Encoding

# When categories have natural order
size_mapping = {'small': 0, 'medium': 1, 'large': 2, 'extra_large': 3}
df['size_ordinal'] = df['size'].map(size_mapping)

Feature Scaling

Many algorithms perform better when features have similar scales. Choose scaling method based on your data distribution.

StandardScaler (Z-score normalization)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df['value_scaled'] = scaler.fit_transform(df[['value']])

# Access parameters
mean = scaler.mean_[0]  # Mean used
std = scaler.scale_[0]   # Standard deviation used

MinMaxScaler (0-1 range)

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df['value_normalized'] = scaler.fit_transform(df[['value']])

# Custom range
scaler = MinMaxScaler(feature_range=(0, 10))
df['value_custom'] = scaler.fit_transform(df[['value']])

RobustScaler (outlier-resistant)

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0))
df['value_robust'] = scaler.fit_transform(df[['value']])

When to Use Each Scaler

Scaler               Best For                Sensitive to Outliers
StandardScaler       Normal distributions    Yes
MinMaxScaler         Bounded distributions   Yes
RobustScaler         Presence of outliers    No
PowerTransformer     Skewed distributions    Moderate
QuantileTransformer  Uniform distributions   No
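PowerTransformer and QuantileTransformer appear in the table but were not shown above. A brief sketch of both on synthetic right-skewed data:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))  # right-skewed

# PowerTransformer (Yeo-Johnson by default) reshapes skewed data toward a
# normal distribution and standardizes it to zero mean, unit variance.
pt = PowerTransformer()
normalish = pt.fit_transform(skewed)
print(abs(float(normalish.mean())) < 1e-6)  # True

# QuantileTransformer maps values onto a uniform [0, 1] range by rank,
# which flattens outliers instead of letting them stretch the scale.
qt = QuantileTransformer(n_quantiles=100, output_distribution='uniform')
uniform = qt.fit_transform(skewed)
print(float(uniform.min()), float(uniform.max()))  # 0.0 1.0
```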

Handling Outliers

Outliers can skew statistics and model performance. Handle them based on domain knowledge and analysis goals.

Detection Methods

import numpy as np

# Z-score method
def detect_zscore_outliers(series, threshold=3):
    z_scores = (series - series.mean()) / series.std()
    return np.abs(z_scores) > threshold

# IQR method
def detect_iqr_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    return (series < lower) | (series > upper)

# Isolation Forest
from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(contamination=0.05, random_state=42)
outlier_labels = iso_forest.fit_predict(df[['value']])
df['is_outlier'] = outlier_labels == -1

Treatment Options

# Option 1: Remove outliers
df_cleaned = df[~detect_zscore_outliers(df['value'])]

# Option 2: Cap values
upper_cap = df['value'].quantile(0.95)
lower_cap = df['value'].quantile(0.05)
df['value_capped'] = df['value'].clip(lower_cap, upper_cap)

# Option 3: Transform (log for right-skewed data)
df['value_log'] = np.log1p(df['value'])  # log(1+x) to handle zeros

# Option 4: Bin into categories
df['value_bin'] = pd.qcut(df['value'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

Building Preprocessing Pipelines

Combine multiple preprocessing steps into reproducible pipelines.

sklearn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define preprocessing for different column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['city', 'education', 'occupation']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Complete pipeline with model
from sklearn.ensemble import RandomForestClassifier

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
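A minimal end-to-end sketch of fitting such a pipeline, using toy data with the same mixed-type shape (all column names and values hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Toy data with missing entries in both numeric and categorical columns.
X = pd.DataFrame({
    'age': [25, 40, None, 55, 33, 48],
    'income': [30_000, 80_000, 52_000, None, 61_000, 75_000],
    'city': ['NY', 'LA', 'NY', np.nan, 'SF', 'LA'],
})
y = [0, 1, 0, 1, 1, 1]

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                            ('scaler', StandardScaler())]), ['age', 'income']),
    ('cat', Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                            ('onehot', OneHotEncoder(handle_unknown='ignore'))]), ['city']),
])

model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# fit() runs imputation, scaling, and encoding before training; predict()
# replays the exact fitted transformations on new rows.
model_pipeline.fit(X, y)
new_row = pd.DataFrame({'age': [29], 'income': [45_000], 'city': ['NY']})
print(model_pipeline.predict(new_row))
```

Because the transformations are fitted inside the pipeline, the same object can be pickled and reused at inference time without any risk of train/serve skew.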

Custom Transformers

from sklearn.base import BaseEstimator, TransformerMixin

class DateTimeFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extract features from datetime column."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = pd.DataFrame(X).copy()
        dt = pd.to_datetime(df[self.column])
        # Build one derived column per datetime component
        return pd.DataFrame({
            'year': dt.dt.year,
            'month': dt.dt.month,
            'dayofweek': dt.dt.dayofweek,
            'hour': dt.dt.hour,
        }).values

# Usage
custom_transformer = DateTimeFeatureExtractor(column='timestamp')

ColumnTransformer for Mixed Data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
import pandas as pd

# Define columns for each transformation
numeric_cols = ['age', 'income', 'score']
categorical_cols = ['category', 'region']

# Build preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols)
    ])

# Fit and transform (X is the feature DataFrame)
X = df[numeric_cols + categorical_cols]
X_processed = preprocessor.fit_transform(X)

# Get feature names
num_feature_names = numeric_cols
cat_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_cols)
all_features = list(num_feature_names) + list(cat_feature_names)

Text Data Preprocessing

Text requires special preprocessing before analysis or modeling.

Basic Text Cleaning

import re
import string

def clean_text(text):
    if pd.isna(text):
        return ""
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

df['cleaned_text'] = df['raw_text'].apply(clean_text)

Tokenization and Stopword Removal

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk

nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
nltk.download('stopwords', quiet=True)

stop_words = set(stopwords.words('english'))

def tokenize_and_filter(text):
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in stop_words and t.isalpha()]

df['tokens'] = df['cleaned_text'].apply(tokenize_and_filter)

TF-IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=1000,
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,
    max_df=0.95
)

tfidf_matrix = tfidf.fit_transform(df['cleaned_text'])
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf.get_feature_names_out()
)

Data Validation After Preprocessing

Always validate your preprocessed data before modeling.

def validate_preprocessed_data(df):
    """Run basic schema and sanity checks on preprocessed data."""
    checks = []

    # Check for remaining missing values
    missing = df.isnull().sum().sum()
    checks.append(('no_missing_values', missing == 0))

    # Check data types
    expected_types = {
        'age': 'float64',
        'income': 'float64',
        'category_encoded': 'int64'
    }
    for col, expected in expected_types.items():
        actual = str(df[col].dtype)
        checks.append((f'{col}_correct_type', actual == expected))

    # Check value ranges
    checks.append(('age_in_range', df['age'].between(0, 120).all()))
    checks.append(('income_non_negative', (df['income'] >= 0).all()))

    # Report results
    for check_name, result in checks:
        status = "PASS" if result else "FAIL"
        print(f"{check_name}: {status}")

    return all(result for _, result in checks)

is_valid = validate_preprocessed_data(df_processed)

Summary

Data preprocessing turns raw data into an analysis-ready format. Key techniques covered in this guide:

Missing value handling requires understanding why data is missing. Choose strategies based on the missing data type and business context. Imputation with median, mode, or advanced methods like KNN can fill gaps reasonably.

Categorical encoding converts non-numeric data into model-compatible formats. One-hot encoding works for nominal categories. Target encoding captures category-target relationships but requires careful handling to prevent leakage.

Feature scaling normalizes input ranges for better algorithm performance. StandardScaler works for normal distributions. RobustScaler handles outliers better. Choose based on your data characteristics.

Building pipelines ensures reproducibility. sklearn Pipeline and ColumnTransformer combine preprocessing steps. Custom transformers extend functionality for specific needs.

Outlier treatment depends on your goals. Sometimes outliers contain useful information. Sometimes they obscure patterns. Analyze before deciding.

Text preprocessing requires specialized techniques. Tokenization, stopword removal, and vectorization prepare text for analysis.

For more data science content, check our guides on Pandas memory optimization and machine learning fundamentals.

