Complete Guide to Python Data Analysis: From Pandas to Visualization

An in-depth comparison of core data analysis tools like Pandas and NumPy, covering the complete workflow from data cleaning to visualization, with practical code examples and modern best practices.

Python’s dominance in data analysis is hard to shake. It’s not just about clean syntax—it’s the mature ecosystem of tools. If you work with data, Python is essentially unavoidable.

Why Use Python for Data Analysis?

R is popular in statistics, but Python has broader applicability. Here are the practical reasons:

Python is easy to learn. The syntax resembles natural language, making it accessible for beginners while not limiting experienced users.

Comprehensive format support. CSV, Excel, JSON, SQL, Parquet—basically any format you encounter can be handled. From simple data cleaning to complex machine learning models, one toolkit does it all.

Big data isn’t a problem. Small datasets work directly in Pandas; for larger ones there are Dask and PySpark. Running out of memory? Dask processes data in chunks.

Active community. Most issues can be resolved on Stack Overflow or GitHub. Library documentation is generally comprehensive.

Mature machine learning ecosystem. TensorFlow, PyTorch, Keras—all major frameworks are Python-based. Data analysis and modeling happen in the same environment without switching.

NumPy vs Pandas: Which Should You Use?

These two libraries are often compared, but they solve different problems.

NumPy’s Position

NumPy focuses on numerical computation. Its core is the multidimensional array, with all operations centered around arrays.

Speed is NumPy’s biggest advantage. When processing pure numerical data, NumPy arrays are significantly faster than Pandas DataFrames. Memory usage is also lower because array data types are uniform.

Mathematical operations are straightforward in NumPy. Matrix multiplication, statistical calculations, linear algebra operations—all have ready-made functions. Almost all machine learning libraries use NumPy under the hood because it’s fast.

import numpy as np

# Create a 3x3 array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Statistical calculations
mean = np.mean(data)
std = np.std(data)

# Matrix transpose
transposed = data.T

Use NumPy when: processing homogeneous numerical data, requiring high-performance computation, or as the underlying data structure for machine learning models.

Pandas’ Position

Pandas is built on top of NumPy, but its focus is data manipulation and analysis. The DataFrame structure is closer to Excel spreadsheets or SQL tables, making it intuitive to use.

Pandas can mix different data types in the same DataFrame. One column is strings, another is numbers, another is dates—no problem. This is common in real-world work.

Data cleaning is Pandas’ strength. Missing value handling, duplicate removal, data type conversion—all have concise APIs. You can access data by column name or index, more flexible than pure arrays.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'salary': [50000, 60000, 70000]
})

# Filter and aggregate
high_earners = df[df['salary'] > 55000]
avg_age = df['age'].mean()

# Group by age and compute mean salary
grouped = df.groupby('age')['salary'].mean()

Use Pandas when: cleaning data, exploratory analysis, handling tabular data or time series.

Practical Choice

Honestly, most projects use both libraries. Pandas handles data processing and cleaning, NumPy handles underlying numerical computation. They’re not competitors—they’re complementary.

How to Choose Visualization Tools?

Python has many visualization libraries, but three are mainstream: Matplotlib, Seaborn, Plotly. Each has its use cases.

Matplotlib: Low-Level Control

Matplotlib is the most basic visualization library. If you need precise control over every detail of a chart, use it.

The downside is code volume. Drawing a simple scatter plot might require a dozen lines. The learning curve is steep, but once mastered, you can create any chart you want.

import matplotlib.pyplot as plt

plt.scatter(x='gdpPercap', y='lifeExp', data=df)
plt.title('GDP vs Life Expectancy')
plt.xlabel('GDP per Capita')
plt.ylabel('Life Expectancy')
plt.show()

Suitable for scenarios requiring high customization, like academic papers or publication-quality charts.

Seaborn: Fast and Beautiful

Seaborn is built on Matplotlib but with a much simpler API. Default styles are also more modern.

The biggest advantage is automatic handling of many details. For example, when drawing scatter plots, the hue parameter automatically colors by category and generates legends. Matplotlib requires loops for the same effect.

import seaborn as sns

# Automatically adds a legend and per-category colors
sns.scatterplot(x='gdpPercap', y='lifeExp', data=df, hue='continent')
plt.title('GDP vs Life Expectancy by Continent')
plt.show()

Seaborn also provides advanced statistical charts like heatmaps, violin plots, and pair plots. These are useful in exploratory data analysis.

You can mix Seaborn and Matplotlib. Use Seaborn to quickly generate charts, then use Matplotlib functions to fine-tune details.
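A minimal sketch of that mix, with toy data standing in for the gapminder-style columns used above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data standing in for the real dataset
df = pd.DataFrame({
    "gdpPercap": [500, 2000, 9000, 30000],
    "lifeExp": [50, 62, 71, 80],
    "continent": ["Africa", "Asia", "Europe", "Europe"],
})

# Seaborn generates the chart quickly...
ax = sns.scatterplot(data=df, x="gdpPercap", y="lifeExp", hue="continent")

# ...then plain Matplotlib calls fine-tune the details
ax.set_xscale("log")                 # log scale suits the skewed GDP axis
ax.set_title("GDP vs Life Expectancy")
ax.spines["top"].set_visible(False)  # strip chart junk
ax.spines["right"].set_visible(False)
plt.tight_layout()
```

The Agg backend line just keeps the script runnable without a display; drop it in a notebook.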

Plotly: Interactive Experience

Plotly’s defining feature is interactivity. Hovering reveals individual data point values, and charts support zooming, panning, and image export.

import plotly.express as px

fig = px.scatter(df, x='gdpPercap', y='lifeExp',
                 color='continent', hover_name='country')
fig.show()

If you’re building data dashboards or presenting to non-technical people, Plotly is a good choice. Charts can be exported as HTML files and opened in browsers for interaction.

The downside is large file sizes, and sometimes performance issues in Jupyter Notebook.

My Recommendation

If you’re just starting, use Plotly. Simple code, intuitive results.

For exploratory analysis, use Seaborn. Quickly generate various statistical charts to help understand data.

When you need precise control or publication-quality charts, use Matplotlib. Although code-heavy, it can achieve any effect.

Practical Data Analysis Workflow

With the theory covered, let’s look at practice. This is a typical data analysis workflow.

Step 1: Load Data

import pandas as pd
import numpy as np

# Read a CSV file
df = pd.read_csv('data.csv')

# Quick look at the data
print(df.head())
print(df.info())

head() displays the first few rows; info() shows each column’s data type and non-null count, which reveals where values are missing. These two functions quickly give you a basic understanding of the data.

Step 2: Clean Data

Missing values are the most common issue. Handling depends on the specific situation.

# Check for missing values
print(df.isna().sum())

# Drop columns that are entirely NaN
df = df.dropna(axis=1, how='all')

# Drop rows containing any NaN
df = df.dropna()

# Or fill with the mean (assignment avoids the deprecated inplace pattern)
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())

Drop rows or fill values? It depends on business logic. If missing data is minimal, direct deletion is fine. If the missing proportion is high, you might need filling or more complex imputation methods.
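For the filling case, the fill statistic matters. A sketch of two common strategies, with invented column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "LA"],
    "price": [100.0, np.nan, 80.0, 90.0, np.nan],
})

# Median is more robust to outliers than the mean
df["price_median"] = df["price"].fillna(df["price"].median())

# Group-wise fill: use each city's own median instead of the global one
df["price_by_city"] = df["price"].fillna(
    df.groupby("city")["price"].transform("median")
)

print(df)
```

The group-wise version usually gives more plausible values when groups differ systematically, at the cost of an extra pass over the data.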

Step 3: Explore Data

Don’t rush into complex analysis. First use simple statistics and visualization to understand the data.

# Summary statistics
print(df.describe())

# Distribution plot
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.histplot(data=df, x='price', bins=30)
plt.title('Price Distribution')
plt.show()

describe() shows mean, standard deviation, min, max, etc. These numbers help you discover outliers or data range issues.
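A common follow-up is flagging outliers with the 1.5 * IQR rule of thumb (a heuristic, not a hard law):

```python
import pandas as pd

prices = pd.Series([10, 12, 11, 13, 12, 11, 95])  # 95 looks suspicious

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1

# Classic Tukey fences: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(outliers.tolist())  # → [95]
```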

Step 4: Transform Data

Transform data according to analysis needs.

# Filter rows
high_value = df[df['price'] > 100]

# Create price categories
df['price_category'] = pd.cut(df['price'],
                               bins=[0, 50, 100, 200],
                               labels=['Low', 'Medium', 'High'])

# Group and aggregate
grouped = df.groupby('category')['price'].agg(['mean', 'median', 'std'])

Group aggregation is heavily used in real work. For example, calculating sales by region or user activity by time period.
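For instance, sales by region might look like this (data and column names invented for illustration):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "amount": [100, 150, 200, 50, 250],
})

# Total and average sales per region
by_region = sales.groupby("region")["amount"].agg(["sum", "mean"])
print(by_region)
```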

Step 5: Correlation Analysis

Correlation matrices quickly reveal relationships between variables.

# Select numeric columns
numeric_df = df.select_dtypes(include=['float64', 'int64'])

# Compute pairwise correlations
correlation = numeric_df.corr()

# Heatmap; the boolean mask hides the redundant upper triangle
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(correlation, dtype=bool))
sns.heatmap(correlation, annot=True, fmt='.2f',
            mask=mask, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Correlation coefficients above 0.6 (in absolute value) are generally considered strong; 0.3 to 0.6 is moderate. But remember, correlation doesn’t equal causation.

Modern Best Practices

These are practices I’ve found useful in real projects.

Type Hints

Python 3.5+ supports type hints. While not mandatory, they make code clearer and enable IDE type checking.

from typing import List, Dict
import pandas as pd

def analyze_data(df: pd.DataFrame, columns: List[str]) -> Dict[str, float]:
    """分析指定列的统计信息"""
    return {col: df[col].mean() for col in columns}

Method Chaining

Pandas supports method chaining, making code more compact.

result = (df
    .query('price > 100')
    .groupby('category')
    .agg({'price': ['mean', 'std'], 'quantity': 'sum'})
    .reset_index()
)

But don’t overuse it. If chains are too long, they become hard to read. Generally keep it under 5 operations.

Jupyter Notebook

Jupyter Notebook is convenient for exploratory analysis. You can write code and see results simultaneously, plus use Markdown to record thoughts.

But Notebook isn’t suitable for production code. After analysis, extract core logic into Python scripts or modules.

Data Quality Checks

Develop a habit of checking data quality.

def check_data_quality(df: pd.DataFrame) -> None:
    print(f"数据形状: {df.shape}")
    print(f"缺失值总数: {df.isna().sum().sum()}")
    print(f"重复行数: {df.duplicated().sum()}")
    print(f"数据类型:\n{df.dtypes}")

This function quickly identifies data issues. Running it before analysis starts can avoid many subsequent problems.

Handling Large Data

Pandas processes data in memory. If a dataset exceeds available memory, you run into trouble.

Several solutions:

Dask can process datasets larger than memory by computing over chunks in parallel. Its API is similar to Pandas, so migration cost is low.

Polars is a newer DataFrame library that is often much faster than Pandas. But its ecosystem isn’t as mature as Pandas’ yet.

Optimizing data types also saves memory. For example, changing int64 to int32, or string columns to category type.

# Optimize data types
df['category'] = df['category'].astype('category')
df['price'] = df['price'].astype('float32')
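You can measure the effect with `memory_usage(deep=True)`; the column here is a made-up low-cardinality example, the classic candidate for the category type:

```python
import pandas as pd

# A low-cardinality string column repeated many times
df = pd.DataFrame({"category": ["electronics", "books", "electronics"] * 10_000})

before = df["category"].memory_usage(deep=True)
df["category"] = df["category"].astype("category")
after = df["category"].memory_usage(deep=True)

# Category stores each distinct string once plus small integer codes
print(f"before: {before:,} bytes, after: {after:,} bytes")
```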

Reproducibility

Set random seeds and record environment information. This way others (or you months later) can reproduce results.

# Fix the random seed
np.random.seed(42)

# Environment info
import sys
print(f"Python 版本: {sys.version}")
print(f"Pandas 版本: {pd.__version__}")
print(f"NumPy 版本: {np.__version__}")

Practical Tips

These are experiences I’ve summarized from real work.

Explore before analyzing. Don’t jump straight into modeling. Use describe(), info() and simple charts to understand data—this avoids many basic mistakes.

Handle outliers carefully. Box plots quickly identify outliers, but don’t blindly delete them. Sometimes outliers are the most valuable information.

Feature engineering matters more than models. A good feature is often more effective than a complex model. Spend time understanding business logic and creating meaningful features.
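As a tiny illustration with invented columns, a derived ratio often carries more signal than the raw totals it comes from:

```python
import pandas as pd

orders = pd.DataFrame({
    "revenue": [1000.0, 500.0, 1200.0],
    "n_items": [10, 2, 30],
})

# Average price per item captures purchasing behavior the raw totals hide
orders["avg_item_price"] = orders["revenue"] / orders["n_items"]
print(orders["avg_item_price"].tolist())  # → [100.0, 250.0, 40.0]
```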

Document the analysis process. Use Markdown cells in Jupyter Notebook to record your thoughts and findings. You’ll thank yourself when reviewing months later.

Use Git for code management. Even for analysis scripts, version control is worthwhile. You can track modification history and facilitate team collaboration.

Summary

Python’s data analysis toolkit is mature. NumPy handles numerical computation, Pandas handles data processing, Matplotlib, Seaborn, and Plotly handle visualization. Each tool has its position.

Choose tools based on specific needs. NumPy for numerical computation, Pandas for data processing, Seaborn for quick charts, Plotly for interactive displays. Real projects typically use them in combination.

But tools are just tools. What truly matters is understanding business problems, knowing what data means, and experience accumulated through practice. These are the core competencies of data analysis.
