Chapter 4: Data Manipulation & Analysis

Data Manipulation & Analysis, explained in full detail like we’re in a cozy café in Airoli — laptop open, me pointing at code cells in Jupyter Notebook, and explaining step by step why each technique matters for real data science work in 2026. This chapter is where theory turns practical: 70–80% of a data scientist’s time is spent here (cleaning, shaping, understanding data), not fancy modeling.

We'll use NumPy for fast array math and Pandas for tabular data (the bread-and-butter tools). Then we'll tackle cleaning, and finish with a full EDA workflow on a realistic example.

Setup note (do this now if not already):

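If the stack isn't installed yet, a typical pip install looks like this (swap in the conda equivalents if you're on Anaconda):

```bash
pip install numpy pandas matplotlib seaborn jupyter
```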

Or use Anaconda. Work in Jupyter Notebook/Lab for interactive magic.

1. NumPy (Arrays, Broadcasting, Vectorization)

NumPy = Numerical Python. It’s the foundation under Pandas, Scikit-learn, PyTorch — everything fast in data science.

Why NumPy instead of Python lists? Lists are slow for math on thousands/millions of numbers. NumPy uses C under the hood → 10–100x faster.

ndarray (n-dimensional array) — core object.

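A minimal sketch of ndarray basics: creation, inspection, and element-wise math (all values illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])            # 1D array from a list
mat = np.array([[1, 2, 3], [4, 5, 6]])     # 2D array: 2 rows, 3 columns

print(mat.shape)   # (2, 3)
print(mat.ndim)    # 2
print(arr.dtype)   # int64 on most platforms

print(arr * 2)     # element-wise: [ 2  4  6  8 10], no Python loop
print(arr + 10)    # [11 12 13 14 15]
```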

Broadcasting — magic that lets you operate on arrays of different shapes (without copying data).

Rule: compare shapes from the right; two dimensions are compatible if they're equal or one of them is 1 (the 1 gets stretched to match).

Example (very common in ML feature scaling):

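A sketch of column-wise standardization on a made-up (4, 3) feature matrix; the per-column stats have shape (3,), and broadcasting stretches them across all four rows:

```python
import numpy as np

X = np.array([[1.0,  200.0,  3.0],
              [4.0,  500.0,  6.0],
              [7.0,  800.0,  9.0],
              [10.0, 1100.0, 12.0]])   # shape (4, 3): 4 samples, 3 features

mean = X.mean(axis=0)    # shape (3,): one mean per column
std = X.std(axis=0)      # shape (3,): one std per column

X_scaled = (X - mean) / std   # (4, 3) op (3,) → the (3,) row broadcasts over rows
print(X_scaled.mean(axis=0))  # approx [0. 0. 0.]: columns now centered
```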

Vectorization — avoid Python loops; do everything array-wise.

Slow bad way:

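A sketch with a million random numbers (size is illustrative):

```python
import numpy as np

data = np.random.rand(1_000_000)

# Element-by-element Python loop: every iteration pays interpreter overhead
squared = []
for x in data:
    squared.append(x ** 2)
```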

Fast good way:

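The same computation on the same data array, vectorized, so the loop runs in compiled C:

```python
squared = data ** 2   # element-wise square in one call, typically 10–100x faster

# Reductions vectorize too:
total = data.sum()    # instead of accumulating in a Python loop
```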

In 2026: PyTorch and other GPU frameworks model their tensor APIs on NumPy — learn it well and they'll feel familiar.

2. Pandas (DataFrames, Series, Indexing, Merging, Grouping, Pivot Tables)

Pandas = Excel on steroids + Python power.

Series — 1D labeled array (like a single column).

DataFrame — 2D table with labeled rows/columns.

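A small sketch with made-up people data; the same Age / City / Salary columns reappear in the examples below:

```python
import pandas as pd

# Series: a 1D labeled array
ages = pd.Series([25, 32, 41], index=['Asha', 'Ravi', 'Meera'])
print(ages['Ravi'])   # 32, accessed by label

# DataFrame: a 2D labeled table
df = pd.DataFrame({
    'Name': ['Asha', 'Ravi', 'Meera', 'Vikram'],
    'Age': [25, 32, 41, 29],
    'City': ['Mumbai', 'Pune', 'Mumbai', 'Delhi'],
    'Salary': [50000, 72000, 95000, 61000],
})
print(df.head())      # first rows
df.info()             # dtypes and non-null counts
```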

Indexing & Selection (many ways — know these!)

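Continuing with the df above, these are the selection idioms you'll use daily:

```python
df['Age']                       # one column → Series
df[['Name', 'Salary']]          # list of columns → DataFrame

df.loc[0, 'City']               # label-based: row label 0, column 'City'
df.iloc[0, 2]                   # position-based: first row, third column

df[df['Age'] > 30]              # boolean mask: rows where Age > 30
df.loc[df['City'] == 'Mumbai', ['Name', 'Salary']]   # mask + column subset
```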

Merging / Joining (like SQL joins)

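A sketch joining df to a hypothetical city-level table on the shared 'City' key:

```python
city_info = pd.DataFrame({
    'City': ['Mumbai', 'Pune', 'Delhi'],
    'Region': ['West', 'West', 'North'],
})

# how='left' keeps every row of df, like a SQL LEFT JOIN
merged = pd.merge(df, city_info, on='City', how='left')
print(merged[['Name', 'City', 'Region']])
```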

Grouping & Aggregation (group by city, see average salary)

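A sketch on the same df:

```python
df.groupby('City')['Salary'].mean()   # average salary per city

# Several aggregations at once, with readable output names
df.groupby('City').agg(
    avg_salary=('Salary', 'mean'),
    headcount=('Name', 'count'),
)
```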

Pivot Tables — like Excel pivot (summarize fast)

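A sketch; the 'Level' column is invented here just to give the pivot a second dimension:

```python
df['Level'] = ['Junior', 'Senior', 'Senior', 'Junior']   # illustrative column

# Mean salary for each City × Level combination
pd.pivot_table(df, values='Salary', index='City', columns='Level',
               aggfunc='mean', fill_value=0)
```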

3. Data Cleaning & Wrangling (Missing Values, Outliers, Duplicates)

Real data is messy — expect to spend 60–80% of your time here.

Missing values (NaN)

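A sketch of the usual options, assuming df has some NaNs; whether to impute or drop depends on why the data is missing:

```python
df.isnull().sum()    # NaN count per column

df['Age'] = df['Age'].fillna(df['Age'].median())        # numeric → median
df['City'] = df['City'].fillna(df['City'].mode()[0])    # categorical → mode

df = df.dropna(subset=['Salary'])   # drop rows missing the key column
```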

Duplicates

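A sketch:

```python
df.duplicated().sum()        # how many fully duplicated rows

df = df.drop_duplicates()    # drop exact duplicates, keep the first occurrence

# Dedupe on a subset of columns, e.g. one row per Name
df = df.drop_duplicates(subset=['Name'], keep='first')
```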

Outliers (extreme values that distort statistics and models)

Common methods:

  • IQR method (robust) → see the sketch below
  • Z-score (for normal-ish data): |z| > 3 → outlier
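A sketch of both methods on a numeric column like Salary:

```python
# IQR method: flag anything beyond 1.5 × IQR past the quartiles
q1, q3 = df['Salary'].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers_iqr = df[(df['Salary'] < lower) | (df['Salary'] > upper)]

# Z-score method: standardize, then flag |z| > 3
z = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers_z = df[z.abs() > 3]
```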

Other wrangling:

  • Type conversion: df['Age'] = df['Age'].astype(int)
  • String cleaning: df['City'] = df['City'].str.strip().str.lower()
  • Rename: df.rename(columns={'Salary': 'Annual_Salary'}, inplace=True)

4. Exploratory Data Analysis (EDA) Workflow

EDA = get to know your data deeply before modeling.

Standard workflow (follow this every time):

  1. Load & Inspect
    • df = pd.read_csv('data.csv'), then df.head(), df.info(), df.describe() (full sketch after this list)
  2. Check Quality
    • Missing: df.isnull().sum() / len(df) * 100 (% missing)
    • Duplicates: df.duplicated().sum()
    • Cardinality: df.nunique() (how many unique values per column)
  3. Univariate Analysis (one variable)
    • Numerical: histogram, boxplot, density, e.g. df['Salary'].hist(bins=30) or sns.boxplot(x=df['Salary'])
    • Categorical: value_counts, barplot, e.g. df['City'].value_counts().plot(kind='bar')
  4. Bivariate / Multivariate
    • Correlation: df.corr(numeric_only=True) → heatmap via sns.heatmap(df.corr(numeric_only=True), annot=True)
    • Scatter: sns.scatterplot(x='Age', y='Salary', hue='City', data=df)
    • Pairplot: sns.pairplot(df, hue='City') (powerful overview)
  5. Feature Insights & Questions
    • Groupbys: average salary by city/age bin
    • Cross-tabs: pd.crosstab(df['City'], pd.cut(df['Age'], bins=5))
    • Derive features: df['Age_Group'] = pd.cut(df['Age'], bins=[20,30,40,50,100])
  6. Document Findings — notebook markdown cells: “Salary skewed right — many low earners, few high. Outliers above ₹20L possible CEOs.”
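Putting steps 1–4 together: a minimal sketch assuming a file called data.csv with the Age / City / Salary columns from earlier:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load & inspect
df = pd.read_csv('data.csv')   # hypothetical file
print(df.head())
df.info()
print(df.describe())

# 2. Quality checks
print(df.isnull().sum() / len(df) * 100)   # % missing per column
print(df.duplicated().sum())
print(df.nunique())

# 3. Univariate
df['Salary'].hist(bins=30)
plt.show()
sns.countplot(x='City', data=df)
plt.show()

# 4. Bivariate / multivariate
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.show()
sns.scatterplot(x='Age', y='Salary', hue='City', data=df)
plt.show()
```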

Realistic mini-EDA example (imagine we have a Mumbai housing CSV — common in India DS jobs)

Columns: Price, Area_sqft, Bedrooms, Location, Age_years, etc.

Typical discoveries:

  • Price highly right-skewed → log transform for modeling.
  • Strong corr between Area & Price (0.85).
  • Navi Mumbai cheaper than South Mumbai.
  • Missing in Location → impute with mode or group by pincode.
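And the matching fixes in code, a sketch assuming hypothetical Price and Location columns:

```python
import numpy as np

# Right-skewed target → log transform (log1p handles zeros safely)
df['Log_Price'] = np.log1p(df['Price'])

# Missing categorical → impute with the most frequent value
df['Location'] = df['Location'].fillna(df['Location'].mode()[0])
```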

That’s Chapter 4 — the daily grind that makes or breaks projects!

Practice tip: Download Titanic from Kaggle (https://www.kaggle.com/c/titanic/data — train.csv), or India house prices (search Kaggle “House Price India”), and run full EDA.
