Chapter 4: Data Manipulation & Analysis
Data Manipulation & Analysis, explained in full detail like we’re in a cozy café in Airoli — laptop open, me pointing at code cells in Jupyter Notebook, and explaining step by step why each technique matters for real data science work in 2026. This chapter is where theory turns practical: 70–80% of a data scientist’s time is spent here (cleaning, shaping, understanding data), not fancy modeling.
We’ll use NumPy for fast array math and Pandas for tabular data (the bread-and-butter tools). Then cleaning, and finally a full EDA workflow with a realistic example.
Setup note (do this now if not already):
```bash
pip install numpy pandas jupyter
```
Or use Anaconda. Work in Jupyter Notebook/Lab for interactive magic.
1. NumPy (Arrays, Broadcasting, Vectorization)
NumPy = Numerical Python. It’s the foundation under Pandas, Scikit-learn, PyTorch — everything fast in data science.
Why NumPy instead of Python lists? Lists are slow for math on thousands/millions of numbers. NumPy uses C under the hood → 10–100x faster.
ndarray (n-dimensional array) — core object.
```python
import numpy as np

# Create arrays
arr = np.array([10, 20, 30, 40])              # 1D
matrix = np.array([[1, 2], [3, 4], [5, 6]])   # 2D

print(arr.shape)     # (4,)
print(matrix.shape)  # (3, 2)
print(matrix.ndim)   # 2

# Special arrays (super common)
zeros = np.zeros((3, 4))          # 3×4 matrix of 0s
ones = np.ones((2, 5))            # useful for initialization
arange = np.arange(0, 100, 5)     # like range(): 0, 5, 10, ..., 95
linspace = np.linspace(0, 1, 11)  # 11 evenly spaced points from 0.0 to 1.0

# Math operations — element-wise (vectorized!)
arr2 = arr * 2       # [20, 40, 60, 80] — no loops!
arr3 = np.sqrt(arr)  # square root of each element
```
Broadcasting — magic that lets you operate on arrays of different shapes (without copying data).
Rule: shapes are compared from the rightmost dimension; each pair of dimensions must either match or be 1 (the 1 gets stretched to fit).
Example (very common in ML feature scaling):
```python
prices = np.array([1200000, 850000, 2200000, 450000])  # house prices in Navi Mumbai
mean_price = prices.mean()       # scalar
centered = prices - mean_price   # broadcasting: subtract scalar from array

# Matrix example
X = np.random.randn(100, 3)   # 100 samples, 3 features
means = X.mean(axis=0)        # shape (3,) — mean per column
X_centered = X - means        # broadcasts means across 100 rows
```
Vectorization — avoid Python loops; do everything array-wise.
Slow bad way:
```python
squared = []
for x in arr:
    squared.append(x ** 2)
```
Fast good way:
```python
squared = arr ** 2  # vectorized, blazing fast
```
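Want to see the speed gap yourself? Here's a minimal timing sketch (exact numbers depend on your machine, but the vectorized version is typically tens of times faster):

```python
import time
import numpy as np

big = np.arange(1_000_000)

start = time.perf_counter()
squared_loop = [x ** 2 for x in big]   # pure-Python loop
loop_time = time.perf_counter() - start

start = time.perf_counter()
squared_vec = big ** 2                 # vectorized NumPy
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.4f}s")
```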
In 2026, NumPy's array model is still the mental model behind PyTorch tensors and other GPU libraries — learn it well.
2. Pandas (DataFrames, Series, Indexing, Merging, Grouping, Pivot Tables)
Pandas = Excel on steroids + Python power.
Series — 1D labeled array (like column).
DataFrame — 2D table with labeled rows/columns.
```python
import pandas as pd

# Create from dict (common)
data = {
    'Name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'Age': [28, 32, 25, 41],
    'City': ['Airoli', 'Mumbai', 'Pune', 'Thane'],
    'Salary': [850000, 1200000, 650000, 1800000]
}
df = pd.DataFrame(data)
print(df)
#     Name  Age    City   Salary
# 0  Rahul   28  Airoli   850000
# 1  Priya   32  Mumbai  1200000
# ...
```
Indexing & Selection (many ways — know these!)
```python
# Columns
df['Age']               # Series
df[['Name', 'Salary']]  # DataFrame subset

# loc (label-based) — preferred
df.loc[1, 'City']          # 'Mumbai'
df.loc[:, 'Age':'Salary']  # all rows, columns from Age to Salary

# iloc (integer position)
df.iloc[0:2, 0:3]  # first 2 rows, first 3 columns

# Boolean masking (filtering — super powerful)
high_salary = df[df['Salary'] > 1000000]
young_in_mumbai = df[(df['Age'] < 30) & (df['City'] == 'Mumbai')]
```
Merging / Joining (like SQL joins)
```python
# Suppose another DF with bonus info
bonus = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Bonus': [50000, 120000, 30000]
})

# Merge on 'Name'
df_merged = pd.merge(df, bonus, on='Name', how='left')  # left join
# how='inner', 'outer', 'right' also available
```
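A quick way to feel the difference between join types: in the data above, Sneha has no bonus row, so a left join keeps her (with Bonus as NaN) while an inner join drops her entirely.

```python
left = pd.merge(df, bonus, on='Name', how='left')    # 4 rows: Sneha kept, Bonus is NaN
inner = pd.merge(df, bonus, on='Name', how='inner')  # 3 rows: Sneha dropped
print(left.shape, inner.shape)  # (4, 5) (3, 5)
```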
Grouping & Aggregation (group by city, see average salary)
```python
grouped = df.groupby('City').agg({
    'Salary': ['mean', 'max', 'count'],
    'Age': 'mean'
}).round(0)
print(grouped)
#           Salary                  Age
#             mean      max count  mean
# City
# Airoli    850000   850000     1  28.0
# Mumbai   1200000  1200000     1  32.0
# ...
```
Pivot Tables — like Excel pivot (summarize fast)
```python
pivot = pd.pivot_table(df, values='Salary', index='City',
                       columns=None, aggfunc=['mean', 'count'])
print(pivot)
```
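One handy extra when summarizing: margins=True adds an overall totals row, useful as a quick sanity check. A small sketch on the same df (the margins_name label is just illustrative):

```python
pivot_all = pd.pivot_table(df, values='Salary', index='City',
                           aggfunc=['mean', 'count'],
                           margins=True, margins_name='All_Cities')
print(pivot_all)  # one extra row with the overall mean and count
```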
3. Data Cleaning & Wrangling (Missing Values, Outliers, Duplicates)
Real data is messy — 60–80% of time here.
Missing values (NaN)
```python
# Check
df.isnull().sum()

# Options:
df_clean = df.dropna()                            # drop rows with any NaN
df['Age'] = df['Age'].fillna(df['Age'].median())  # fill with median (avoid inplace on a single column)
df = df.fillna({'Salary': 0, 'City': 'Unknown'})  # column-specific fills
```
Duplicates
```python
df.duplicated().sum()
df = df.drop_duplicates(subset=['Name'], keep='first')
```
Outliers (extreme values that distort)
Common methods:
- IQR method (robust)
```python
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
df_no_out = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
```
- Z-score (for normal-ish data): |z| > 3 → outlier
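A minimal z-score sketch (assumes roughly normal data; plain pandas, no SciPy needed):

```python
# Z-score per row: how many standard deviations away from the mean
z = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers_z = df[z.abs() > 3]     # flagged as outliers
df_no_out_z = df[z.abs() <= 3]   # keep the rest
```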
Other wrangling:
- Type conversion: df['Age'] = df['Age'].astype(int)
- String cleaning: df['City'] = df['City'].str.strip().str.lower()
- Rename: df.rename(columns={'Salary': 'Annual_Salary'}, inplace=True)
4. Exploratory Data Analysis (EDA) Workflow
EDA = get to know your data deeply before modeling.
Standard workflow (follow this every time):
- Load & Inspect
```python
df = pd.read_csv('your_data.csv')  # or read_excel, read_json, etc.
df.head(10)
df.info()      # types, missing
df.describe()  # stats summary
df.shape
```
- Check Quality
- Missing: df.isnull().sum() / len(df) * 100 (% missing)
- Duplicates: df.duplicated().sum()
- Cardinality: df.nunique() (how many unique values per column)
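Putting those three checks together in one small sketch (column names are whatever your dataset has):

```python
missing_pct = df.isnull().sum() / len(df) * 100   # % missing per column
dupes = df.duplicated().sum()                     # number of fully duplicated rows
cardinality = df.nunique()                        # unique values per column

print(missing_pct.sort_values(ascending=False).head())
print(f"duplicate rows: {dupes}")
print(cardinality.sort_values().head())
```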
- Univariate Analysis (one variable)
- Numerical: histogram, boxplot, density
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['Salary'], kde=True)
plt.show()

sns.boxplot(x=df['Age'])
```
- Categorical: value_counts, barplot
```python
df['City'].value_counts().plot(kind='bar')
```
- Bivariate / Multivariate
- Correlation: df.corr(numeric_only=True) → heatmap
```python
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
```
- Scatter: sns.scatterplot(x='Age', y='Salary', hue='City', data=df)
- Pairplot: sns.pairplot(df, hue='City') (powerful overview)
- Feature Insights & Questions
- Groupbys: average salary by city/age bin
- Cross-tabs: pd.crosstab(df['City'], pd.cut(df['Age'], bins=5))
- Derive features: df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50, 100]) (see the sketch after this list)
- Document Findings — notebook markdown cells: “Salary skewed right — many low earners, few high. Outliers above ₹20L possible CEOs.”
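A minimal sketch of those feature-insight moves on the small salary DataFrame from earlier (the bin edges are just illustrative):

```python
# Bin ages (edges are illustrative), then summarize
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50, 100])

print(df.groupby('City')['Salary'].mean())       # average salary by city
print(df.groupby('Age_Group')['Salary'].mean())  # average salary by age bin

# Cross-tab: head-count per city per age bin
print(pd.crosstab(df['City'], df['Age_Group']))
```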
Realistic mini-EDA example (imagine we have a Mumbai housing CSV — common in India DS jobs)
Columns: Price, Area_sqft, Bedrooms, Location, Age_years, etc.
Typical discoveries:
- Price highly right-skewed → log transform for modeling (see the sketch after this list).
- Strong corr between Area & Price (0.85).
- Navi Mumbai cheaper than South Mumbai.
- Missing in Location → impute with mode or group by pincode.
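For the skew point, a minimal log-transform sketch. The housing DataFrame and its prices below are hypothetical, synthesized just to illustrate the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical (synthetic) housing prices, just to show the transform
housing = pd.DataFrame({'Price': [4_500_000, 8_500_000, 12_000_000, 22_000_000, 95_000_000]})

housing['Log_Price'] = np.log1p(housing['Price'])  # log1p = log(1 + x), safe at 0
print(round(housing['Price'].skew(), 2), round(housing['Log_Price'].skew(), 2))
```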
That’s Chapter 4 — the daily grind that makes or breaks projects!
Practice tip: Download Titanic from Kaggle (https://www.kaggle.com/c/titanic/data — train.csv), or India house prices (search Kaggle “House Price India”), and run full EDA.
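A minimal first pass on Titanic to get you moving (assumes you've downloaded train.csv into your working directory; the numbers in comments are rough expectations, not guarantees):

```python
import pandas as pd

titanic = pd.read_csv('train.csv')  # downloaded from the Kaggle Titanic page
print(titanic.shape)                # expect roughly (891, 12)
print(titanic.isnull().sum())       # Age and Cabin carry most of the missing values
print(titanic['Survived'].value_counts(normalize=True))  # survival rate around 38%
```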
