Chapter 4: Data Manipulation & Analysis
Data Manipulation & Analysis, explained in full detail like we’re in a cozy café in Airoli — laptop open, me pointing at code cells in Jupyter Notebook, and explaining step by step why each technique matters for real data science work in 2026. This chapter is where theory turns practical: 70–80% of a data scientist’s time is spent here (cleaning, shaping, understanding data), not fancy modeling.
We’ll use NumPy for fast array math and Pandas for tabular data (the bread-and-butter tools). Then cleaning, and finally a full EDA workflow with a realistic example.
Setup note (do this now if not already):
```bash
pip install numpy pandas jupyter
```
Or use Anaconda. Work in Jupyter Notebook/Lab for interactive magic.
1. NumPy (Arrays, Broadcasting, Vectorization)
NumPy = Numerical Python. It’s the foundation under Pandas, Scikit-learn, PyTorch — everything fast in data science.
Why NumPy instead of Python lists? Lists are slow for math on thousands/millions of numbers. NumPy uses C under the hood → 10–100x faster.
ndarray (n-dimensional array) — core object.
```python
import numpy as np

# Create arrays
arr = np.array([10, 20, 30, 40])              # 1D
matrix = np.array([[1, 2], [3, 4], [5, 6]])   # 2D

print(arr.shape)     # (4,)
print(matrix.shape)  # (3, 2)
print(matrix.ndim)   # 2

# Special arrays (super common)
zeros = np.zeros((3, 4))          # 3×4 matrix of 0s
ones = np.ones((2, 5))            # useful for initialization
arange = np.arange(0, 100, 5)     # like range(): 0, 5, 10, ..., 95
linspace = np.linspace(0, 1, 11)  # 11 evenly spaced points from 0.0 to 1.0

# Math operations — element-wise (vectorized!)
arr2 = arr * 2       # [20, 40, 60, 80] — no loops!
arr3 = np.sqrt(arr)  # square root of each element
```
Broadcasting — magic that lets you operate on arrays of different shapes (without copying data).
Rule: shapes are compared from the rightmost dimension; each pair of dimensions must either match or be 1 (the 1 gets stretched to fit).
Example (very common in ML feature scaling):
```python
prices = np.array([1200000, 850000, 2200000, 450000])  # house prices in Navi Mumbai
mean_price = prices.mean()       # scalar
centered = prices - mean_price   # broadcasting: subtract scalar from array

# Matrix example
X = np.random.randn(100, 3)   # 100 samples, 3 features
means = X.mean(axis=0)        # shape (3,) — mean per column
X_centered = X - means        # broadcasts means across 100 rows
```
Vectorization — avoid Python loops; do everything array-wise.
Slow bad way:
```python
squared = []
for x in arr:
    squared.append(x ** 2)
```
Fast good way:
```python
squared = arr ** 2  # vectorized, blazing fast
```
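Want to see the speed gap yourself? Here's a minimal timing sketch (exact numbers depend on your machine, but the vectorized version is typically tens of times faster):

```python
import time
import numpy as np

big = np.arange(1_000_000)

start = time.perf_counter()
squared_loop = [x ** 2 for x in big]   # pure-Python loop
loop_time = time.perf_counter() - start

start = time.perf_counter()
squared_vec = big ** 2                 # vectorized NumPy
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.3f}s  vectorized: {vec_time:.4f}s")
```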
In 2026, NumPy's array model is still the mental model behind PyTorch tensors and other GPU libraries — learn it well.
2. Pandas (DataFrames, Series, Indexing, Merging, Grouping, Pivot Tables)
Pandas = Excel on steroids + Python power.
Series — 1D labeled array (like column).
DataFrame — 2D table with labeled rows/columns.
```python
import pandas as pd

# Create from dict (common)
data = {
    'Name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'Age': [28, 32, 25, 41],
    'City': ['Airoli', 'Mumbai', 'Pune', 'Thane'],
    'Salary': [850000, 1200000, 650000, 1800000]
}
df = pd.DataFrame(data)
print(df)
#     Name  Age    City   Salary
# 0  Rahul   28  Airoli   850000
# 1  Priya   32  Mumbai  1200000
# ...
```
Indexing & Selection (many ways — know these!)
```python
# Columns
df['Age']               # Series
df[['Name', 'Salary']]  # DataFrame subset

# loc (label-based) — preferred
df.loc[1, 'City']          # 'Mumbai'
df.loc[:, 'Age':'Salary']  # all rows, columns from Age to Salary

# iloc (integer position)
df.iloc[0:2, 0:3]  # first 2 rows, first 3 columns

# Boolean masking (filtering — super powerful)
high_salary = df[df['Salary'] > 1000000]
young_in_mumbai = df[(df['Age'] < 30) & (df['City'] == 'Mumbai')]
```
Merging / Joining (like SQL joins)
```python
# Suppose another DF with bonus info
bonus = pd.DataFrame({
    'Name': ['Rahul', 'Priya', 'Amit'],
    'Bonus': [50000, 120000, 30000]
})

# Merge on 'Name'
df_merged = pd.merge(df, bonus, on='Name', how='left')  # left join
# how='inner', 'outer', 'right' also available
```
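A quick way to feel the difference between join types: in the data above, Sneha has no bonus row, so a left join keeps her (with Bonus as NaN) while an inner join drops her entirely.

```python
left = pd.merge(df, bonus, on='Name', how='left')    # 4 rows: Sneha kept, Bonus is NaN
inner = pd.merge(df, bonus, on='Name', how='inner')  # 3 rows: Sneha dropped
print(left.shape, inner.shape)  # (4, 5) (3, 5)
```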
Grouping & Aggregation (group by city, see average salary)
```python
grouped = df.groupby('City').agg({
    'Salary': ['mean', 'max', 'count'],
    'Age': 'mean'
}).round(0)
print(grouped)
#           Salary                  Age
#             mean      max count  mean
# City
# Airoli    850000   850000     1  28.0
# Mumbai   1200000  1200000     1  32.0
# ...
```
Pivot Tables — like Excel pivot (summarize fast)
```python
pivot = pd.pivot_table(df, values='Salary', index='City',
                       columns=None, aggfunc=['mean', 'count'])
print(pivot)
```
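One handy extra when summarizing: margins=True adds an overall totals row, useful as a quick sanity check. A small sketch on the same df (the margins_name label is just illustrative):

```python
pivot_all = pd.pivot_table(df, values='Salary', index='City',
                           aggfunc=['mean', 'count'],
                           margins=True, margins_name='All_Cities')
print(pivot_all)  # one extra row with the overall mean and count
```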
3. Data Cleaning & Wrangling (Missing Values, Outliers, Duplicates)
Real data is messy — 60–80% of time here.
Missing values (NaN)
```python
# Check
df.isnull().sum()

# Options:
df_clean = df.dropna()                            # drop rows with any NaN
df['Age'] = df['Age'].fillna(df['Age'].median())  # fill with median (avoid inplace on a single column)
df = df.fillna({'Salary': 0, 'City': 'Unknown'})  # column-specific fills
```
Duplicates
```python
df.duplicated().sum()
df = df.drop_duplicates(subset=['Name'], keep='first')
```
Outliers (extreme values that distort)
Common methods:
- IQR method (robust)
```python
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['Salary'] < lower) | (df['Salary'] > upper)]
df_no_out = df[(df['Salary'] >= lower) & (df['Salary'] <= upper)]
```
- Z-score (for normal-ish data): |z| > 3 → outlier
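A minimal z-score sketch (assumes roughly normal data; plain pandas, no SciPy needed):

```python
# Z-score per row: how many standard deviations away from the mean
z = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers_z = df[z.abs() > 3]     # flagged as outliers
df_no_out_z = df[z.abs() <= 3]   # keep the rest
```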
Other wrangling:
- Type conversion: df['Age'] = df['Age'].astype(int)
- String cleaning: df['City'] = df['City'].str.strip().str.lower()
- Rename: df.rename(columns={'Salary': 'Annual_Salary'}, inplace=True)
4. Exploratory Data Analysis (EDA) Workflow
EDA = get to know your data deeply before modeling.
Standard workflow (follow this every time):
- Load & Inspect
```python
df = pd.read_csv('your_data.csv')  # or read_excel, read_json, etc.
df.head(10)
df.info()      # types, missing
df.describe()  # stats summary
df.shape
```
- Check Quality
- Missing: df.isnull().sum() / len(df) * 100 (% missing)
- Duplicates: df.duplicated().sum()
- Cardinality: df.nunique() (how many unique values per column)
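Putting those three checks together in one small sketch (column names are whatever your dataset has):

```python
missing_pct = df.isnull().sum() / len(df) * 100   # % missing per column
dupes = df.duplicated().sum()                     # number of fully duplicated rows
cardinality = df.nunique()                        # unique values per column

print(missing_pct.sort_values(ascending=False).head())
print(f"duplicate rows: {dupes}")
print(cardinality.sort_values().head())
```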
- Univariate Analysis (one variable)
- Numerical: histogram, boxplot, density
```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.histplot(df['Salary'], kde=True)
plt.show()

sns.boxplot(x=df['Age'])
```
- Categorical: value_counts, barplot
```python
df['City'].value_counts().plot(kind='bar')
```
- Bivariate / Multivariate
- Correlation: df.corr(numeric_only=True) → heatmap
```python
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
```
- Scatter: sns.scatterplot(x='Age', y='Salary', hue='City', data=df)
- Pairplot: sns.pairplot(df, hue='City') (powerful overview)
- Feature Insights & Questions
- Groupbys: average salary by city/age bin
- Cross-tabs: pd.crosstab(df['City'], pd.cut(df['Age'], bins=5))
- Derive features: df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50, 100]) (see the sketch after this list)
- Document Findings — notebook markdown cells: “Salary skewed right — many low earners, few high. Outliers above ₹20L possible CEOs.”
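A minimal sketch of those feature-insight moves on the small salary DataFrame from earlier (the bin edges are just illustrative):

```python
# Bin ages (edges are illustrative), then summarize
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40, 50, 100])

print(df.groupby('City')['Salary'].mean())       # average salary by city
print(df.groupby('Age_Group')['Salary'].mean())  # average salary by age bin

# Cross-tab: head-count per city per age bin
print(pd.crosstab(df['City'], df['Age_Group']))
```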
Realistic mini-EDA example (imagine we have a Mumbai housing CSV — common in India DS jobs)
Columns: Price, Area_sqft, Bedrooms, Location, Age_years, etc.
Typical discoveries:
- Price highly right-skewed → log transform for modeling (see the sketch after this list).
- Strong corr between Area & Price (0.85).
- Navi Mumbai cheaper than South Mumbai.
- Missing in Location → impute with mode or group by pincode.
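For the skew point, a minimal log-transform sketch. The housing DataFrame and its prices below are hypothetical, synthesized just to illustrate the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical (synthetic) housing prices, just to show the transform
housing = pd.DataFrame({'Price': [4_500_000, 8_500_000, 12_000_000, 22_000_000, 95_000_000]})

housing['Log_Price'] = np.log1p(housing['Price'])  # log1p = log(1 + x), safe at 0
print(round(housing['Price'].skew(), 2), round(housing['Log_Price'].skew(), 2))
```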
That’s Chapter 4 — the daily grind that makes or breaks projects!
Practice tip: Download Titanic from Kaggle (https://www.kaggle.com/c/titanic/data — train.csv), or India house prices (search Kaggle “House Price India”), and run full EDA.
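A minimal first pass on Titanic to get you moving (assumes you've downloaded train.csv into your working directory; the numbers in comments are rough expectations, not guarantees):

```python
import pandas as pd

titanic = pd.read_csv('train.csv')  # downloaded from the Kaggle Titanic page
print(titanic.shape)                # expect roughly (891, 12)
print(titanic.isnull().sum())       # Age and Cabin carry most of the missing values
print(titanic['Survived'].value_counts(normalize=True))  # survival rate around 38%
```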
