Chapter 5: Pandas Read CSV
Pandas – Reading CSV Files (the most common first real task)
Almost every data project starts with:
```python
import pandas as pd

df = pd.read_csv("some_file.csv")
```
But in real life, very few CSVs are clean and simple. That’s why we need to understand the most important options and how to troubleshoot.
1. The most basic read (when everything is perfect)
```python
# Folder structure example:
# project/
# └── data/
#     └── sales_2025.csv
df = pd.read_csv("data/sales_2025.csv")

# Quick checks right after reading
print(df.shape)
df.head(6)
df.info()
df.columns
```
This works when:
- File is in current working directory or you give correct path
- First row contains column names
- Separator is comma (,)
- No strange encoding
- No extra blank lines at top/bottom
But in reality, these conditions are rarely all true at once.
2. The realistic “good starting command” most people use
```python
df = pd.read_csv(
    "data/sales_jan_2025.csv",
    sep=",",                         # default anyway
    encoding="utf-8",                # most common modern encoding
    on_bad_lines="warn",             # don't crash on bad rows (pandas 1.3+)
    low_memory=False,                # avoids mixed-type (DtypeWarning) warnings
    dtype_backend="numpy_nullable",  # modern pandas (2.0+), better null handling
)
```
Even better version (very common in 2025):
```python
df = pd.read_csv(
    "data/customers.csv",
    sep=",",
    encoding="utf-8",
    encoding_errors="replace",   # replace bad characters instead of crashing
    on_bad_lines="skip",         # skip bad rows (or "warn", "error")
    dtype={
        'customer_id': 'string',  # prevent float64 for IDs
        'phone': 'string',
        'pincode': 'string',
        # note: 'order_date' is handled by parse_dates below,
        # so don't also list it here
    },
    parse_dates=['order_date'],   # try to convert these to datetime
    dayfirst=False,               # False = MM/DD/YYYY, True = DD/MM/YYYY
    na_values=['', 'NA', 'N/A', '-', 'null', 'None'],  # common missing markers
)
```
3. Most common real-world problems & how to fix them
Problem 1 – Wrong separator (very common!)
Many files use `;`, tab, or `|` instead of `,`
```python
# European style often uses ;
df = pd.read_csv("data/eu_sales.csv", sep=";")

# Tab separated
df = pd.read_csv("data/export.txt", sep="\t")

# Pipe separated
df = pd.read_csv("data/database_dump.csv", sep="|")
```
Quick way to check separator:
```python
with open("data/sales.csv", "r", encoding="utf-8") as f:
    print(f.readline())
```
Look at the first line — what character separates the fields?
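If you would rather not eyeball it, Python's built-in `csv.Sniffer` can usually guess the delimiter from a small sample of the file. A minimal sketch (the file path is hypothetical, and sniffing can fail on very irregular files, so keep the manual check as a fallback):

```python
import csv

def detect_separator(path, encoding="utf-8", sample_size=4096):
    """Guess the delimiter by sniffing the first few KB of the file."""
    with open(path, "r", encoding=encoding) as f:
        sample = f.read(sample_size)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;|\t")
    return dialect.delimiter

# sep = detect_separator("data/sales.csv")          # hypothetical path
# df = pd.read_csv("data/sales.csv", sep=sep)
```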
Problem 2 – First few rows are not data (header, comments, title)
```python
# Skip 2 lines at the top
df = pd.read_csv("data/report.csv", skiprows=2)

# Skip specific lines (e.g. lines 1, 3, 8)
df = pd.read_csv("data/report.csv", skiprows=[0, 2, 7])

# Use row 3 as header (0-based)
df = pd.read_csv("data/report.csv", header=2)  # row 3 becomes columns

# No header at all → columns are numbered 0, 1, 2, ...
df = pd.read_csv("data/no_header.csv", header=None)
```
Problem 3 – Missing values written in strange ways
```python
# Tell pandas what should be considered NaN
df = pd.read_csv(
    "data/survey.csv",
    na_values=["-", "N.A.", "missing", "?", "nil", "no data", ""],
)
```
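After setting `na_values`, it is worth confirming the markers really became NaN. A small self-contained sketch (the sample data here is made up, standing in for a file like `data/survey.csv`; `StringIO` lets `read_csv` read from a string, which is also handy for experiments):

```python
import pandas as pd
from io import StringIO

# Hypothetical sample mimicking a messy survey file
raw = "name,score\nAsha,90\nRavi,-\nMeena,missing\n"

df = pd.read_csv(StringIO(raw), na_values=["-", "missing"])

# Confirm the markers became real NaN
print(df.isna().sum())
```

If the count of missing values per column matches what you see in the raw file, the markers were converted correctly; if a column still shows zero missing, you probably forgot one of the markers.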
Problem 4 – Dates are read as strings
```python
# Best way: let pandas try to parse them
df = pd.read_csv("data/orders.csv", parse_dates=['order_date', 'ship_date'])

# If it fails, read as string first, then convert
df = pd.read_csv("data/orders.csv", dtype={'order_date': 'string'})
df['order_date'] = pd.to_datetime(
    df['order_date'], format='%d-%m-%Y', errors='coerce'
)
```
Problem 5 – Large files → memory issues or slow loading
```python
# Read only some columns
df = pd.read_csv(
    "data/big_file_2025.csv",
    usecols=['order_id', 'customer_id', 'amount', 'order_date', 'status'],
)

# Read in chunks (very useful for huge files, a few GB or more)
chunk_iter = pd.read_csv("data/huge_file.csv", chunksize=100000)

for chunk in chunk_iter:
    # process each chunk
    print(chunk.shape)
    # ... do something ...
```
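Chunked reading pairs nicely with filtering while you read, so only the rows you care about ever accumulate in memory. A sketch under the assumption that the file has a `status` column (the file path, column name, and `"completed"` value are all hypothetical):

```python
import pandas as pd

def load_completed_orders(path, chunksize=100000):
    """Read a huge CSV in chunks, keep only completed orders, combine at the end."""
    kept = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        kept.append(chunk[chunk["status"] == "completed"])
    return pd.concat(kept, ignore_index=True)
```

The key design point: each chunk is filtered before it is stored, so peak memory is roughly one chunk plus the (hopefully much smaller) filtered result, not the whole file.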
4. Real-life example – reading a messy Indian sales CSV
Let’s pretend we have a file sales_data.csv with these issues:
- Separator is ;
- First line is title
- Second line is column description
- Dates are in DD-MM-YYYY
- Missing values are `-` or blank
- Customer ID and phone should stay as string
- File uses windows-1252 encoding (common in older Indian systems)
```python
df = pd.read_csv(
    "sales_data.csv",
    sep=";",
    skiprows=2,                    # skip title + description
    encoding="windows-1252",
    parse_dates=['Invoice Date'],
    date_format="%d-%m-%Y",        # important!
    na_values=["-", "", "N/A"],
    dtype={
        'Customer ID': 'string',
        'Mobile': 'string',
        'PIN Code': 'string',
        'GSTIN': 'string',
    },
    on_bad_lines="warn",
)

# After reading
df['Invoice Date'] = pd.to_datetime(df['Invoice Date'], errors='coerce')

print(df.info())
df.head(8)
```
5. Quick checklist when reading a CSV
- Look at the file first (open in text editor or head -n 5 file.csv in terminal)
- What is the separator?
- How many header rows to skip?
- Are there dates? In which format?
- What do missing values look like?
- Which columns should not become float? (IDs, phone, pincode…)
- What encoding? (try utf-8 → utf-8-sig → windows-1252 → latin1; latin1 accepts any byte sequence, so it always "succeeds" but may mangle characters — keep it as a last resort)
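The encoding step of this checklist can be automated. A minimal sketch that tries encodings in order and returns the first one that decodes cleanly (the default list and its order are my suggestion, not a universal rule; latin1 goes last because it never raises a decode error):

```python
import pandas as pd

def read_csv_any_encoding(path, encodings=("utf-8", "utf-8-sig", "windows-1252", "latin1")):
    """Try each encoding in order; return the first DataFrame that decodes cleanly."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue  # bytes didn't fit this encoding, try the next one
    raise ValueError(f"none of {encodings} could decode {path!r}")
```

Even when this succeeds, spot-check a few text columns: an encoding that decodes without errors can still produce wrong characters (latin1 in particular).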
6. Your turn – small practice task
Try to write the read_csv command for these situations:
A. File uses comma, but first 3 lines are comments starting with #
B. File has no header, columns should be named id, name, score, dept
C. File uses tab separator, dates are YYYY/MM/DD, missing = NULL
D. You only want columns order_id, amount, customer_city, order_date
Write your versions — then we can check together.
Where do you want to go next?
- How to inspect & understand the data after reading
- Dealing with messy column names (spaces, capitals, special chars)
- Fixing wrong data types after reading
- Handling very large CSVs (chunks, dask, filtering while reading)
- Reading multiple CSV files at once (common pattern)
- Common encoding nightmares & how to detect them
Tell me which one you want to dive into next — I’ll explain slowly with real examples. 😊
