Chapter 6: Pandas Read JSON
What is JSON and why do we read it with pandas?
JSON (JavaScript Object Notation) is one of the most common data formats today — especially when:
- Data comes from APIs (REST APIs, web services)
- Data is exported from NoSQL databases (MongoDB, Firebase, etc.)
- Data is stored in modern log systems, configuration files, scraped data, etc.
JSON can be:
- One big object { … }
- An array of objects: [ {…}, {…}, {…} ] ← this is the most common case for pandas
- Nested objects, arrays inside objects, etc.
pandas is excellent at turning the array-of-objects style into a clean DataFrame.
1. The most common & clean case – array of objects
File users.json:
```json
[
  {
    "name": "Priya Sharma",
    "age": 28,
    "city": "Bangalore",
    "active": true,
    "scores": [85, 92, 78],
    "joined": "2023-05-12"
  },
  {
    "name": "Rahul Verma",
    "age": 34,
    "city": "Hyderabad",
    "active": false,
    "scores": [67, 71],
    "joined": "2021-11-03"
  },
  {
    "name": "Ananya Roy",
    "age": 22,
    "city": "Pune",
    "active": true,
    "scores": [94, 88, 96, 91],
    "joined": "2024-01-19"
  }
]
```
Reading it:
```python
import pandas as pd

df = pd.read_json("users.json")

# Quick checks
print(df.shape)
df.head()
df.info()
```
Result:
```text
           name  age       city  active            scores      joined
0  Priya Sharma   28  Bangalore    True      [85, 92, 78]  2023-05-12
1   Rahul Verma   34  Hyderabad   False          [67, 71]  2021-11-03
2    Ananya Roy   22       Pune    True  [94, 88, 96, 91]  2024-01-19
```
→ pandas automatically turns the array into rows, keys into columns
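The same mechanic works without a file on disk. A minimal sketch, feeding the array-of-objects JSON from an in-memory string (pandas 2.x expects a path or file-like object, so the string is wrapped in `StringIO`):

```python
import io
import pandas as pd

# Array of objects as a raw JSON string (inline stand-in for users.json)
raw = '[{"name": "Priya", "age": 28}, {"name": "Rahul", "age": 34}]'

df = pd.read_json(io.StringIO(raw))

print(df.shape)          # (2, 2)
print(list(df.columns))  # ['name', 'age']
```

Each object becomes a row, each key a column, exactly as with the file version.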
2. Most useful & realistic read_json() options
```python
df = pd.read_json(
    "users.json",
    orient="records",            # explicit; matches an array of objects
    dtype={
        "age": "Int64",          # nullable integer (allows NaN)
        "active": "boolean",     # pandas nullable boolean
        "joined": "string",      # we'll convert to datetime later
    },
    convert_dates=["joined"],    # try to convert these columns to datetime
    encoding="utf-8",
    encoding_errors="replace",
    lines=False,                 # False = whole file is one JSON, True = JSON Lines
)
```
A very common modern version (pandas 2.x style):
```python
df = pd.read_json(
    "data/api_response.json",
    orient="records",
    dtype_backend="numpy_nullable",   # modern pandas 2.0+ style
    convert_dates=["timestamp", "created_at"],
    date_unit="ms",                   # if dates are milliseconds since epoch
    encoding="utf-8",
)
```
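To see what `dtype_backend="numpy_nullable"` buys you, here is a small sketch with inline data instead of a real API response: an integer column containing `null` keeps a nullable dtype, so the missing entry stays `<NA>` rather than the whole column being forced to plain `float64`.

```python
import io
import pandas as pd

# Inline JSON with a null in an otherwise-integer column
raw = '[{"id": "a", "qty": 5}, {"id": "b", "qty": null}]'

df = pd.read_json(io.StringIO(raw), dtype_backend="numpy_nullable")

print(df["qty"].dtype)  # a nullable dtype; the null shows up as <NA>
print(df["qty"].isna().sum())
```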
3. Important variations of JSON structure
Case 1: JSON Lines format (very common in logs & big data)
File events.jsonl (one object per line):
```json
{"event": "login", "user_id": "u101", "time": "2025-02-07T09:12:34Z"}
{"event": "purchase", "user_id": "u456", "time": "2025-02-07T09:15:11Z", "amount": 1499}
{"event": "logout", "user_id": "u101", "time": "2025-02-07T10:45:22Z"}
```
Read it:
```python
df = pd.read_json("events.jsonl", lines=True)
```
→ lines=True is the key here
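A quick round trip shows the format: `to_json(orient="records", lines=True)` writes one object per line, and `read_json(..., lines=True)` reads it back. A sketch using an in-memory buffer instead of a real log file:

```python
import io
import pandas as pd

df = pd.DataFrame({"event": ["login", "logout"], "user_id": ["u101", "u101"]})

buf = io.StringIO()
df.to_json(buf, orient="records", lines=True)  # writes JSON Lines

buf.seek(0)
df2 = pd.read_json(buf, lines=True)            # reads it back, row per line

assert df.equals(df2)
```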
Case 2: Single object with nested data (split / index / columns orient)
File stats.json:
```json
{
  "index": ["2025-01", "2025-02", "2025-03"],
  "columns": ["sales", "profit", "orders"],
  "data": [
    [340000, 92000, 145],
    [410000, 118000, 172],
    [389000, 105000, 158]
  ]
}
```
```python
df = pd.read_json("stats.json", orient="split")
```
→ orient="split" is specifically for this structure
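The split layout is also what `to_json(orient="split")` produces, so it round-trips a DataFrame including its index. A minimal sketch (inline data, non-date index labels so automatic axis conversion does not change them):

```python
import io
import pandas as pd

df = pd.DataFrame(
    [[340000, 92000], [410000, 118000]],
    index=["q1", "q2"],
    columns=["sales", "profit"],
)

s = df.to_json(orient="split")             # {"columns": ..., "index": ..., "data": ...}
df2 = pd.read_json(io.StringIO(s), orient="split")

assert df.equals(df2)                      # index and columns both survive
```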
Case 3: Top-level keys as rows (orient="index")
File by_city.json:
```json
{
  "Bangalore": {"sales": 560000, "orders": 230, "avg_rating": 4.32},
  "Hyderabad": {"sales": 420000, "orders": 180, "avg_rating": 4.15},
  "Pune": {"sales": 310000, "orders": 135, "avg_rating": 4.08}
}
```
```python
df = pd.read_json("by_city.json", orient="index")
df = df.reset_index().rename(columns={"index": "city"})
```
4. Common real-world problems & fixes
Problem 1: Dates are strings
```python
# Best solution: let pandas try during read
df = pd.read_json("orders.json", convert_dates=["order_date", "ship_date"])

# If it fails, do it manually after reading
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
```
Problem 2: Nested objects / lists inside cells
```json
{
  "name": "Priya",
  "address": {
    "street": "MG Road 12",
    "city": "Bangalore",
    "pin": "560001"
  },
  "orders": [
    {"id": 1001, "amount": 2499},
    {"id": 1007, "amount": 899}
  ]
}
```
→ pandas keeps them as dict/list — you need to normalize/flatten later
```python
import json
import pandas as pd

# Most common solution: json_normalize
# Note: json_normalize takes Python objects (list/dict), not a file path,
# so load the file with the json module first
with open("users_nested.json", encoding="utf-8") as f:
    data = json.load(f)

df_flat = pd.json_normalize(data)
```
Problem 3: Very large JSON file
```python
# Read in chunks (requires JSON Lines, so lines=True)
# read_json with chunksize returns an iterator of DataFrames, not a DataFrame
reader = pd.read_json("huge_logs.jsonl", lines=True, chunksize=50000)

for chunk in reader:
    # process chunk by chunk
    print(chunk.shape)
    # ... do filtering, aggregation ...
```
5. Quick reference – most useful read_json() patterns
| Situation | Command / Option |
|---|---|
| Array of objects (most common) | `pd.read_json("file.json")` |
| One JSON object per line | `pd.read_json("file.jsonl", lines=True)` |
| Dates should be parsed | `convert_dates=["date_col1", "date_col2"]` |
| Keep IDs as strings | `dtype={"id": "string", "phone": "string"}` |
| Nested data → flat table | `pd.json_normalize(data)` |
| JSON with "index", "columns", "data" | `orient="split"` |
| Top-level keys become rows | `orient="index"` |
| Large file (JSON Lines) | `chunksize=100000` with `lines=True` |
6. Small realistic practice task
Create a file products.json with this content (or copy-paste into a text file):
```json
[
  {"id": "p001", "name": "Wireless Mouse", "price": 899, "stock": 45, "category": "Electronics"},
  {"id": "p002", "name": "Coffee Mug", "price": 299, "stock": 120, "category": "Home"},
  {"id": "p003", "name": "Notebook 15\"", "price": 74990, "stock": 8, "category": "Electronics"}
]
```
Try these:
- Read it normally
- Read it and convert price to float, stock to nullable int
- Add a column low_stock = stock < 20
- Sort by price descending
Then try the same file but saved as JSON Lines (one object per line).
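One possible solution sketch, with the data inlined so it runs standalone (product names simplified; swap in your products.json path once you've created the file):

```python
import io
import pandas as pd

products = """[
  {"id": "p001", "name": "Wireless Mouse", "price": 899, "stock": 45, "category": "Electronics"},
  {"id": "p002", "name": "Coffee Mug", "price": 299, "stock": 120, "category": "Home"},
  {"id": "p003", "name": "Laptop", "price": 74990, "stock": 8, "category": "Electronics"}
]"""

# price as float, stock as nullable int, id kept as string
df = pd.read_json(
    io.StringIO(products),  # use "products.json" here for the file version
    dtype={"id": "string", "price": "float64", "stock": "Int64"},
)

df["low_stock"] = df["stock"] < 20
df = df.sort_values("price", ascending=False)

print(df[["name", "price", "stock", "low_stock"]])
```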
Where would you like to go next?
- How to flatten nested JSON properly with json_normalize
- Combining multiple JSON files
- Dealing with deeply nested data (real API examples)
- Converting JSON → DataFrame → clean & analysis
- Common API → pandas workflow
- Handling invalid / broken JSON files
