Chapter 6: SQL for Data Science
SQL for Data Science, explained like we’re sitting together in Airoli (maybe at that quiet café near the station with good AC), laptop open, me running queries live on BigQuery or PostgreSQL while you follow along. I’ll teach this the way I’d teach a friend prepping for data roles in Mumbai/Hyderabad companies in 2026: very hands-on, with realistic examples (think e-commerce sales, UPI transactions, employee data — stuff you see in Indian startups/FAANG interviews), why each concept matters for DS work, and common pitfalls.
SQL is still non-negotiable in 2026 for data scientists — even with GenAI tools, you need to pull/clean/aggregate data yourself from warehouses (BigQuery, Snowflake, Redshift, Athena). Interviews test window functions, CTEs, optimization because that’s daily work.
We’ll use a simple but realistic schema (imagine this is your company’s database):
```sql
-- Tables for examples
-- customers (customer_id, name, city, join_date)
-- orders    (order_id, customer_id, order_date, amount, product_category, status)
-- products  (product_id, name, category, price)
```
1. Basics: SELECT, WHERE, ORDER BY, LIMIT
Start simple — but do it cleanly.
```sql
-- Basic select
SELECT customer_id, name, city
FROM customers;

-- Filter + sort + limit (very common for quick checks)
SELECT name, city, join_date
FROM customers
WHERE city = 'Airoli'             -- string literals use single quotes
  AND join_date >= '2025-01-01'
ORDER BY join_date DESC           -- newest first
LIMIT 10;                         -- top 10 recent joins
```
Pro tips (2026 style):
- Always alias columns if complex: SELECT name AS customer_name
- Use DATE_TRUNC('month', order_date) or EXTRACT(MONTH FROM order_date) for time grouping (BigQuery/PostgreSQL).
- Case-insensitive search: WHERE LOWER(city) = 'airoli'
Common interview twist: "Show top 5 customers by join date in Navi Mumbai area" → add WHERE city LIKE '%Navi%' OR city = 'Airoli'
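If you want to run this without a warehouse, here's a tiny sketch using an in-memory SQLite database via Python's sqlite3. The customer names and dates are invented sample data, just to see the filter + sort + limit pattern work end to end:

```python
import sqlite3

# In-memory toy version of the customers table (rows are made up for illustration)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT, city TEXT, join_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?)",
    [
        (1, "Asha", "Airoli", "2025-03-01"),
        (2, "Ravi", "Navi Mumbai", "2025-05-10"),
        (3, "Meena", "Pune", "2025-04-02"),
        (4, "Kiran", "Navi Mumbai", "2025-06-15"),
    ],
)

# "Top 5 customers by join date in the Navi Mumbai area"
rows = conn.execute(
    """
    SELECT name, city, join_date
    FROM customers
    WHERE city LIKE '%Navi%' OR city = 'Airoli'
    ORDER BY join_date DESC
    LIMIT 5
    """
).fetchall()
print(rows)  # newest joiners first; Pune is filtered out
```

ISO date strings ('YYYY-MM-DD') sort correctly as text, which is why the ORDER BY works here even though join_date is stored as TEXT.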
2. Joins (INNER, LEFT, RIGHT, FULL), Subqueries
Joins = combining tables. DS work: link orders to customers, products, etc.
```sql
-- INNER JOIN: only matching records
SELECT c.name, o.order_id, o.amount
FROM customers c
INNER JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.amount > 5000;

-- LEFT JOIN: all customers, even without orders (great for "inactive users")
SELECT c.name, c.city, o.order_id,
       COALESCE(o.amount, 0) AS order_amount
FROM customers c
LEFT JOIN orders o
  ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;  -- customers with NO orders!
```
RIGHT JOIN — opposite (all from right table). FULL OUTER JOIN — all from both (rare but useful for finding mismatches).
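To see the "finding mismatches" use of FULL OUTER JOIN in action: older SQLite (before 3.39) doesn't support FULL OUTER JOIN, so this sketch uses the classic emulation with two LEFT JOINs and a UNION; the tables and rows are made up, including a deliberately orphaned order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
INSERT INTO orders VALUES (101, 1, 500), (102, 99, 750);  -- customer 99 doesn't exist: orphan order
""")

# FULL OUTER JOIN emulation: LEFT JOIN in both directions, then UNION.
# Rows where either side is NULL are the mismatches.
rows = conn.execute("""
    SELECT c.customer_id AS cust_id, o.order_id
    FROM customers c LEFT JOIN orders o ON c.customer_id = o.customer_id
    UNION
    SELECT c.customer_id, o.order_id
    FROM orders o LEFT JOIN customers c ON c.customer_id = o.customer_id
""").fetchall()

mismatches = [r for r in rows if r[0] is None or r[1] is None]
print(mismatches)  # Ravi with no order, plus the orphan order
```

On BigQuery/PostgreSQL you'd just write FULL OUTER JOIN directly and filter WHERE c.customer_id IS NULL OR o.order_id IS NULL.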
Subqueries — query inside query (powerful but can be slow if misused).
```sql
-- Customers who spent more than the average customer's total spend
SELECT c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.name
HAVING SUM(o.amount) > (
    SELECT AVG(total)
    FROM (
        SELECT SUM(amount) AS total
        FROM orders
        GROUP BY customer_id
    ) avg_spend
);
```
Interview favorite: “Find customers who never ordered” → LEFT JOIN + IS NULL.
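Here's that interview favorite as a runnable sketch (in-memory SQLite, made-up rows): one customer has no matching order row, and the LEFT JOIN + IS NULL filter finds exactly them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meena');
INSERT INTO orders VALUES (101, 1, 500), (102, 3, 900);  -- Ravi never ordered
""")

# LEFT JOIN keeps every customer; unmatched ones get NULL on the orders side
never_ordered = conn.execute("""
    SELECT c.name
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL
""").fetchall()
print(never_ordered)
```

The same answer also works with NOT EXISTS, which optimizers often handle just as well and which reads closer to the business question.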
3. Aggregations, GROUP BY, HAVING
Aggregate = summarize.
```sql
-- Basic aggregates
SELECT COUNT(*) AS total_orders,
       SUM(amount) AS total_revenue,
       AVG(amount) AS avg_order_value,
       MIN(amount) AS smallest_order,
       MAX(amount) AS biggest_order
FROM orders;

-- Group by category (very common in DS reports)
SELECT product_category,
       COUNT(order_id) AS num_orders,
       SUM(amount) AS category_revenue,
       ROUND(AVG(amount), 2) AS avg_per_order
FROM orders
GROUP BY product_category
ORDER BY category_revenue DESC;

-- HAVING filters AFTER grouping (WHERE filters before)
SELECT customer_id,
       SUM(amount) AS total_spent
FROM orders
GROUP BY customer_id
HAVING SUM(amount) > 100000     -- high-value customers
   AND COUNT(order_id) >= 5     -- at least 5 orders
ORDER BY total_spent DESC
LIMIT 20;
```
Pitfall: You can't use a column alias in WHERE, and standard SQL doesn't allow it in HAVING either (MySQL permits that as an extension). ORDER BY is the one clause where aliases are safe everywhere.
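The portable pattern is to repeat the expression in WHERE and use the alias only in ORDER BY. A quick sketch with in-memory SQLite and invented amounts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 100), (2, 300), (3, 200);
""")

# Repeat the expression in WHERE; the alias works in ORDER BY
rows = conn.execute("""
    SELECT amount * 2 AS doubled
    FROM orders
    WHERE amount * 2 > 300      -- 'WHERE doubled > 300' would not be portable
    ORDER BY doubled DESC       -- alias is fine here
""").fetchall()
print(rows)  # only the rows whose doubled amount exceeds 300, largest first
```

If an expression gets long, a CTE is the cleaner fix: compute the alias once in the CTE, then filter on it by name in the outer query.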
4. Window Functions (ROW_NUMBER, RANK, LAG/LEAD)
Window functions = calculate across rows without collapsing (no GROUP BY needed). Game-changer for DS in 2026 — running totals, rankings, time-series diffs.
Syntax base: function() OVER (PARTITION BY … ORDER BY … [ROWS/RANGE])
```sql
-- ROW_NUMBER(): unique row number per partition
SELECT customer_id, order_date, amount,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS order_sequence
FROM orders
ORDER BY customer_id, order_date;

-- RANK() vs DENSE_RANK() vs ROW_NUMBER()
-- Suppose salaries in a department (RANK is a reserved word in some engines,
-- so don't use it bare as an alias)
SELECT name, department, salary,
       ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS row_num,            -- 1,2,3,... always unique
       RANK()       OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank,        -- 1,2,2,4 if ties
       DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_dense_rank   -- 1,2,2,3 (no gaps)
FROM employees;
```
LAG / LEAD — compare to previous/next row (huge for time-series, churn analysis, stock prices).
```sql
-- Previous order amount for each customer.
-- Careful: window functions can't appear in WHERE (WHERE runs before windows
-- are computed), so wrap the window step in a CTE and filter outside it.
WITH with_prev AS (
    SELECT customer_id, order_date, amount,
           LAG(amount, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount
    FROM orders
)
SELECT customer_id, order_date, amount,
       amount - prev_amount AS amount_change
FROM with_prev
WHERE amount - prev_amount > 0;  -- orders bigger than the previous one
```
```sql
-- Next order date (for repeat purchase analysis)
SELECT customer_id, order_date,
       LEAD(order_date, 1) OVER (PARTITION BY customer_id ORDER BY order_date) AS next_order_date,
       DATE_DIFF(
           LEAD(order_date, 1) OVER (PARTITION BY customer_id ORDER BY order_date),
           order_date, DAY
       ) AS days_to_next   -- DATE_DIFF is BigQuery; in PostgreSQL subtract the dates
FROM orders;
```
Real DS use: Calculate month-over-month growth, detect anomalies (sudden drop = possible churn).
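Here's the month-over-month pattern as a runnable sketch: aggregate to monthly revenue in a CTE, then use LAG to diff consecutive months (in-memory SQLite, invented order rows; SQLite has had window functions since 3.25):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_date TEXT, amount REAL);
INSERT INTO orders VALUES
    ('2025-01-05', 100), ('2025-01-20', 200),
    ('2025-02-10', 450), ('2025-03-02', 300);
""")

# Step 1: roll orders up to monthly revenue.
# Step 2: LAG(revenue) pulls the previous month's value onto each row.
rows = conn.execute("""
    WITH monthly AS (
        SELECT substr(order_date, 1, 7) AS month,   -- 'YYYY-MM'
               SUM(amount) AS revenue
        FROM orders
        GROUP BY month
    )
    SELECT month, revenue,
           revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
    FROM monthly
    ORDER BY month
""").fetchall()
print(rows)  # first month has NULL change (no previous month to compare)
```

On BigQuery you'd use DATE_TRUNC(order_date, MONTH) instead of substr; the LAG step is identical. A sudden large negative mom_change is exactly the anomaly/churn signal mentioned above.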
5. CTEs, Query Optimization Basics
CTEs (Common Table Expressions) — named temporary result sets. Cleaner than subqueries, easier to read/debug.
```sql
WITH high_value_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 50000
),
recent_orders AS (
    -- include order_id here so the outer query can count it
    SELECT o.order_id, o.customer_id, o.order_date, o.amount
    FROM orders o
    INNER JOIN high_value_customers hvc
      ON o.customer_id = hvc.customer_id
    WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY)
)
SELECT roc.customer_id, c.name,
       COUNT(roc.order_id) AS recent_orders_count,
       SUM(roc.amount) AS recent_revenue
FROM recent_orders roc
JOIN customers c ON roc.customer_id = c.customer_id
GROUP BY roc.customer_id, c.name
ORDER BY recent_revenue DESC;
```
Why CTEs win in 2026:
- Readable (name steps like “monthly_sales”, “top_products”)
- Reusable within query
- Better than nested subqueries
- Can be recursive (for hierarchies, like org chart)
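A quick sketch of that recursive case: walking an org chart top-down with WITH RECURSIVE (in-memory SQLite, invented four-level hierarchy; the anchor member picks the root, the recursive member joins each employee to their manager's row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
    (1, 'CEO', NULL),
    (2, 'VP Eng', 1),
    (3, 'Data Lead', 2),
    (4, 'Analyst', 3);
""")

chain = conn.execute("""
    WITH RECURSIVE org AS (
        -- anchor: the root of the hierarchy (no manager)
        SELECT emp_id, name, manager_id, 0 AS depth
        FROM employees
        WHERE manager_id IS NULL
        UNION ALL
        -- recursive step: everyone reporting to someone already in org
        SELECT e.emp_id, e.name, e.manager_id, org.depth + 1
        FROM employees e
        JOIN org ON e.manager_id = org.emp_id
    )
    SELECT name, depth FROM org ORDER BY depth
""").fetchall()
print(chain)  # each row one level deeper in the org chart
```

The same shape answers "all reports under manager X" by changing the anchor's WHERE to that manager's id.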
Query Optimization Basics (interview must-know)
- EXPLAIN / EXPLAIN ANALYZE — see plan (BigQuery: cost estimate, PostgreSQL: actual time).
- Look for: Full table scans (bad on big data), missing indexes.
- Filter early — WHERE before JOIN/GROUP.
- Indexes — on join keys (customer_id), WHERE columns (order_date, city).
- **Avoid SELECT \*** — fetch only the columns you need.
- Use APPROXIMATE functions if exact not needed (BigQuery: APPROX_COUNT_DISTINCT).
- Partitioned/clustered tables — speed up on date/city.
Interview question example: “Query slow on 1B rows orders table — how to speed up monthly revenue by city?”
Answer: Index on order_date + city, use partitioned table by month, filter date range early, use CTE for clarity.
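You can watch an index change the plan yourself. This sketch uses SQLite's EXPLAIN QUERY PLAN (output wording differs from BigQuery/PostgreSQL, but the scan-vs-index-search idea carries over; table and index names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, city TEXT, amount REAL)"
)

def plan(sql):
    # EXPLAIN QUERY PLAN rows: the last column is the human-readable detail
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

q = """SELECT city, SUM(amount)
       FROM orders
       WHERE order_date >= '2025-01-01'
       GROUP BY city"""

before = plan(q)  # no index yet: plan falls back to scanning the table
conn.execute("CREATE INDEX idx_orders_date_city ON orders(order_date, city)")
after = plan(q)   # range filter on order_date can now use the index

print("before:", before)
print("after: ", after)
```

In PostgreSQL the equivalent is EXPLAIN ANALYZE (look for "Seq Scan" turning into "Index Scan"); in BigQuery you'd compare bytes scanned on a date-partitioned table instead.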
That’s Chapter 6 — the SQL toolkit every data scientist needs!
Practice tip: Use free platforms like BigQuery Sandbox, Mode Analytics, StrataScratch, DataLemur, or LeetCode SQL — solve 50+ problems on windows/CTEs.
