Chapter 6: SQL for Data Science

SQL for Data Science, explained like we’re sitting together in Airoli (maybe at that quiet café near the station with good AC), laptop open, me running queries live on BigQuery or PostgreSQL while you follow along. I’ll teach this the way I’d teach a friend prepping for data roles in Mumbai/Hyderabad companies in 2026: very hands-on, with realistic examples (think e-commerce sales, UPI transactions, employee data — stuff you see in Indian startups/FAANG interviews), why each concept matters for DS work, and common pitfalls.

SQL is still non-negotiable in 2026 for data scientists — even with GenAI tools, you need to pull/clean/aggregate data yourself from warehouses (BigQuery, Snowflake, Redshift, Athena). Interviews test window functions, CTEs, optimization because that’s daily work.

We’ll use a simple but realistic schema (imagine this is your company’s database):

SQL

1. Basics: SELECT, WHERE, ORDER BY, LIMIT

Start simple — but do it cleanly.

SQL

Pro tips (2026 style):

  • Always alias columns if complex: SELECT name AS customer_name
  • Use DATE_TRUNC(‘month’, order_date) or EXTRACT(MONTH FROM order_date) for time grouping (BigQuery/PostgreSQL).
  • Case-insensitive search: WHERE LOWER(city) = ‘airoli’

Common interview twist: “Show top 5 customers by join date in Navi Mumbai area” → add WHERE city LIKE ‘%Navi%’ OR city = ‘Airoli’

2. Joins (INNER, LEFT, RIGHT, FULL), Subqueries

Joins = combining tables. DS work: link orders to customers, products, etc.

SQL

RIGHT JOIN — opposite (all from right table). FULL OUTER JOIN — all from both (rare but useful for finding mismatches).

Subqueries — query inside query (powerful but can be slow if misused).

SQL

Interview favorite: “Find customers who never ordered” → LEFT JOIN + IS NULL.

3. Aggregations, GROUP BY, HAVING

Aggregate = summarize.

SQL

Pitfall: Can’t use alias in WHERE (but can in HAVING or ORDER BY).

4. Window Functions (ROW_NUMBER, RANK, LAG/LEAD)

Window functions = calculate across rows without collapsing (no GROUP BY needed). Game-changer for DS in 2026 — running totals, rankings, time-series diffs.

Syntax base: function() OVER (PARTITION BY … ORDER BY … [ROWS/RANGE])

SQL

LAG / LEAD — compare to previous/next row (huge for time-series, churn analysis, stock prices).

SQL
SQL

Real DS use: Calculate month-over-month growth, detect anomalies (sudden drop = possible churn).

5. CTEs, Query Optimization Basics

CTEs (Common Table Expressions) — named temporary result sets. Cleaner than subqueries, easier to read/debug.

SQL

Why CTEs win in 2026:

  • Readable (name steps like “monthly_sales”, “top_products”)
  • Reusable within query
  • Better than nested subqueries
  • Can be recursive (for hierarchies, like org chart)

Query Optimization Basics (interview must-know)

  1. EXPLAIN / EXPLAIN ANALYZE — see plan (BigQuery: cost estimate, PostgreSQL: actual time).
    • Look for: Full table scans (bad on big data), missing indexes.
  2. Filter early — WHERE before JOIN/GROUP.
  3. Indexes — on join keys (customer_id), WHERE columns (order_date, city).
  4. **Avoid SELECT *** — only needed columns.
  5. Use APPROXIMATE functions if exact not needed (BigQuery: APPROX_COUNT_DISTINCT).
  6. Partitioned/clustered tables — speed up on date/city.

Interview question example: “Query slow on 1B rows orders table — how to speed up monthly revenue by city?”

Answer: Index on order_date + city, use partitioned table by month, filter date range early, use CTE for clarity.

That’s Chapter 6 — the SQL toolkit every data scientist needs!

Practice tip: Use free platforms like BigQuery Sandbox, Mode Analytics, StrataScratch, DataLemur, or LeetCode SQL — solve 50+ problems on windows/CTEs.

You may also like...

Leave a Reply

Your email address will not be published. Required fields are marked *