Chapter 12: MongoDB Aggregation Pipelines
MongoDB Aggregation Pipelines. β€οΈππ
This is not just another query. This is where MongoDB becomes a full-fledged data transformation & analytics engine inside the database β no need to pull millions of documents to your app and process them in Node.js/Python.
1. What Exactly is an Aggregation Pipeline? (Big Picture First)
Imagine your collection is a factory production line:
- Raw materials = documents coming in
- Each stage = one machine that does one specific job (filter, group, reshape, sort, join, calculateβ¦)
- The output of one machine β input to the next machine
- Final product = summarized, transformed, enriched data
In MongoDB terms:
|
0 1 2 3 4 5 6 7 8 9 10 11 |
db.collection.aggregate([ { $stage1: { ... } }, { $stage2: { ... } }, { $stage3: { ... } }, // ... as many stages as you need ]) |
- Order matters a lot (usually start with $match to filter early)
- Each stage processes documents one by one or in groups
- Very efficient β MongoDB can use indexes in early stages
- Can output to a new collection ($out, $merge), return cursor, or just show results
Think of it as MongoDB’s version of SQL GROUP BY + JOIN + HAVING + SELECT + ORDER BY + LIMIT β but way more flexible and composable.
2. Most Important Stages (The Building Blocks β Learn These First!)
Here are the top 10 stages you’ll use 95% of the time in real projects (2026 style):
| Stage | Purpose (one-liner) | SQL equivalent (approx) | When to use it (real life) | Position in pipeline (typical) |
|---|---|---|---|---|
| $match | Filter documents (like find()) | WHERE | Early β reduce data fast, use indexes | Usually stage 1 |
| $group | Group & calculate aggregates (sum, avg, countβ¦) | GROUP BY + aggregates | Core analytics: totals, averages per category | After $match |
| $project | Reshape documents (include/exclude/rename/new fields) | SELECT specific columns + expressions | Clean output, create computed fields | Near end |
| $sort | Sort documents | ORDER BY | Present results nicely | After group/project |
| $limit / $skip | Paginate results | LIMIT / OFFSET | Show page 3 of 10 results | End of pipeline |
| $lookup | Join with another collection (left outer join) | LEFT JOIN | Enrich data: add user info to orders | After $match |
| $unwind | Deconstruct array β one doc per array element | β | Handle arrays before grouping | Before $group on arrays |
| $addFields | Add new computed fields (like $project but keeps all) | β | Add calculated columns | Mid/late |
| $out / $merge | Write results to new/existing collection | CREATE TABLE AS SELECT | Materialized views, reporting tables | Last stage |
| $count | Count documents (after previous stages) | COUNT(*) | Quick total after filters | End |
3. Hands-on Example β Real Mini Project (Movie Reviews Analytics)
Let’s create a small dataset in mongosh (you can copy-paste):
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 |
use movieAnalytics2026 db.movies.insertMany([ { title: "RRR", year: 2022, genres: ["Action", "Drama"], rating: 8.1, country: "India", revenue: 1200, comments: [{user: "Rahul", text: "Epic!"}, {user: "Priya", text: "Loved it"}] }, { title: "Kalki 2898 AD", year: 2024, genres: ["Sci-Fi", "Action"], rating: 8.0, country: "India", revenue: 1100 }, { title: "Pushpa 2", year: 2025, genres: ["Action", "Thriller"], rating: 8.5, country: "India", revenue: 1500 }, { title: "Oppenheimer", year: 2023, genres: ["Drama", "History"], rating: 8.4, country: "USA", revenue: 950 } ]) |
Now β let’s build pipelines step by step.
Pipeline 1: Average rating per genre (basic group)
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
db.movies.aggregate([ { $unwind: "$genres" }, // one doc per genre { $group: { _id: "$genres", // group by genre avgRating: { $avg: "$rating" }, // average rating movieCount: { $sum: 1 }, // count movies totalRevenue: { $sum: "$revenue" } }}, { $sort: { avgRating: -1 } }, // highest rated first { $project: { genre: "$_id", averageRating: { $round: ["$avgRating", 2] }, moviesInGenre: "$movieCount", totalBoxOfficeCr: "$totalRevenue", _id: 0 }} ]) |
Sample output:
|
0 1 2 3 4 5 6 7 8 9 10 |
{ "genre": "Action", "averageRating": 8.2, "moviesInGenre": 3, "totalBoxOfficeCr": 3800 } { "genre": "Drama", "averageRating": 8.25, "moviesInGenre": 2, "totalBoxOfficeCr": 2150 } { "genre": "Sci-Fi", "averageRating": 8.0, "moviesInGenre": 1, "totalBoxOfficeCr": 1100 } { "genre": "Thriller", "averageRating": 8.5, "moviesInGenre": 1, "totalBoxOfficeCr": 1500 } { "genre": "History", "averageRating": 8.4, "moviesInGenre": 1, "totalBoxOfficeCr": 950 } |
Pipeline 2: Top countries by revenue + join user comments count
|
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
db.movies.aggregate([ { $match: { revenue: { $exists: true } } }, // only movies with revenue { $lookup: { from: "movies", // self-join example (or another collection) localField: "_id", foreignField: "_id", as: "commentDetails" }}, { $addFields: { commentCount: { $size: { $ifNull: ["$comments", []] } } }}, { $group: { _id: "$country", totalRevenueCr: { $sum: "$revenue" }, avgRating: { $avg: "$rating" }, movieCount: { $sum: 1 }, totalComments: { $sum: "$commentCount" } }}, { $sort: { totalRevenueCr: -1 } }, { $project: { country: "$_id", totalRevenueCr: 1, averageRating: { $round: ["$avgRating", 1] }, movies: "$movieCount", commentsReceived: "$totalComments", _id: 0 }} ]) |
4. Quick Reference Table β Your Pipeline Cheat Sheet
| Goal | Typical Stages Order | Key Operators Inside |
|---|---|---|
| Filter + group + sort | $match β $group β $sort | $avg, $sum, $count |
| Join + enrich | $match β $lookup β $unwind β $project | $lookup, $arrayElemAt |
| Paginated top 10 | $match β $group β $sort β $skip β $limit | $sort: { field: -1 } |
| Materialized report | … β $merge: { into: “reports” } | $merge / $out |
| Handle arrays before aggregate | $unwind β $group | $unwind (preserveNullAndEmptyArrays) |
5. Mini Exercise β Try These in mongosh Right Now!
- Find average rating per country
- Count movies per year (add $group: { _id: “$year”, count: { $sum: 1 } })
- Find movies with > 2 genres (use $size: “$genres”)
- Build a pipeline that shows only Indian movies sorted by revenue descending
Understood beta? Aggregation pipelines are what separate beginners from pros β once you master this, you can do almost anything with data inside MongoDB.
Next class β what excites you?
- Deep dive into $lookup (joins) with examples?
- $unwind + $group patterns for arrays?
- Atlas Search inside aggregation (vector + full-text)?
- Performance tips: indexes, $match early, explain plans?
- Or a small real project (sales dashboard, user analytics)?
Tell me β class is ready for the next level! πβ€οΈ
Any confusion so far? Ask anything β we’re in this together! π
