Chapter 43: R Data Set
This chapter is about R data sets (usually written “datasets” in the R world).
This topic sounds simple, but it’s actually very important — because almost every tutorial, book, course, YouTube video, and Stack Overflow answer starts with built-in data sets. Understanding them early saves you a lot of confusion later.
Let’s go slowly, like we’re sitting together in RStudio with two screens — whiteboard style, patient, real examples, common traps, and the 2026 reality.
1. What Actually is an “R Data Set”?
In R, a data set (written as dataset or data set) usually means:
A ready-made data object, most often a data frame (sometimes a tibble, matrix, table, time series, or list), that ships with R itself or with one of the packages you have installed.
These data sets exist so that:
- Teachers / books / tutorials can show examples without asking you to download files
- You can practice statistics, plotting, modeling immediately after installing R
- Package authors can show how their functions work using real(ish) data
You never have to download or open a file yourself: the data is stored inside the installed package, and R hands it to you as a regular object.
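To make this concrete, here is a small sketch that inspects iris like any other object. It assumes a default R session, where the datasets package is attached at startup; the same calls work on any other built-in data set.

```r
# iris ships with the datasets package, which R attaches at startup
class(iris)    # what kind of object is it?
dim(iris)      # number of rows and columns
names(iris)    # column names
str(iris)      # compact overview of every column

# Built-in data sets also come with a help page describing each variable:
# ?iris
```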
2. Two Kinds of Data Sets in R
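Type A: Lazy-loaded / available by name → The package stores its data in a lazy-load database, so an object like iris can be used as soon as the package is attached. The datasets package is attached at startup, so its data sets work in a fresh session with no extra step.
Type B: Loaded on demand → Packages that do not use lazy loading only give you the object after you explicitly call data(name) or data(name, package = "…")
Most famous ones belong to Type A. The data(iris) or data(mtcars) you see in almost every tutorial is a harmless habit: it copies the data set into your workspace, but the code runs without it.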
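Both access routes can be tried directly. The sketch below assumes a default session; the environment name `sandbox` is hypothetical and only there so the explicit load does not touch your workspace.

```r
# In a default session, mtcars can be used with no data() call at all
nrow(mtcars)

# data() copies a data set into an environment of your choice.
# `sandbox` is a made-up name used only for this demo.
sandbox <- new.env()
data("mtcars", package = "datasets", envir = sandbox)

ls(sandbox)                          # names now stored in sandbox
all.equal(sandbox$mtcars, mtcars)    # compare the copy with the built-in object
```

Loading into a separate environment is a handy way to experiment without cluttering the global workspace.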
3. How to See All Available Data Sets Right Now
Run this in your RStudio console:
```r
# Classic one-liner: lists the data sets in all currently attached packages
data()

# Lists the data sets in every installed package (slower, much longer list)
data(package = .packages(all.available = TRUE))
```
You’ll see hundreds, but only a couple of dozen show up again and again in teaching and tutorials.
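The listing can also be used programmatically instead of being read in a viewer. This is a minimal sketch that relies on data() returning an object whose `results` element is a character matrix with `Item` and `Title` columns; the variable names `listing` and `catalog` are just illustrative.

```r
# data() returns its listing as an object; the table lives in $results
listing <- data(package = "datasets")
catalog <- as.data.frame(listing$results, stringsAsFactors = FALSE)

nrow(catalog)                         # how many entries the datasets package lists
head(catalog[, c("Item", "Title")])   # name and one-line description

# Search the titles for a keyword, e.g. anything mentioning cars
catalog[grepl("car", catalog$Title, ignore.case = TRUE), c("Item", "Title")]
```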
4. The Most Famous & Most Used R Built-in Data Sets (2026 Edition)
| Data set | Package | Rows × Columns | What it contains | Most common use in tutorials |
|---|---|---|---|---|
| iris | datasets | 150 × 5 | Measurements of 3 iris flower species | Scatter plots, classification, clustering |
| mtcars | datasets | 32 × 11 | 1974 Motor Trend car data (mpg, hp, wt, cyl…) | Regression, correlation, t-tests |
| diamonds | ggplot2 | 53940 × 10 | Diamond prices & characteristics | ggplot2 teaching, large data examples |
| mpg | ggplot2 | 234 × 11 | Fuel economy data from 1999–2008 | Faceting, grouping, modern ggplot |
| gapminder | gapminder | 1704 × 6 | Life expectancy, GDP, population by country/year | Time series, animation, dplyr |
| Titanic | datasets | 4 × 2 × 2 × 2 (table, not a data frame) | Passenger counts by class, sex, age and survival | Contingency tables, mosaic plots, logistic regression on counts |
| AirPassengers | datasets | 144 values (monthly ts object) | Monthly airline passengers 1949–1960 | Time series, forecasting |
| faithful | datasets | 272 × 2 | Old Faithful geyser eruption times & waiting | Clustering, density plots |
| swiss | datasets | 47 × 6 | Swiss fertility & socio-economic indicators 1888 | PCA, regression |
| CO2 | datasets | 84 × 5 | Carbon dioxide uptake in grass plants | Nonlinear models, repeated measures |
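
Not every entry in this list is a data frame, so it is worth checking class and shape before using one. The sketch below only touches data sets from the datasets package; `describe` is a hypothetical helper written for this example, not a function from any package.

```r
# Report the main class and the shape of an object.
# `describe` is a made-up helper for this demo.
describe <- function(x) {
  shape <- if (is.null(dim(x))) length(x) else dim(x)
  paste0(class(x)[1], " [", paste(shape, collapse = " x "), "]")
}

sapply(
  list(iris = iris, mtcars = mtcars, Titanic = Titanic,
       AirPassengers = AirPassengers, faithful = faithful,
       swiss = swiss, CO2 = CO2),
  describe
)

# A table such as Titanic converts to a data frame of counts
head(as.data.frame(Titanic))
```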
5. How to Load & Use Them (Hands-on)
```r
# 1. iris – the most famous one
data(iris)    # load it (sometimes not needed, auto-loads)
head(iris)    # first 6 rows

# Quick modern look
library(skimr)
skim(iris)

# Classic scatter plot
plot(iris$Sepal.Length, iris$Petal.Length,
     col = iris$Species,
     pch = 19,
     main = "Iris – Sepal vs Petal Length")
```
```r
# 2. mtcars – most used for regression examples
data(mtcars)

# Quick correlation matrix
cor(mtcars[, c("mpg", "wt", "hp", "disp")])

# Linear model – classic example
model <- lm(mpg ~ wt + hp, data = mtcars)
summary(model)
```
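
A natural follow-up to fitting a model on mtcars is asking it for a prediction. This is a minimal sketch: the car below (3,000 lbs, 150 hp) is made up for illustration, and `new_car` is a hypothetical name.

```r
# Refit here so the block runs on its own
model <- lm(mpg ~ wt + hp, data = mtcars)

coef(model)                 # intercept plus slopes for wt and hp
summary(model)$r.squared    # share of mpg variance explained by the model

# In mtcars, wt is in 1000 lbs and hp is gross horsepower
new_car <- data.frame(wt = 3.0, hp = 150)
predict(model, newdata = new_car, interval = "confidence")
```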
```r
# 3. diamonds – large dataset for ggplot2
library(ggplot2)
data(diamonds)

# Very common teaching plot
ggplot(diamonds, aes(carat, price, color = clarity)) +
  geom_point(alpha = 0.3, size = 1) +
  scale_y_log10() +
  theme_minimal() +
  labs(title = "Diamond Price vs Carat by Clarity")
```
6. Common Beginner Traps & 2026 Tips
Trap 1 — Thinking data(iris) is always necessary
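→ This has nothing to do with RStudio or the tidyverse: packages that use lazy loading (datasets, ggplot2 and many others) make their data sets available by name as soon as the package is attached, so head(iris) works in a fresh session without data(iris).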
Trap 2 — Overwriting built-in names
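```r
iris <- read.csv("my_own_iris.csv")   # BAD: your copy now hides the built-in iris
```
Tip → use a different name: my_iris <- read.csv(…). The built-in data is not destroyed, only masked: rm(iris) makes it visible again, and datasets::iris always reaches the original.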
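The masking behind Trap 2 is easy to reproduce safely. The sketch below uses a tiny throwaway data frame instead of a real CSV file, so it runs anywhere; it assumes a default session with the datasets package attached.

```r
# Mask the built-in iris with a much smaller object (simulating the mistake)
iris <- data.frame(x = 1:3)
nrow(iris)              # this is your object, not the built-in one

# The original is still reachable through its package
nrow(datasets::iris)

# Removing your copy makes the built-in iris visible again
rm(iris)
nrow(iris)
```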
Trap 3 — Not knowing where a data set comes from
→ Always check: ?mtcars opens the help page, and its header names the source package, for example mtcars {datasets}.
Tip 2026: data(package = .packages()) lists the data sets in the packages that are attached right now.
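For Trap 3, the source package can also be checked from code rather than from the help page. A short sketch using only base R functions, assuming the datasets package is attached as it is by default:

```r
# Which attached package provides an object with this name?
find("mtcars")

# Confirm the name is listed among a specific package's data sets
"mtcars" %in% data(package = "datasets")$results[, "Item"]

# Reach the data set unambiguously, even if something masks it
head(datasets::mtcars, 3)
```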
Your Mini Practice Right Now
Copy this block — run it and play:
```r
# Load & explore mtcars
data(mtcars)

# Quick modern summary
library(skimr)
skim(mtcars)

# Scatter + regression line (classic)
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3, alpha = 0.85) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Miles per Gallon vs Weight by Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()
```
Now try these experiments:
- Change factor(cyl) to factor(gear) or factor(am)
- Add facet_wrap(~ cyl)
- Try data("diamonds") (ggplot2 must be loaded) and plot carat vs price
You just did real R statistics exploration using built-in data sets!
Feeling comfortable?
Next logical steps?
- Want to do first real t-test / regression on mtcars or iris?
- Learn how to import your own CSV / Excel as data set?
- Explore gapminder or palmerpenguins (very popular modern teaching data)?
- Or jump to first statistical test (t-test, correlation)?
Just tell me — whiteboard is ready! 📊🧮🚀
