Chapter 42: R Statistics Introduction
This chapter introduces statistics in R: not just a list of functions, but the big picture, the philosophy, the workflow people actually use in 2026, and plenty of hands-on examples you can copy and paste right now.
Think of this as our first proper “statistics class” together — calm, patient, no rush, whiteboard style.
1. Why R Became (and Still Is) the King of Statistics
R was literally born for statistics:
- Created in 1993 by Ross Ihaka and Robert Gentleman at the University of Auckland
- Designed as a free, open-source successor to the S language (Bell Labs, 1970s–80s)
- Goal: make statistical computing and graphics easy, flexible, and reproducible
In 2026, R remains a standard tool in:
- Academia (widely used across statistics, biostatistics, psychology, economics, and ecology programs)
- Pharma / clinical trials (CDISC standards, FDA submissions; R use keeps growing alongside SAS, the long-time incumbent)
- Government / official statistics (NSSO, RBI, UN, WHO reports)
- Market research / survey analysis
- Bioinformatics / genomics (Bioconductor ecosystem)
- Econometrics / finance (quant research, risk modeling)
Python is more popular overall, but R still wins when the main goal is classical statistics, publication-quality analysis, or reproducible research reports.
2. The Three Layers of R Statistics (How Real People Use It in 2026)
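Layer 1 – Base R + stats package (ships with every R installation, nothing extra to install)
- t.test(), wilcox.test(), chisq.test(), cor.test(), aov(), lm(), glm()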
Layer 2 – Classic / traditional packages (still heavily used)
- MASS, boot, survival, nlme, lme4, car, multcomp
Layer 3 – Modern tidyverse-style ecosystem (dominant among new users 2020–2026)
- tidymodels (modeling)
- easystats (insight, performance, parameters, report, see)
- rstatix (tidy t-tests, ANOVA, etc.)
- infer (modern simulation-based inference)
- broom + broom.mixed (tidy model outputs)
- gtsummary / modelsummary (beautiful tables)
- ggstatsplot / see (statistical visualizations)
Most people mix them:
- Quick test? → base t.test() or rstatix::t_test()
- Modeling? → lm() / glm() + broom + performance
- Reporting? → gtsummary or modelsummary + Quarto / R Markdown
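To make the "quick test" route concrete, here is a minimal base R sketch. It assumes we want to compare mpg between automatic and manual cars in mtcars (the am column). That comparison and the object name tt are illustrative choices, not something prescribed above.

```r
# Welch two-sample t-test in base R: no extra packages needed
data(mtcars)

# am is coded 0 = automatic, 1 = manual in mtcars
# t.test() uses the Welch version (unequal variances) by default
tt <- t.test(mpg ~ am, data = mtcars)

tt$estimate   # mean mpg in each group
tt$conf.int   # 95% confidence interval for the difference in means
tt$p.value    # p-value of the two-sided test
```

The result is an ordinary list, so every number you might want to report can be pulled out by name instead of being copied from printed output.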
3. Real Introductory Workflow – What a Typical First Analysis Looks Like
Let’s do a complete small analysis together — step by step — so you see the actual rhythm.
We’ll use the built-in mtcars dataset (classic — fuel efficiency of 1970s cars).
```r
# ───────────────────────────────────────────────────────────────
# 1. Load & quick look (always first step!)
# ───────────────────────────────────────────────────────────────

data(mtcars)

# Modern quick overview (highly recommended)
library(skimr)
skim(mtcars)

# Or classic
str(mtcars)
summary(mtcars)
```
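Since the later steps compare cars by number of cylinders, it helps to describe each group before plotting or testing. A minimal sketch using only base R follows. The names mpg_by_cyl and group_stats are my own, chosen for illustration.

```r
# Group-wise descriptive statistics for mpg by number of cylinders
data(mtcars)

# Split the mpg values into one vector per cylinder count
mpg_by_cyl <- split(mtcars$mpg, mtcars$cyl)

# Sample size, mean and standard deviation for each group
group_stats <- data.frame(
  cyl       = names(mpg_by_cyl),
  n         = sapply(mpg_by_cyl, length),
  mean_mpg  = sapply(mpg_by_cyl, mean),
  sd_mpg    = sapply(mpg_by_cyl, sd),
  row.names = NULL
)

print(group_stats)
```

Unequal group sizes and spreads are exactly the kind of thing worth noticing here, because they influence which test is sensible later.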
```r
# ───────────────────────────────────────────────────────────────
# 2. Visualization before any test (never skip this!)
# ───────────────────────────────────────────────────────────────

library(ggplot2)
library(ggstatsplot)

ggbetweenstats(
  data  = mtcars,
  x     = cyl,               # number of cylinders (bare column name, used as the grouping variable)
  y     = mpg,               # miles per gallon
  type  = "nonparametric",   # Kruskal-Wallis instead of ANOVA
  title = "Fuel Efficiency by Number of Cylinders",
  xlab  = "Cylinders",
  ylab  = "Miles per Gallon"
)
```
```r
# ───────────────────────────────────────────────────────────────
# 3. Classical one-way ANOVA (or non-parametric)
# ───────────────────────────────────────────────────────────────

# Base R
anova_result <- aov(mpg ~ factor(cyl), data = mtcars)
summary(anova_result)

# Modern tidy version (convert cyl to a factor first, then fit)
library(rstatix)
mtcars %>%
  convert_as_factor(cyl) %>%
  anova_test(mpg ~ cyl)

# Post-hoc if significant (Tukey HSD)
TukeyHSD(anova_result)
```
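The step 3 heading mentions a non-parametric route, so here is a minimal base R sketch of it. It assumes the Kruskal-Wallis test as the rank-based counterpart to one-way ANOVA, followed by Holm-adjusted pairwise Wilcoxon tests. The choice of follow-up and adjustment method is my assumption, not a fixed rule.

```r
# Non-parametric alternative to one-way ANOVA, base R only
data(mtcars)

# Kruskal-Wallis rank-sum test: do mpg distributions differ across cylinder groups?
kw <- kruskal.test(mpg ~ factor(cyl), data = mtcars)
print(kw)

# Pairwise Wilcoxon rank-sum tests with Holm correction for multiple comparisons
# exact = FALSE uses the normal approximation, which avoids warnings caused by tied values
pw <- pairwise.wilcox.test(
  x               = mtcars$mpg,
  g               = factor(mtcars$cyl),
  p.adjust.method = "holm",
  exact           = FALSE
)
print(pw)
```

This is the same family of test that ggbetweenstats() reports when type = "nonparametric" is set in step 2.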
```r
# ───────────────────────────────────────────────────────────────
# 4. Linear regression – does weight predict mpg?
# ───────────────────────────────────────────────────────────────

model <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Modern tidy summary
library(broom)
tidy(model, conf.int = TRUE)   # coefficients + CIs
glance(model)                  # R², AIC, etc.

# Diagnostic plots (super important!)
library(performance)
check_model(model)
```
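If broom and performance are not installed yet, roughly the same information is available from base R. The sketch below is a fallback under that assumption, not a replacement for the tidy workflow. The names coef_table and fit_stats are illustrative.

```r
# Base R counterparts to tidy(), glance() and check_model()
data(mtcars)
model <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

# Coefficients with 95% confidence intervals (similar to tidy(model, conf.int = TRUE))
coef_table <- cbind(estimate = coef(model), confint(model))
print(coef_table)

# A few model-level fit statistics (similar to part of glance(model))
fit_stats <- c(
  r_squared     = summary(model)$r.squared,
  adj_r_squared = summary(model)$adj.r.squared,
  aic           = AIC(model)
)
print(fit_stats)

# The four standard diagnostic plots in a 2 x 2 grid
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)   # restore the previous plotting layout
```

The base plots cover residuals vs fitted values, a normal Q-Q plot, scale-location, and leverage, which is a reasonable minimum before trusting the coefficients.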
```r
# ───────────────────────────────────────────────────────────────
# 5. Beautiful regression table for report
# ───────────────────────────────────────────────────────────────

library(modelsummary)

modelsummary(
  model,
  title = "Linear Model: MPG Explained by Weight, Horsepower & Cylinders",
  stars = TRUE,
  coef_rename = c(
    "wt"           = "Weight (1000 lbs)",
    "hp"           = "Horsepower",
    "factor(cyl)6" = "6 cylinders",
    "factor(cyl)8" = "8 cylinders"
  ),
  notes = "Data: mtcars dataset (1974 Motor Trend US magazine)"
)
```
4. Quick Summary – The 2026 Beginner-to-Intermediate Path
Week 1–2 → Descriptive stats + visualization (skimr, ggplot2, GGally::ggpairs())
Week 3–4 → Hypothesis tests (t.test, wilcox.test, chisq.test, cor.test) + rstatix
Week 5–8 → Linear models (lm, glm) + diagnostics (performance)
Week 9+ → Mixed models (lme4), tidymodels, Bayesian (brms), survival, etc.
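As a small taste of the Week 3–8 material, here is a hedged base R sketch that touches one correlation test, one rank-based test, and one generalized linear model on mtcars. The particular variable pairings (mpg with wt, mpg by am, am predicted from wt) are my own picks for illustration.

```r
# A preview of the learning path, base R only
data(mtcars)

# Weeks 3-4: Pearson correlation between fuel efficiency and weight
ct <- cor.test(mtcars$mpg, mtcars$wt)
ct$estimate   # correlation coefficient
ct$conf.int   # 95% confidence interval

# Weeks 3-4: Wilcoxon rank-sum test, mpg for automatic vs manual cars
# exact = FALSE uses the normal approximation because mpg contains ties
wx <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
wx$p.value

# Weeks 5-8: logistic regression, probability of a manual gearbox given weight
logit <- glm(am ~ wt, data = mtcars, family = binomial)
summary(logit)$coefficients

# Odds ratios are easier to read than log-odds
exp(coef(logit))
```

Notice that all three calls follow the same rhythm: fit, store the result in an object, then pull out the pieces you need.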
Final Teacher Advice
R statistics is not about memorizing 100 functions — it’s about learning one good workflow and reusing it:
- Look at data → visualize → describe
- Choose appropriate test/model
- Check assumptions (plots!)
- Report cleanly (tables + figures)
- Make everything reproducible (script + Quarto)
You already have the foundation — data frames, factors, plotting.
Want to continue?
- Do a full t-test + visualization together?
- Try logistic regression example?
- Learn correlation + scatter matrix?
- Or jump straight to reporting with gtsummary?
Just tell me — whiteboard is ready! 📈🧮🚀
