Chapter 34: R Factors
R factors are one of the most confusing but also most useful concepts in R, especially once you start doing real data analysis.
Many beginners hate factors at first (they cause strange surprises), but once you understand them, they become one of your best friends — especially for categorical data, statistical modeling, plotting with meaningful order, and avoiding silly mistakes.
I’ll explain it like we’re sitting together in RStudio, step-by-step, with lots of live examples, why factors exist, when they hurt you, when they save you, and the modern 2026 way to handle them.
1. What is a Factor? (The Honest, Simple Explanation)
A factor is R’s special way of storing categorical data — values that fall into a fixed set of categories (levels).
Examples of categorical data:
- City: “Hyd”, “Bng”, “Del”, “Mum”
- Gender: “Male”, “Female”, “Other”
- Rating: “Poor”, “Fair”, “Good”, “Excellent”
- Day of week: “Mon”, “Tue”, … “Sun”
- Yes/No answers: “Yes”, “No”
Internally, a factor is:
- An integer vector underneath (one code per value, each code pointing at a level)
- + an extra attribute called levels: the complete list of possible categories, and their order
So a factor has two parts:
- The values you see (e.g. "Hyd", "Bng"), which R stores as integer codes and prints as labels
- The levels (e.g. c("Bng", "Del", "Hyd", "Mum")), and this order matters!
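To see those two parts for yourself, here is a small sketch using only base R. The variable names `codes` and `rebuilt` are just illustrative.

```r
cities <- c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng")
city_factor <- factor(cities)

typeof(city_factor)   # "integer": the underlying storage type
class(city_factor)    # "factor"
levels(city_factor)   # "Bng" "Del" "Hyd" "Mum"

# The integer codes are positions in the levels vector
codes <- as.integer(city_factor)
codes                 # 3 1 2 4 3 1

# Looking each code up in the levels rebuilds the original labels
rebuilt <- levels(city_factor)[codes]
identical(rebuilt, cities)   # TRUE
```

This is why the order of the levels matters so much: the labels are only a lookup table for the codes.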
2. How to Create a Factor (Old vs New Way)
Old / classic way (still very common)

    cities <- c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng")

    city_factor <- factor(cities)

    print(city_factor)
    # [1] Hyd Bng Del Mum Hyd Bng
    # Levels: Bng Del Hyd Mum

→ Notice: levels are alphabetically sorted by default!
Modern / recommended way (2026 style)
You usually want control over the order of levels, so you explicitly set them:

    city_factor_ordered <- factor(cities,
                                  levels = c("Hyd", "Bng", "Del", "Mum", "Kol", "Chn"),
                                  ordered = FALSE)  # or TRUE for ordered factor

    print(city_factor_ordered)
    # [1] Hyd Bng Del Mum Hyd Bng
    # Levels: Hyd Bng Del Mum Kol Chn

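One consequence of setting levels explicitly is that levels with no observations are still part of the factor. A minimal sketch, where `city_levels` and `counts` are illustrative names:

```r
cities <- c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng")
city_levels <- c("Hyd", "Bng", "Del", "Mum", "Kol", "Chn")
city_factor_ordered <- factor(cities, levels = city_levels)

nlevels(city_factor_ordered)      # 6, including the two unused levels

# table() reports every level, in level order, even with zero counts
counts <- table(city_factor_ordered)
counts                            # Hyd 2, Bng 2, Del 1, Mum 1, Kol 0, Chn 0

is.ordered(city_factor_ordered)   # FALSE: custom level order, but not ordinal
```

Note that a custom level order does not make the factor ordinal. That only happens with ordered = TRUE.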
3. Why Does Order of Levels Matter? (The Real Power)
- Plotting — bars, boxplots, etc. appear in level order, not alphabetical

    # Without order control (alphabetical mess)
    barplot(table(city_factor))

    # With meaningful order
    barplot(table(city_factor_ordered))

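- Statistical modeling (lm, glm, aov, etc.): for an unordered factor, R's default treatment contrasts use the first level as the reference (baseline)

    # Assumes `students` is a data frame with a numeric `marks` column,
    # one row per element of city_factor_ordered
    model <- lm(marks ~ city_factor_ordered, data = students)
    summary(model)
    # "Hyd" is the first level, so it is the reference category
    # and has no coefficient of its own
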
- Ordered factors (ordinal data) — for things with natural order

    rating <- factor(c("Poor", "Good", "Excellent", "Fair", "Good"),
                     levels = c("Poor", "Fair", "Good", "Excellent"),
                     ordered = TRUE)

    print(rating)
    # [1] Poor Good Excellent Fair Good
    # Levels: Poor < Fair < Good < Excellent

→ Now R knows “Excellent” > “Good” > “Fair” > “Poor”
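Two of the effects above, the reference category in a model and comparisons on ordinal data, can be sketched as follows. The `students` data frame here is made up purely for illustration, and relevel() is one base-R way (not the only one) to pick a different baseline.

```r
# Hypothetical data, invented for this example
students <- data.frame(
  marks = c(72, 65, 80, 58, 90, 77),
  city  = factor(c("Hyd", "Bng", "Del", "Hyd", "Bng", "Del"),
                 levels = c("Hyd", "Bng", "Del"))
)

# The first level ("Hyd") is absorbed into the intercept
model <- lm(marks ~ city, data = students)
coef_names <- names(coef(model))
coef_names                   # "(Intercept)" "cityBng" "cityDel"

# relevel() moves the chosen level to the front, making it the new baseline
students$city_del <- relevel(students$city, ref = "Del")
levels(students$city_del)    # "Del" "Hyd" "Bng"

# Ordered factors support comparisons that follow the level order
rating <- factor(c("Poor", "Good", "Excellent", "Fair", "Good"),
                 levels = c("Poor", "Fair", "Good", "Excellent"),
                 ordered = TRUE)
good_or_better <- rating >= "Good"
good_or_better               # FALSE TRUE TRUE FALSE TRUE
as.character(max(rating))    # "Excellent"
```

The same comparison on an unordered factor returns NA with a warning, which is a handy reminder to set ordered = TRUE for ordinal data.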
4. Common Surprising Behaviors (Why Beginners Get Frustrated)
Surprise 1 — New category added later = NA

    new_cities <- c("Hyd", "Vsk", "Bng")  # Vsk = new city

    factor(new_cities, levels = levels(city_factor_ordered))
    # [1] Hyd <NA> Bng
    # Levels: Hyd Bng Del Mum Kol Chn

→ “Vsk” becomes NA because it wasn’t in the original levels!
Fix: always include all possible levels or use levels = union(…)
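A minimal sketch of that fix, assuming you want to keep the known levels first and append anything new at the end. The names `known_levels`, `unknown` and `all_levels` are illustrative.

```r
known_levels <- c("Hyd", "Bng", "Del", "Mum", "Kol", "Chn")
new_cities   <- c("Hyd", "Vsk", "Bng")

# Check which incoming values are not covered yet
unknown <- setdiff(new_cities, known_levels)
unknown                     # "Vsk"

# union() keeps the known levels in order and appends the new ones
all_levels <- union(known_levels, new_cities)
new_factor <- factor(new_cities, levels = all_levels)

anyNA(new_factor)           # FALSE: "Vsk" is now a valid level
levels(new_factor)          # "Hyd" "Bng" "Del" "Mum" "Kol" "Chn" "Vsk"
```

Checking `unknown` first is worth the extra line: a new value is sometimes a real new category and sometimes just a typo.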
Surprise 2 — Dropping unused levels

    droplevels(city_factor_ordered)
    # unused levels ("Kol", "Chn") stay in the factor until you drop them;
    # droplevels() removes every level with zero counts

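This usually bites after subsetting, because filtering the values does not touch the levels. A small sketch, where `south` and `south_clean` are illustrative names:

```r
city_factor <- factor(c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng"))

# Keep only two cities; the other levels are still attached
south <- city_factor[city_factor %in% c("Hyd", "Bng")]
levels(south)         # "Bng" "Del" "Hyd" "Mum"
table(south)          # Bng 2, Del 0, Hyd 2, Mum 0

# droplevels() removes the levels that no longer occur
south_clean <- droplevels(south)
levels(south_clean)   # "Bng" "Hyd"
```

Those zero-count levels are what show up as empty bars or empty table columns if you forget this step.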
Surprise 3 — Converting factor → character loses levels

    as.character(city_factor_ordered)
    # just plain text, no levels anymore

→ This is why you sometimes get surprises when exporting or joining data.
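A closely related conversion trap is worth a quick sketch: a factor whose labels look like numbers. Converting it straight to numeric returns the internal codes, not the labels. The `scores` vector is made up for illustration.

```r
scores <- factor(c("10", "20", "5", "20"))
levels(scores)                # "10" "20" "5"  (sorted as text, not as numbers)

# Direct conversion gives the integer codes
wrong <- as.numeric(scores)
wrong                         # 1 2 3 2

# Going through character first recovers the actual values
right <- as.numeric(as.character(scores))
right                         # 10 20 5 20
```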
5. Modern Best Practice in 2026 (Avoid Most Pain)
Rule #1: Avoid automatic factor conversion when reading data

    # Explicit version (works on any R version)
    sales <- read.csv("sales.csv", stringsAsFactors = FALSE)

    # On R >= 4.0.0 this is already the default, so no global option is needed
    sales <- read.csv("sales.csv")

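To see the difference without needing a file on disk, here is a sketch that reads a tiny inline CSV through the text argument of read.csv(). The data is made up for illustration.

```r
csv_text <- "city,sales\nHyd,120\nBng,95\nHyd,130"

# Strings stay as plain character vectors
sales_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(sales_chr$city)    # "character"

# Opting in converts every string column to a factor
sales_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(sales_fct$city)    # "factor"
levels(sales_fct$city)   # "Bng" "Hyd"  (alphabetical, probably not what you want)
```

Reading as character first and converting later keeps the choice of levels in your hands.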
Rule #2: Turn into factor only when needed, and control levels

    library(dplyr)  # provides mutate()

    # Best pattern
    df <- df |>
      mutate(
        city = factor(city,
                      levels = c("Hyd", "Bng", "Del", "Mum", "Kol"),
                      ordered = FALSE),
        satisfaction = factor(satisfaction,
                              levels = c("Very Dissatisfied", "Dissatisfied",
                                         "Neutral", "Satisfied", "Very Satisfied"),
                              ordered = TRUE)
      )

Rule #3: Use forcats package (from tidyverse) — makes factor handling beautiful

    library(forcats)

    # Reorder levels by frequency
    df$city <- fct_infreq(df$city)  # most common first

    # Reorder by another variable
    df$city <- fct_reorder(df$city, df$sales, .fun = mean)

    # Lump rare levels together
    df$city <- fct_lump_lowfreq(df$city, other_level = "Other")

    # Change level names
    df$city <- fct_recode(df$city, "Hyderabad" = "Hyd")

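If forcats is not installed, some of the same ideas can be roughly approximated in base R. This is only a sketch with made-up data, not a drop-in replacement for the forcats functions (ties and NA handling may differ).

```r
city  <- factor(c("Hyd", "Bng", "Hyd", "Del", "Hyd", "Bng"))
sales <- c(120, 95, 130, 60, 110, 105)

# Order levels by frequency, most common first
by_freq <- factor(city, levels = names(sort(table(city), decreasing = TRUE)))
levels(by_freq)     # "Hyd" "Bng" "Del"

# Order levels by the mean of another variable (ascending)
by_sales <- reorder(city, sales, FUN = mean)
levels(by_sales)    # "Del" "Bng" "Hyd"

# Rename a single level in place
renamed <- city
levels(renamed)[levels(renamed) == "Hyd"] <- "Hyderabad"
levels(renamed)     # "Bng" "Del" "Hyderabad"
```

The forcats versions read more clearly and handle the edge cases for you, which is the main reason to prefer them.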
6. Your Mini Practice Right Now (Copy → Run!)

    library(dplyr)  # provides mutate()

    # Create messy factor data
    responses <- data.frame(
      respondent = 1:6,
      mood = c("Happy", "Neutral", "Sad", "Happy", "Very Happy", "Neutral"),
      city = c("Hyd", "Bng", "Del", "Hyd", "Mum", "Hyd")
    )

    # Turn into proper ordered factor
    responses <- responses |>
      mutate(
        mood = factor(mood,
                      levels = c("Sad", "Neutral", "Happy", "Very Happy"),
                      ordered = TRUE),
        city = factor(city,
                      levels = c("Hyd", "Bng", "Del", "Mum"),
                      ordered = FALSE)
      )

    # Now look at summaries & plots
    summary(responses$mood)

    # Bar plot: notice correct order
    barplot(table(responses$mood), main = "Mood Distribution")

    # Cross-tabulate mood by city
    table(responses$city, responses$mood)

You just created meaningful factors with proper order — this is how real analyses look!
Quick Summary Cheat-Sheet
- Factor = categorical variable with fixed levels
- Levels have order — controls plotting & reference category
- Create → factor(x, levels = …, ordered = …)
- Modern → forcats package + mutate() + explicit levels
- Avoid auto-factors on import → stringsAsFactors = FALSE
- Use ordered = TRUE for ordinal data (low < medium < high)
- Common pain → new values become NA → always set full levels
Feeling clearer about factors now?
Next questions?
- Want to practice forcats tricks together (very powerful)?
- How factors behave in ggplot2 plots?
- Or move to subsetting data frames or dplyr joins?
Just tell me — whiteboard is ready! ☕📊🚀
