Chapter 34: R Factors

R factors are one of the most confusing but most useful concepts in R, especially once you start doing real data analysis.

Many beginners hate factors at first (they cause strange surprises), but once you understand them, they become one of your best friends — especially for categorical data, statistical modeling, plotting with meaningful order, and avoiding silly mistakes.

I’ll explain it like we’re sitting together in RStudio, step-by-step, with lots of live examples, why factors exist, when they hurt you, when they save you, and the modern 2026 way to handle them.

1. What is a Factor? (The Honest, Simple Explanation)

A factor is R’s special way of storing categorical data — values that fall into a fixed set of categories (levels).

Examples of categorical data:

  • City: “Hyd”, “Bng”, “Del”, “Mum”
  • Gender: “Male”, “Female”, “Other”
  • Rating: “Poor”, “Fair”, “Good”, “Excellent”
  • Day of week: “Mon”, “Tue”, … “Sun”
  • Yes/No answers: “Yes”, “No”

Internally, a factor is:

  • An integer vector underneath (one code per value, each code pointing at a level)
  • + an extra attribute called levels: the complete list of possible categories as text, and their order

So a factor has two parts:

  1. The values you see (e.g. “Hyd”, “Bng”), which are the level labels R prints for each code
  2. The levels (e.g. c("Bng", "Del", "Hyd", "Mum")), and this order matters!
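
To see both parts for yourself, here is a minimal sketch in base R. The city names are just sample data, and the variable name cities is my own choice:

```r
# A small factor built from made-up city codes
cities <- factor(c("Hyd", "Bng", "Del", "Hyd", "Mum"))

typeof(cities)        # "integer": the storage type underneath
levels(cities)        # "Bng" "Del" "Hyd" "Mum"
as.integer(cities)    # 3 1 2 3 4: each code is a position in levels()
as.character(cities)  # "Hyd" "Bng" "Del" "Hyd" "Mum": the labels you see
```

So printing a factor shows text, but what R actually stores is the integer codes plus the levels attribute.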

2. How to Create a Factor (Old vs New Way)

Old / classic way (still very common)
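
The original listing for this step is not available here, so the following is a stand-in sketch of the classic approach: call factor() on a character vector and let R pick the levels. The sample values are made up.

```r
# Classic approach: let factor() work out the levels on its own
cities <- c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng")
city_f <- factor(cities)

city_f          # prints the values, then "Levels: Bng Del Hyd Mum"
levels(city_f)  # "Bng" "Del" "Hyd" "Mum"
table(city_f)   # counts per level, shown in level order
```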

→ Notice: levels are alphabetically sorted by default!

Modern / recommended way (2026 style)

You usually want control over the order of levels, so you explicitly set them:
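
A minimal sketch of setting the order yourself, using the same made-up cities plus some sample ratings. The variable names are my own:

```r
cities <- c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng")

# Pass levels = ... to decide the order instead of accepting the sorted default
city_f <- factor(cities, levels = c("Hyd", "Bng", "Del", "Mum"))
levels(city_f)   # "Hyd" "Bng" "Del" "Mum"

# The same idea for ratings, from worst to best
ratings  <- c("Good", "Poor", "Excellent", "Fair", "Good")
rating_f <- factor(ratings, levels = c("Poor", "Fair", "Good", "Excellent"))
table(rating_f)  # counts appear in the order you chose
```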

3. Why Does Order of Levels Matter? (The Real Power)

  1. Plotting: bars, boxplots, etc. appear in level order, not alphabetical
  2. Statistical modeling (lm, glm, aov, etc.): with the default treatment contrasts, R treats the first level as the reference (baseline)
  3. Ordered factors (ordinal data): for things with a natural order

→ Now R knows “Excellent” > “Good” > “Fair” > “Poor”
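
The three points above can be sketched in base R like this. All data here is made up for illustration, and the plot call is left commented out so the snippet runs anywhere:

```r
ratings  <- c("Good", "Poor", "Excellent", "Good", "Fair", "Good")
rating_f <- factor(ratings, levels = c("Poor", "Fair", "Good", "Excellent"))

# 1. Plotting: table() counts in level order, and base barplot()
#    draws its bars in that same order
counts <- table(rating_f)
names(counts)      # "Poor" "Fair" "Good" "Excellent"
# barplot(counts)  # uncomment in an interactive session

# 2. Modeling: with default treatment contrasts the first level is the baseline
group <- factor(rep(c("A", "B", "C"), each = 2))
y     <- c(1.0, 1.2, 2.1, 1.9, 3.0, 3.3)
names(coef(lm(y ~ group)))    # "(Intercept)" "groupB" "groupC"

group_c <- relevel(group, ref = "C")  # make "C" the baseline instead
names(coef(lm(y ~ group_c)))  # "(Intercept)" "group_cA" "group_cB"

# 3. Ordered factors: comparisons follow the level order
rating_o <- factor(ratings,
                   levels  = c("Poor", "Fair", "Good", "Excellent"),
                   ordered = TRUE)
rating_o[1] > rating_o[2]     # TRUE, because "Good" > "Poor"
rating_o >= "Good"            # TRUE FALSE TRUE TRUE FALSE TRUE
max(rating_o)                 # Excellent
```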

4. Common Surprising Behaviors (Why Beginners Get Frustrated)

Surprise 1 — New category added later = NA
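
Here is a small sketch that reproduces the surprise. The city codes are sample data, and “Vsk” stands for a category that was not there when the factor was created:

```r
city_f <- factor(c("Hyd", "Bng", "Del"))
levels(city_f)      # "Bng" "Del" "Hyd"

# Assigning a value that is not one of the levels gives a warning
# ("invalid factor level, NA generated") and stores NA instead
city_f[2] <- "Vsk"

city_f              # Hyd <NA> Del
is.na(city_f)       # FALSE TRUE FALSE
levels(city_f)      # unchanged: "Bng" "Del" "Hyd"
```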

→ “Vsk” becomes NA because it wasn’t in the original levels!

Fix: always include all possible levels or use levels = union(…)
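
Two ways this fix could look in practice. This is a sketch of my reading of the advice, and the names all_cities and city_full are hypothetical:

```r
# Option 1: extend the levels first, then assign the new value
city_f <- factor(c("Hyd", "Bng", "Del"))
levels(city_f) <- union(levels(city_f), "Vsk")
city_f[2] <- "Vsk"
city_f              # Hyd Vsk Del (no NA this time)

# Option 2: declare every possible category up front
all_cities <- c("Hyd", "Bng", "Del", "Mum", "Vsk")
city_full  <- factor(c("Hyd", "Bng", "Del"), levels = all_cities)
city_full[2] <- "Vsk"
city_full           # Hyd Vsk Del, with all five levels available
```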

Surprise 2 — Dropping unused levels
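
A short sketch of what this looks like: after subsetting, the factor still remembers levels that no longer appear in the data, until you drop them. The data and the name south are made up:

```r
city_f <- factor(c("Hyd", "Bng", "Del", "Mum"))

# Keep only two cities
south <- city_f[city_f %in% c("Hyd", "Bng")]
levels(south)   # still "Bng" "Del" "Hyd" "Mum"
table(south)    # Del and Mum show up with a count of 0

# droplevels() removes the levels that are no longer used
south <- droplevels(south)
levels(south)   # "Bng" "Hyd"
```

Those empty levels are what produce zero-count rows in tables and empty slots in plots.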

Surprise 3 — Converting factor → character loses levels
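
A minimal sketch of the round trip, with made-up ratings: converting to character keeps the text but throws away the level set and its order.

```r
rating_f <- factor(c("Good", "Poor", "Excellent"),
                   levels = c("Poor", "Fair", "Good", "Excellent"))

rating_chr <- as.character(rating_f)
rating_chr                   # "Good" "Poor" "Excellent": plain text now
levels(rating_chr)           # NULL: a character vector has no levels

# Turning it back into a factor without levels = ... starts from scratch
levels(factor(rating_chr))   # "Excellent" "Good" "Poor" (sorted, "Fair" is gone)
```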

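→ This is why you sometimes get surprises when exporting or joining data: a CSV file or a character column does not carry the level order, so you have to set it again afterwards.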

5. Modern Best Practice in 2026 (Avoid Most Pain)

Rule #1: Avoid automatic factor conversion when reading data
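
A small sketch using read.csv() on inline text, so you can run it without a file. The CSV content is made up. Since R 4.0.0, stringsAsFactors = FALSE is already the default, so writing it out mainly protects scripts that may run on older versions:

```r
csv_text <- "city,rating
Hyd,Good
Bng,Poor
Del,Excellent"

# text = ... lets read.csv() parse a string instead of a file path
df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(df$city)    # "character"
class(df$rating)  # "character"
str(df)
```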

Rule #2: Turn into factor only when needed, and control levels
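
One way this rule could look in base R. The data frame and the name rating_levels are hypothetical: keep text columns as character, and convert only the column that needs an order.

```r
df <- data.frame(
  city   = c("Hyd", "Bng", "Del", "Hyd"),
  rating = c("Good", "Poor", "Excellent", "Fair"),
  stringsAsFactors = FALSE
)

# Convert only the column that needs it, with the full level set spelled out
rating_levels <- c("Poor", "Fair", "Good", "Excellent")
df$rating <- factor(df$rating, levels = rating_levels, ordered = TRUE)

# A typo in the data would have become NA, so check right away
stopifnot(!anyNA(df$rating))

class(df$city)          # still "character"
df[order(df$rating), ]  # rows sorted from Poor to Excellent
```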

Rule #3: Use the forcats package (from the tidyverse), which makes factor handling beautiful
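
A few forcats helpers as a quick taste. This sketch assumes forcats is installed (install.packages("forcats")), and the data is made up:

```r
library(forcats)  # assumption: forcats is installed

city_f <- factor(c("Hyd", "Bng", "Hyd", "Del", "Hyd", "Bng"))
levels(city_f)                        # "Bng" "Del" "Hyd"

# Move one level to the front (handy for choosing a baseline)
relevelled <- fct_relevel(city_f, "Del")
levels(relevelled)                    # "Del" "Bng" "Hyd"

# Order levels by how often they occur, most frequent first
by_freq <- fct_infreq(city_f)
levels(by_freq)                       # "Hyd" "Bng" "Del"

# Reverse the current level order
levels(fct_rev(by_freq))              # "Del" "Bng" "Hyd"

# Rename a level: new name on the left, old name on the right
renamed <- fct_recode(city_f, Hyderabad = "Hyd")
levels(renamed)                       # "Bng" "Del" "Hyderabad"
```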

6. Your Mini Practice Right Now (Copy → Run!)
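
The original practice listing is not available here, so this is a stand-in exercise built on a made-up survey data frame. It uses only base R, and every name in it is my own choice:

```r
# A tiny made-up survey
survey <- data.frame(
  city   = c("Hyd", "Bng", "Del", "Mum", "Hyd", "Bng"),
  rating = c("Good", "Poor", "Excellent", "Fair", "Good", "Excellent"),
  stringsAsFactors = FALSE
)

# City: unordered categories, but in the order we want to display them
survey$city <- factor(survey$city, levels = c("Hyd", "Bng", "Del", "Mum"))

# Rating: ordinal, so declare the order from worst to best
survey$rating <- factor(survey$rating,
                        levels  = c("Poor", "Fair", "Good", "Excellent"),
                        ordered = TRUE)

str(survey)
table(survey$city)                 # counts in our chosen city order
table(survey$rating)               # counts from Poor to Excellent

# Ordered factors allow comparisons
good_or_better <- survey[survey$rating >= "Good", ]
nrow(good_or_better)               # 4

# Cross-tabulate city by rating
table(survey$city, survey$rating)
```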

You just created meaningful factors with proper order — this is how real analyses look!

Quick Summary Cheat-Sheet

  • Factor = categorical variable with fixed levels
  • Levels have order — controls plotting & reference category
  • Create → factor(x, levels = …, ordered = …)
  • Modern → forcats package + mutate() + explicit levels
  • Avoid auto-factors on import → stringsAsFactors = FALSE (already the default since R 4.0.0)
  • Use ordered = TRUE for ordinal data (low < medium < high)
  • Common pain → new values become NA → always set full levels

Feeling clearer about factors now?

Next questions?

  • Want to practice forcats tricks together (very powerful)?
  • How factors behave in ggplot2 plots?
  • Or move to subsetting data frames or dplyr joins?

Just tell me — whiteboard is ready! ☕📊🚀
