Chapter 11: Natural Language Processing (NLP) Essentials

Natural Language Processing (NLP) Essentials, explained like we’re sitting together in Airoli on a quiet Saturday afternoon in late January 2026 — laptop open, me typing code while you follow along in your own notebook, filter coffee getting cold because we’re too excited about how far text understanding has come.

NLP in 2026 is no longer “nice to have” — it’s core infrastructure. In India especially (fintech, e-commerce, customer support at PhonePe, Jio, Zomato, Swiggy, banking apps), almost every company has chatbots, review analysis, ticket classification, fraud detection from transaction notes, or Hindi/Marathi sentiment from social media. Hugging Face has made state-of-the-art models accessible even to beginners, so we’ll lean heavily on it.

We’ll build intuition first, then code, then a small end-to-end project.

1. Text Preprocessing (Tokenization, Stemming, Lemmatization)

Raw text is messy — punctuation, case, contractions, typos, multiple languages. Preprocessing cleans it so models focus on meaning.

Tokenization — splitting text into tokens (words, subwords, characters).

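A minimal sketch, assuming NLTK (spaCy would work just as well):

```python
import nltk
nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("punkt_tab", quiet=True)  # needed on newer NLTK releases

from nltk.tokenize import word_tokenize

text = "Swiggy's delivery was late, but the biryani was amazing!"
print(word_tokenize(text))
# ['Swiggy', "'s", 'delivery', 'was', 'late', ',', 'but', 'the',
#  'biryani', 'was', 'amazing', '!']
```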

2026 reality: For transformers/BERT → use subword tokenization (WordPiece, BPE) — Hugging Face handles it automatically.

Stemming — chops words to root (fast, crude).

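A quick PorterStemmer demo (the classic NLTK stemmer; notice the output isn't always a real word):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "flies", "easily", "studies"]:
    print(word, "->", stemmer.stem(word))
# running -> run
# flies -> fli
# easily -> easili
# studies -> studi
```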

Lemmatization — reduces to dictionary form (slower, smarter, needs POS tag).

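A WordNet-based sketch; note the pos argument, because without it everything is treated as a noun:

```python
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)  # some NLTK versions ask for this too

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run   (verb)
print(lemmatizer.lemmatize("better", pos="a"))   # good  (adjective)
print(lemmatizer.lemmatize("mice"))              # mouse (noun is the default)
```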

Modern best practice (2026): For classical ML → lemmatize + remove stopwords/punctuation. For transformers → almost no manual preprocessing — just raw text (BERT tokenizer handles case, punctuation, subwords).

2. Bag of Words (BoW) & TF-IDF

Bag of Words — counts word occurrences (ignores order).

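A sketch with scikit-learn's CountVectorizer on three toy reviews:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the food was great",
    "the delivery was late",
    "great food, great service",
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)   # sparse document-term count matrix

print(vectorizer.get_feature_names_out())
# ['delivery' 'food' 'great' 'late' 'service' 'the' 'was']
print(X.toarray())
# [[0 1 1 0 0 1 1]
#  [1 0 0 1 0 1 1]
#  [0 1 2 0 1 0 0]]
```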

TF-IDF (Term Frequency × Inverse Document Frequency) keeps the BoW counts but down-weights common words (the, is) and up-weights rare but informative ones.

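Same toy corpus with TfidfVectorizer. The rough idea: weight(t, d) = tf(t, d) × log(N / df(t)); scikit-learn adds smoothing and L2-normalizes each row:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the food was great",
    "the delivery was late",
    "great food, great service",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

# In row 1, 'delivery'/'late' (appear in one doc) score higher than
# 'the'/'was' (appear in two of three docs).
for term, weight in zip(tfidf.get_feature_names_out(), X.toarray()[1]):
    print(f"{term:10s} {weight:.2f}")
```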

Use case: Text classification (spam, sentiment) with LogisticRegression or Naive Bayes — still strong baseline in 2026.

3. Word Embeddings (Word2Vec, GloVe)

Words → dense vectors (e.g., 300 dimensions) where similar meanings are close (king – man + woman ≈ queen).

Word2Vec (Google, 2013) — two flavors: CBOW (predict word from context), Skip-gram (predict context from word).

GloVe (Stanford) — factorizes a global word co-occurrence matrix.

In 2026 you rarely train embeddings from scratch — use pre-trained vectors.

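A sketch using gensim's downloader with 100-dimensional GloVe vectors (one of several pre-trained options it ships):

```python
import gensim.downloader as api

# Downloads once, then cached locally (this model is ~130 MB).
wv = api.load("glove-wiki-gigaword-100")

print(wv.most_similar("king", topn=3))
print(wv.similarity("good", "great"))

# The classic analogy: king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```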

FastText (Facebook) — adds subword (character n-gram) information, so it handles typos and Hindi/Marathi morphology better; see the sketch below.
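
A tiny gensim FastText demo of that subword trick; the three-sentence corpus is obviously illustrative:

```python
from gensim.models import FastText

sentences = [
    ["the", "delivery", "was", "amazing"],
    ["amazing", "food", "and", "quick", "service"],
    ["the", "app", "kept", "crashing"],
]
# min_count=1 so the toy corpus keeps every word.
model = FastText(sentences, vector_size=50, window=3, min_count=1, epochs=50)

# 'amazng' (typo) never appeared in training, but FastText composes a
# vector for it from its character n-grams:
print(model.wv.similarity("amazing", "amazng"))
```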

4. Transformers & BERT Basics (using Hugging Face)

Transformers (the 2017 "Attention Is All You Need" paper) swap recurrence for self-attention → parallel training and long-range dependencies. BERT (2018) — bidirectional, pre-trained on masked language modeling + next-sentence prediction → understands context deeply.

Hugging Face transformers library made this accessible — 2026 standard.

Install:

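A typical setup with the PyTorch backend (swap torch for tensorflow if that's your stack):

```bash
pip install transformers datasets accelerate torch
# scikit-learn is handy for evaluation metrics
pip install scikit-learn
```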

Sentiment analysis with pre-trained BERT (easiest start)

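The pipeline API gets you a working classifier in a few lines. With no model argument it downloads a default English sentiment model (and warns you about it); pin a specific model in anything serious:

```python
from transformers import pipeline

# No model specified -> a default English sentiment model is downloaded.
sentiment = pipeline("sentiment-analysis")

reviews = [
    "Delivery was quick and the food was still hot, loved it!",
    "The app crashed twice during payment, very frustrating.",
]
for review in reviews:
    print(sentiment(review))
# [{'label': 'POSITIVE', 'score': 0.99...}]
# [{'label': 'NEGATIVE', 'score': 0.99...}]
```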

Text classification (custom fine-tuning) — e.g., classify customer complaints as “Billing”, “Network”, “App”, “Fraud”.

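A minimal sketch of the Trainer workflow. The four inline tickets and the label set are illustrative placeholders; in practice you would load a few thousand labeled examples and hold out an eval split:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

labels = ["Billing", "Network", "App", "Fraud"]

# Toy placeholder data; replace with your real labeled tickets.
train_data = {
    "text": [
        "I was charged twice for the same recharge",
        "No signal in my area since this morning",
        "The app freezes on the login screen",
        "Someone withdrew money I never authorized",
    ],
    "label": [0, 1, 2, 3],
}
dataset = Dataset.from_dict(train_data)

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="complaint-classifier",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    logging_steps=10,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
```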

After training → trainer.predict() or pipeline("text-classification", model=model, tokenizer=tokenizer)

5. Sentiment Analysis & Text Classification Projects

Mini Project 1: Hindi/English Customer Review Sentiment (India-relevant)

Dataset ideas:

  • Flipkart / Amazon reviews (scrape or Kaggle multilingual)
  • Twitter/X Hindi sentiment (use our earlier X tools if needed)

Workflow:

  1. Load reviews
  2. Use ai4bharat/indic-bert or bert-base-multilingual-cased for Hindi+English
  3. Fine-tune on labeled subset (positive/neutral/negative)
  4. Deploy a simple Streamlit app (Chapter 5 style; see the sketch after this list):
    • Input box for review
    • Output: sentiment + confidence
    • Bonus: highlight keywords (attention visualization)
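
A minimal sketch of that app. The file name app.py is hypothetical, and the default sentiment pipeline is a stand-in for your fine-tuned checkpoint from step 3:

```python
# app.py; run with: streamlit run app.py
import streamlit as st
from transformers import pipeline

@st.cache_resource          # load the model once, not on every rerun
def load_model():
    # Stand-in: swap in your fine-tuned checkpoint, e.g.
    # pipeline("text-classification", model="path/to/checkpoint")
    return pipeline("sentiment-analysis")

st.title("Customer Review Sentiment")
review = st.text_area("Paste a customer review (Hindi or English):")

if st.button("Analyze") and review.strip():
    result = load_model()(review)[0]
    st.write(f"Sentiment: {result['label']} | Confidence: {result['score']:.2f}")
```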

Mini Project 2: Complaint Ticket Classifier

Classes: Network Issue, Billing, App Crash, Fraud Alert, Other

  • Use RoBERTa or DistilBERT (faster)
  • Add domain keywords to the candidate labels if going zero-shot (see the sketch after this list)
  • Evaluate: F1-macro (imbalanced classes)
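
For the zero-shot route, a sketch with the standard zero-shot pipeline (facebook/bart-large-mnli is the usual default model):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

ticket = "My UPI payment failed but the amount was debited from my account"
candidate_labels = ["Network Issue", "Billing", "App Crash", "Fraud Alert", "Other"]

result = classifier(ticket, candidate_labels=candidate_labels)
# Labels come back sorted by score, highest first.
print(result["labels"][0], round(result["scores"][0], 2))
```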

2026 India tips:

  • Handle code-mixed Hindi-English (Hinglish) — IndicBERT or MuRIL (benchmark on IndicGLUE).
  • Low-resource → few-shot with SetFit or prompt-tuning.
  • Production → ONNX export or FastAPI + Hugging Face Inference API.

That’s Chapter 11 — you now have the full pipeline from raw text to deployed classifier!

Practice:

  • Run the sentiment pipeline on 10 recent Zomato/Swiggy reviews you copy-paste.
  • Fine-tune DistilBERT on a Kaggle sentiment dataset (e.g., Twitter US Airline Sentiment).
  • Build a quick Streamlit app that takes customer feedback and predicts category/sentiment.
