Chapter 11: Natural Language Processing (NLP) Essentials
Natural Language Processing (NLP) Essentials, explained like we’re sitting together in Airoli on a quiet Saturday afternoon in late January 2026 — laptop open, me typing code while you follow along in your own notebook, filter coffee getting cold because we’re too excited about how far text understanding has come.
NLP in 2026 is no longer “nice to have” — it’s core infrastructure. In India especially (fintech, e-commerce, customer support at PhonePe, Jio, Zomato, Swiggy, banking apps), almost every company has chatbots, review analysis, ticket classification, fraud detection from transaction notes, or Hindi/Marathi sentiment from social media. Hugging Face has made state-of-the-art models accessible even to beginners, so we’ll lean heavily on it.
We’ll build intuition first, then code, then a small end-to-end project.
1. Text Preprocessing (Tokenization, Stemming, Lemmatization)
Raw text is messy — punctuation, case, contractions, typos, multiple languages. Preprocessing cleans it so models focus on meaning.
Tokenization — splitting text into tokens (words, subwords, characters).
```python
# Simple split (never use in production)
text = "I'm learning NLP in Airoli, Maharashtra! #DataScience"
tokens = text.lower().split()
print(tokens)
# ["i'm", 'learning', 'nlp', 'in', 'airoli,', 'maharashtra!', '#datascience']

# Better: NLTK or spaCy
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text.lower())
print(tokens)
# ['i', "'m", 'learning', 'nlp', 'in', 'airoli', ',', 'maharashtra', '!', '#', 'datascience']
```
2026 reality: For transformers/BERT → use subword tokenization (WordPiece, BPE) — Hugging Face handles it automatically.
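Here's a quick look at what subword tokenization produces, as a minimal sketch assuming the transformers library is installed and using bert-base-uncased purely as an example checkpoint:

```python
# Minimal sketch: WordPiece subword tokenization via Hugging Face
# (bert-base-uncased is an illustrative choice, not the only option)
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("I'm learning NLP in Airoli!"))
# Rare or unseen words get split into pieces, with continuation pieces prefixed by '##'
```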
Stemming — chops words to root (fast, crude).
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))  # run
print(stemmer.stem("studies"))  # studi
print(stemmer.stem("better"))   # better (not great)
```
Lemmatization — reduces to dictionary form (slower, smarter, needs POS tag).
```python
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good
```
Modern best practice (2026): For classical ML → lemmatize + remove stopwords/punctuation. For transformers → almost no manual preprocessing — just raw text (BERT tokenizer handles case, punctuation, subwords).
2. Bag of Words (BoW) & TF-IDF
Bag of Words — counts word occurrences (ignores order).
```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love Mumbai weather",
    "Mumbai has great food but traffic is bad",
    "I hate traffic in Airoli"
]

vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
# ['airoli' 'bad' 'but' 'food' 'great' 'has' 'hate' 'in' 'is' 'love' 'mumbai' 'traffic' 'weather']
print(X_bow.toarray())
```
TF-IDF — Term Frequency × Inverse Document Frequency. It down-weights words that appear in almost every document (the, is) and up-weights rare but informative ones; conceptually, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is how many of them contain term t (scikit-learn's TfidfVectorizer adds smoothing on top of this).
```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X_tfidf = tfidf.fit_transform(docs)
print(X_tfidf.toarray())
```
Use case: Text classification (spam, sentiment) with LogisticRegression or Naive Bayes — still a strong baseline in 2026.
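To make that baseline concrete, here is a minimal sketch of TF-IDF features feeding a LogisticRegression classifier; the toy documents and labels are made up purely for illustration:

```python
# Classical baseline sketch: TF-IDF + Logistic Regression in a scikit-learn pipeline.
# The tiny corpus and labels below are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "I love Mumbai weather",
    "Great food, very happy with the service",
    "Mumbai traffic is bad",
    "I hate traffic in Airoli",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(TfidfVectorizer(stop_words='english'), LogisticRegression())
clf.fit(train_docs, train_labels)
print(clf.predict(["I love the weather"]))  # leans positive on this toy data
```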
3. Word Embeddings (Word2Vec, GloVe)
Words → dense vectors (e.g., 300 dimensions) where similar meanings are close (king – man + woman ≈ queen).
Word2Vec (Google, 2013) — two flavors: CBOW (predict word from context), Skip-gram (predict context from word).
GloVe (Stanford) — learns vectors by factorizing a global word co-occurrence matrix.
In 2026 you rarely train from scratch — use pre-trained.
```python
# Gensim for pre-trained Word2Vec / GloVe vectors
import gensim.downloader as api

# Download pre-trained Google News vectors (300 dimensions; large download on first run)
model = api.load('word2vec-google-news-300')

# King - man + woman ≈ ?
print(model.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))  # ≈ queen

# Note: similarity() raises KeyError if a word isn't in the pre-trained vocabulary
print(model.similarity('mumbai', 'airoli'))  # relatively high
print(model.similarity('mumbai', 'pizza'))   # lower
```
FastText (Facebook) — adds subword (character n-gram) info → handles typos and morphologically rich languages like Hindi/Marathi better.
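A minimal sketch of that subword behaviour, training gensim's FastText on a toy corpus just to show that even a misspelled, unseen word still gets a vector:

```python
# Illustrative only: train FastText on a tiny toy corpus to show subword handling.
from gensim.models import FastText

sentences = [
    ["mumbai", "traffic", "is", "bad"],
    ["airoli", "weather", "is", "nice"],
]
model = FastText(sentences, vector_size=50, min_count=1, epochs=10)

print(model.wv["traffic"][:5])  # vector for a word seen in training
print(model.wv["trafic"][:5])   # typo: still gets a vector built from its character n-grams
```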
4. Transformers & BERT Basics (using Hugging Face)
Transformers (the 2017 "Attention Is All You Need" paper) → self-attention instead of recurrence → parallel training and long-range dependencies. BERT (2018) — a bidirectional encoder pre-trained on masked language modelling + next-sentence prediction → deep contextual understanding.
Hugging Face's transformers library made all of this accessible — it's the de facto standard in 2026.
Install:
```bash
pip install transformers torch sentencepiece
```
Sentiment analysis with pre-trained BERT (easiest start)
```python
from transformers import pipeline

# Pre-trained multilingual sentiment model (predicts 1-5 star labels)
sentiment = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
)

texts = [
    "This recharge plan is amazing! Fast 5G in Airoli.",
    "Worst customer support ever, call never connects.",
    "Average experience, nothing special."
]

for text in texts:
    result = sentiment(text)
    print(text)
    print(result)  # label + score
```
Text classification (custom fine-tuning) — e.g., classify customer complaints as “Billing”, “Network”, “App”, “Fraud”.
```python
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset

# Example: use a small Hindi/English complaint dataset or IMDB for demo
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

# Assume the dataset has 'text' and 'label' columns (0 = negative, 1 = positive)
dataset = load_dataset("imdb")  # or your own CSV
tokenized_datasets = dataset.map(tokenize_function, batched=True)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Needed because metric_for_best_model="accuracy" below expects an 'accuracy' metric
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    compute_metrics=compute_metrics,
)

trainer.train()
```
After training → trainer.predict() or pipeline("text-classification", model=model, tokenizer=tokenizer).
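For quick inference after fine-tuning, a minimal sketch (reusing the model and tokenizer names from the training snippet above):

```python
# Minimal post-training inference sketch; label names depend on your fine-tuned model's config
from transformers import pipeline

clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(clf("The app keeps crashing after the latest update."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- map LABEL_0/LABEL_1 back to your class names
```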
5. Sentiment Analysis & Text Classification Projects
Mini Project 1: Hindi/English Customer Review Sentiment (India-relevant)
Dataset ideas:
- Flipkart / Amazon reviews (scrape or Kaggle multilingual)
- Twitter/X Hindi sentiment (use our earlier X tools if needed)
Workflow:
- Load reviews
- Use ai4bharat/indic-bert or bert-base-multilingual-cased for Hindi+English
- Fine-tune on labeled subset (positive/neutral/negative)
- Deploy a simple Streamlit app (Chapter 5 style; see the sketch after this list):
- Input box for review
- Output: sentiment + confidence
- Bonus: highlight keywords (attention visualization)
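A minimal Streamlit sketch for that app, assuming streamlit and transformers are installed; the multilingual sentiment checkpoint is just a placeholder until you fine-tune your own:

```python
# app.py -- run with: streamlit run app.py
# Placeholder model; swap in your fine-tuned IndicBERT / multilingual checkpoint.
import streamlit as st
from transformers import pipeline

@st.cache_resource
def load_model():
    return pipeline("sentiment-analysis",
                    model="nlptown/bert-base-multilingual-uncased-sentiment")

st.title("Customer Review Sentiment")
review = st.text_area("Paste a review (Hindi / English / Hinglish):")

if st.button("Analyse") and review.strip():
    result = load_model()(review)[0]
    st.write(f"Sentiment: {result['label']} | Confidence: {result['score']:.2f}")
```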
Mini Project 2: Complaint Ticket Classifier
Classes: Network Issue, Billing, App Crash, Fraud Alert, Other
- Use RoBERTa or DistilBERT (faster)
- If starting zero-shot, pass the class names (plus domain keywords) as candidate labels (see the sketch after this list)
- Evaluate: F1-macro (imbalanced classes)
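A zero-shot starting point before you have labelled tickets, as a minimal sketch; facebook/bart-large-mnli is one common choice, not a requirement:

```python
# Zero-shot ticket routing sketch; no labelled training data needed to get started.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["Network Issue", "Billing", "App Crash", "Fraud Alert", "Other"]
ticket = "I was charged twice for my monthly recharge, please refund."

result = classifier(ticket, candidate_labels=labels)
print(result["labels"][0], round(result["scores"][0], 2))  # top predicted class + score
```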
2026 India tips:
- Handle code-mixed Hindi-English (Hinglish) — IndicBERT, MuRIL, or IndicGLUE models.
- Low-resource → few-shot with SetFit or prompt-tuning.
- Production → ONNX export or FastAPI + Hugging Face Inference API (minimal FastAPI sketch below).
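For the FastAPI route, a minimal serving sketch, assuming fastapi and uvicorn are installed; again the model name is a placeholder for your fine-tuned checkpoint:

```python
# api.py -- run with: uvicorn api:app --reload
# Placeholder model; load your fine-tuned checkpoint in production.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("sentiment-analysis",
                      model="nlptown/bert-base-multilingual-uncased-sentiment")

class Review(BaseModel):
    text: str

@app.post("/predict")
def predict(review: Review):
    result = classifier(review.text)[0]
    return {"label": result["label"], "score": float(result["score"])}
```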
That’s Chapter 11 — you now have the full pipeline from raw text to deployed classifier!
Practice:
- Run the sentiment pipeline on 10 recent Zomato/Swiggy reviews you copy-paste.
- Fine-tune DistilBERT on a Kaggle sentiment dataset (e.g., Twitter US Airline Sentiment).
- Build a quick Streamlit app that takes customer feedback and predicts category/sentiment.
