Chapter 18: Text Processing
What is Text Processing? (super simple first)
Text processing means doing smart things with written words using a computer.
Instead of just reading text like a human, the computer changes, cleans, cuts, counts, understands, or transforms the text so that:
- We can search it faster
- Find important names/dates/places
- Make chatbots answer questions
- Detect if a review is positive or negative
- Translate English to Telugu/Hindi
- Summarize long articles
- Remove spam emails
- Build AI that writes like humans
In short: Raw text (messy, full of junk) → Computer turns it into clean, organized data → We can do cool things with it.
Text processing has two big worlds:
- Classic / Basic text processing (1970s–2000s) → editing, searching, replacing in files (like grep, sed, awk in Bash)
- Modern / NLP text processing (2010s–now) → preparing text for AI/machine learning (tokenization, cleaning, stemming, etc.)
Today almost everyone means the modern NLP version when they say “text processing”.
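Since this chapter's examples are in Python, here is a quick sketch of what that classic grep/sed world looks like when mimicked in Python — the sample lines below are made up for illustration:

```python
import re

lines = ["error: disk full", "ok: started", "error: timeout"]

# grep-like: keep only lines matching a pattern
matches = [l for l in lines if re.search(r"error", l)]
print(matches)  # ['error: disk full', 'error: timeout']

# sed-like: substitute a pattern in every line
replaced = [re.sub(r"error", "ERR", l) for l in lines]
print(replaced)  # ['ERR: disk full', 'ok: started', 'ERR: timeout']
```

Same idea as the Bash tools, just inside a program instead of a pipeline.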
Why do we need Text Processing? (very important to understand)
Raw text is dirty like street food before washing:
- “Hello!!! How r u? 😊 I’m gr8!! #Hyderabad”
- “Hyderabad, Telangana, India – Pin: 500081”
- “I can’t wait for the new season of my favourite show!”
- “<html><body>Welcome to website!</body></html>”
Problems:
- Capital / small letters mixed
- Punctuation, emojis, hashtags
- Typos, slang (“r u” = “are you”)
- HTML tags, URLs
- Extra spaces, numbers we don’t want
- Words like “the”, “is”, “and” that don’t carry much meaning
Without cleaning → AI gets confused → bad results.
With good text processing → AI understands better → accurate sentiment, translation, search, etc.
Main Steps in Text Processing (the pipeline – learn this order!)
Most people follow these steps (in Python with NLTK / spaCy / simple code):
- Lowercasing (make everything small letters)
- Remove HTML tags / URLs / special characters
- Tokenization (break into words or sentences)
- Remove punctuation & numbers (sometimes keep, sometimes remove)
- Remove stop words (“the”, “is”, “and”, “of”…)
- Stemming or Lemmatization (reduce words to root form)
- (Optional advanced) Part-of-Speech tagging, Named Entity Recognition, etc.
Let’s see each one with real examples.
Step 1: Lowercasing
```
Before: "I LOVE Hyderabad!! It's an AWESOME city in Telangana."
After : "i love hyderabad!! it's an awesome city in telangana."
```
Simple Python:
```python
text = "I LOVE Hyderabad!!"
text.lower()  # → "i love hyderabad!!"
```
Why? Computer treats “Love” and “love” as different otherwise.
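You can see the problem directly by counting words with and without lowercasing:

```python
from collections import Counter

words = ["Love", "love", "LOVE"]
print(Counter(words))                     # three separate counts: Love, love, LOVE
print(Counter(w.lower() for w in words))  # Counter({'love': 3})
```

Without lowercasing, one word gets split into three different counts.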
Step 2: Remove junk (HTML, URLs, emojis sometimes)
Example text:
```
"Check this <a href='https://example.com'>link</a> 😊 #fun"
```
After cleaning:
```
"Check this link fun"
```
Python way (simple regex):
```python
import re
text = re.sub(r'<.*?>', '', text)    # remove HTML tags
text = re.sub(r'http\S+', '', text)  # remove URLs
```
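Note that those two substitutions alone still leave the emoji and the `#` behind. To reach the fully cleaned string shown above, we also need to strip non-alphanumeric characters and collapse the leftover spaces. A complete sketch (the extra regex is one common choice, not the only one):

```python
import re

def clean(text):
    text = re.sub(r'<.*?>', '', text)           # strip HTML tags
    text = re.sub(r'http\S+', '', text)         # strip bare URLs
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # strip emojis, '#', punctuation
    return ' '.join(text.split())               # collapse extra whitespace

print(clean("Check this <a href='https://example.com'>link</a> 😊 #fun"))
# Check this link fun
```

Be careful: the non-alphanumeric regex also deletes useful punctuation like apostrophes, so adjust it to your task.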
Step 3: Tokenization (most important step!)
Break text into small pieces (tokens = words or subwords)
```
"I can't wait for the new season!"

Tokens (word level):
["I", "ca", "n't", "wait", "for", "the", "new", "season", "!"]
# or better:
["I", "can't", "wait", "for", "the", "new", "season", "!"]

Sentence tokens:
["I can't wait for the new season!"]
```
Python with NLTK:
```python
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

text = "I can't wait for the new season!"
tokens = word_tokenize(text)
print(tokens)
# ['I', 'ca', "n't", 'wait', 'for', 'the', 'new', 'season', '!']
```
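Why not just split on spaces? Because plain `str.split()` only cuts at whitespace, so punctuation stays glued to words:

```python
text = "I can't wait for the new season!"
print(text.split())
# ['I', "can't", 'wait', 'for', 'the', 'new', 'season!']  ("season!" keeps the "!")
```

A real tokenizer separates the "!" into its own token, so "season" and "season!" don't become two different words.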
Step 4: Remove punctuation & stop words
Stop words = common boring words (very, the, is, are, in, on…)
```python
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
clean_tokens = [word for word in tokens
                if word.lower() not in stop_words and word.isalpha()]
# Example result: ['wait', 'new', 'season']
```
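If NLTK isn't handy, a hand-rolled stop list works for quick experiments — the set below is a tiny made-up subset, not NLTK's full list of ~180 English stop words:

```python
# Tiny hand-made stop list (illustrative subset only)
stop_words = {"i", "am", "in", "the", "is", "and", "of", "a"}

tokens = ["I", "am", "in", "Hyderabad"]
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)  # ['Hyderabad']
```

This matches the cheat-sheet example below: "I am in Hyderabad" keeps only the word that carries meaning.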
Step 5: Stemming vs Lemmatization (make words basic form)
Stemming = cut aggressively (fast but crude)
- running → run
- runs → run
- runner → run
- happiness → happi (sometimes ugly)
Lemmatization = smarter (uses dictionary, knows grammar)
- running → run
- runs → run
- better → good
- went → go
Python example:
```python
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem("running"))    # run
print(stemmer.stem("happiness"))  # happi

from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos='v'))  # run
print(lemmatizer.lemmatize("better", pos='a'))   # good
```
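Under the hood, stemming is basically a set of ordered suffix-stripping rules. Here is a toy sketch of the idea — deliberately simplified, so its outputs differ from the real Porter stemmer, which applies many more rules (e.g. fixing the double "n" in "runn"):

```python
def toy_stem(word):
    # Try a few suffixes in order; strip the first one that fits.
    for suffix in ("ness", "ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(toy_stem("running"))    # runn  (Porter would give "run")
print(toy_stem("runs"))       # run
print(toy_stem("happiness"))  # happi
```

Even this crude version shows why stemming is fast but "ugly": it never consults a dictionary, it just chops.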
Real-life examples where text processing is used
- Google search → understands your query after processing
- ChatGPT / Grok → reads your message after tokenization & cleaning
- Amazon reviews → sentiment analysis (positive/negative)
- WhatsApp spam filter
- News summarization apps
- Resume parser for job sites (extract name, skills, experience)
Quick cheat-sheet table
| Step | What it does | Example Before → After | Tool/Library (Python) |
|---|---|---|---|
| Lowercasing | All small letters | “Hello HYDERABAD” → “hello hyderabad” | .lower() |
| Remove HTML/URLs | Clean junk | “Hi https://x.com” → “Hi” | re.sub() |
| Tokenization | Split to words | “I love coding!” → [‘I’, ‘love’, ‘coding’, ‘!’] | nltk.word_tokenize() |
| Remove stop words | Delete boring words | “I am in Hyderabad” → [‘Hyderabad’] | nltk stopwords |
| Stemming | Cut to root (fast) | “running runs” → “run run” | PorterStemmer |
| Lemmatization | Proper root (smart) | “better went” → “good go” | WordNetLemmatizer |
Try it yourself right now!
Install once:
```
pip install nltk
```
Then in Python:
```python
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

text = "I can't believe how awesome Hyderabad is in 2026!!! 😊"

# Lower
text = text.lower()

# Tokenize
tokens = nltk.word_tokenize(text)

# Remove non-alpha & stop words
stop_words = set(nltk.corpus.stopwords.words('english'))
clean = [w for w in tokens if w.isalpha() and w not in stop_words]

print(clean)
# Something like: ['ca', 'believe', 'awesome', 'hyderabad']
```
