Chapter 20: Bash Pattern Scan (awk)
What is awk? (super simple first)
awk is named after its creators (Aho, Weinberger, Kernighan). It's a text-processing language available on virtually every Linux/Unix system, built for scanning patterns and running actions on structured text (CSV, logs, tables).
Think of it as:
- grep on steroids (finds patterns + does calculations/changes)
- A mini programming language that reads line-by-line, splits into fields, and lets you print, calculate, filter, summarize
Most people use awk for:
- Extract columns from files
- Sum/average/count numbers
- Process logs (find errors, count IPs)
- Format reports
- Replace text smarter than sed
- One-liners that save hours in scripts
Basic Structure (memorize this!)
```
awk 'pattern { action }' file.txt
```
- pattern → when to run the action (optional)
- action → what to do (in { } curly braces)
- If no pattern → action runs on every line
- If no action → default is { print $0 } (print whole line)
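Both defaults are easy to verify on a throwaway file (the file name here is just an example):

```shell
# One sample line to play with
printf 'alpha beta gamma\n' > demo.txt

# No pattern: the action runs on every line
awk '{print "got a line"}' demo.txt
# got a line

# No action: the default is {print $0}, so matching lines print whole
awk '/beta/' demo.txt
# alpha beta gamma

rm demo.txt
```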
Fields: awk automatically splits each line into $1, $2, $3… using spaces/tabs as separator (default)
- $0 = whole line
- $1 = first field
- $NF = last field
- NF = number of fields in line
- NR = current record/line number
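You can see all of these on a single throwaway line before touching any real file:

```shell
# Three fields on one line
echo "one two three" | awk '{print "whole:", $0}'    # whole: one two three
echo "one two three" | awk '{print "first:", $1}'    # first: one
echo "one two three" | awk '{print "last:", $NF}'    # last: three
echo "one two three" | awk '{print "count:", NF}'    # count: 3

# NR numbers the lines as they are read
printf 'alpha\nbeta\n' | awk '{print NR ":", $0}'
# 1: alpha
# 2: beta
```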
1. Create a test file right now (copy-paste)
```
cat > employees.txt << 'EOF'
101,Arman,Hyderabad,Developer,85000
102,Rahul,Delhi,Manager,120000
103,Priya,Bangalore,Tester,65000
104,Suresh,Hyderabad,Developer,90000
105,Anjali,Chennai,HR,70000
EOF
```
(Comma-separated values – a very common real-world format)
2. Super basic examples (try these now!)
```
# Print whole file (like cat)
awk '{print $0}' employees.txt

# Print only name (2nd field)
awk -F ',' '{print $2}' employees.txt
# Output:
# Arman
# Rahul
# Priya
# Suresh
# Anjali

# Print name and salary
awk -F ',' '{print $2 " earns " $5}' employees.txt
# Arman earns 85000
# Rahul earns 120000
# ...

# Print name + salary with header
awk -F ',' 'BEGIN {print "Employee Salary Report"} {print $2 " → ₹" $5}' employees.txt
```
-F ',' = change the Field Separator to a comma (very important!)
3. Most useful built-in variables
| Variable | Meaning | Example use |
|---|---|---|
| $0 | Whole line | print $0 |
| $1..$n | Individual fields | print $1, $3 |
| $NF | Last field | print $NF (last column) |
| NF | Number of fields in line | if (NF > 5) print "Too many fields" |
| NR | Line number (record number) | print NR ": " $0 |
| FNR | Line number in current file | Useful with multiple files |
| FS | Input Field Separator | -F ':' or BEGIN {FS=","} |
| OFS | Output Field Separator | BEGIN {OFS="\|"} |
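Two of these trip people up: NR keeps counting across all input files while FNR restarts at 1 for each file, and OFS only takes effect when print joins fields with commas. A quick sketch (file names here are throwaway examples):

```shell
# NR vs FNR across two files
printf 'a\nb\n' > f1.txt
printf 'c\n'    > f2.txt

awk '{print FILENAME, "NR=" NR, "FNR=" FNR}' f1.txt f2.txt
# f1.txt NR=1 FNR=1
# f1.txt NR=2 FNR=2
# f2.txt NR=3 FNR=1   <- FNR reset, NR kept counting

# OFS is inserted wherever print has a comma between fields
echo "x y" | awk 'BEGIN {OFS="-"} {print $1, $2}'
# x-y

rm f1.txt f2.txt
```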
4. Patterns (when to run action)
```
# Only lines where salary > 80000
awk -F ',' '$5 > 80000 {print $2 " is well paid!"}' employees.txt
# Rahul is well paid!
# Suresh is well paid!

# Lines containing "Hyderabad"
awk -F ',' '/Hyderabad/ {print}' employees.txt
# 101,Arman,Hyderabad,Developer,85000
# 104,Suresh,Hyderabad,Developer,90000

# First 3 lines only
awk 'NR <= 3' employees.txt

# Skip header if file had one
awk -F ',' 'NR > 1 {print $2}' data.csv
```
5. Calculations & Summaries (where awk shines)
```
# Sum of all salaries
awk -F ',' '{sum += $5} END {print "Total salary:", sum}' employees.txt
# Total salary: 430000

# Average salary
awk -F ',' '{sum += $5; count++} END {print "Average:", sum/count}' employees.txt
# Average: 86000

# Count developers
awk -F ',' '$4 == "Developer" {count++} END {print count " developers"}' employees.txt
# 2 developers

# Highest salary
awk -F ',' 'NR==1 {max=$5} $5 > max {max=$5} END {print "Highest:", max}' employees.txt
# Highest: 120000
```
BEGIN {} → runs before the first line; END {} → runs after the last line (perfect for totals)
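The same seeding trick used for the maximum works for a minimum: initialize min from the first line in an NR==1 rule, then lower it whenever a smaller value appears.

```shell
# Lowest salary: seed min from line 1, then compare every line
awk -F ',' 'NR==1 {min=$5} $5 < min {min=$5} END {print "Lowest:", min}' employees.txt
# Lowest: 65000
```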
6. Real-world one-liners you will use daily
```
# Top 10 IP addresses from nginx/apache log
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -10

# Count errors in log
awk '/ERROR|FAIL/ {count++} END {print count " errors"}' app.log

# Print only date + message from syslog (common format)
awk '{print $1, $2, $3, $5, $6, $7, $8}' /var/log/syslog

# CSV → pretty table (change separator to tab)
awk -F ',' 'BEGIN {OFS="\t"} {print $1,$2,$3}' data.csv

# Remove duplicate lines (keep first occurrence)
awk '!seen[$0]++' file.txt

# Add line numbers
awk '{print NR "\t" $0}' notes.txt

# Extract emails from file
awk '{for(i=1;i<=NF;i++) if ($i ~ /@/) print $i}' contacts.txt
```
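The duplicate-remover deserves a closer look: seen is an associative array, seen[$0] is 0 (false) the first time a line appears, so !seen[$0] is true and the default {print $0} fires; the ++ then marks the line as seen, so later copies are skipped.

```shell
# First occurrence of each line survives, later copies are dropped
printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
# a
# b
# c
```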
7. Quick cheat-sheet table
| Goal | Command example (try it!) |
|---|---|
| Print column 2 (comma sep) | awk -F ',' '{print $2}' file.csv |
| Print columns 1 & 3 | awk -F ',' '{print $1, $3}' file.csv |
| Sum column 5 | awk -F ',' '{sum+=$5} END {print sum}' file.csv |
| Average column 5 | awk -F ',' '{s+=$5;n++} END {print s/n}' file.csv |
| Filter salary > 100000 | awk -F ',' '$5 > 100000' file.csv |
| Count matching lines | awk -F ',' '/Developer/ {c++} END {print c}' file.csv |
| Change separator to pipe | awk -F ',' 'BEGIN {OFS="\|"} {print $1,$2,$3}' file.csv |
| Process multiple files | awk '{print FILENAME ":" $0}' file1.txt file2.txt |
| Only lines with 5 fields | awk -F ',' 'NF == 5' data.txt |
| Top 5 most frequent words | awk '{for(i=1;i<=NF;i++) count[$i]++} END {for(w in count) print count[w], w}' file.txt \| sort -rn \| head -5 |
8. Pro tips from daily use
- Always use -F for CSV/colon/tab files
- Quote the program '…' in single quotes so the shell doesn't expand $
- Use BEGIN for headers, END for footers/totals
- awk is fast even on GB files
- Combine with sort | uniq -c | sort -rn for top-N reports
- For very complex logic → write full .awk script:
```
#!/usr/bin/awk -f
BEGIN { print "Report starts" }
{ total += $5 }
END { print "Total:", total }
```
Save as sum.awk, chmod +x sum.awk, then ./sum.awk -F ',' employees.txt
Now open your terminal and try these 3:
```
awk -F ',' '{print $2 " → " $5}' employees.txt
awk -F ',' '{sum += $5} END {print "Total pay: ₹" sum}' employees.txt
awk -F ',' '$3 == "Hyderabad" {print $2 " lives in Hyd!"}' employees.txt
```
Tell me what you see! Or ask:
- “How to find duplicate lines with awk?”
- “How to process JSON-like logs?”
- “Best awk for access.log analysis?”
We’ll build exact commands together! 😄
