Chapter 6: SciPy Sparse Data
Today we’re talking about scipy.sparse — what people casually call “SciPy sparse matrices”, “SciPy sparse data”, or just “sparse in SciPy”.
This is the module that lets you handle very large matrices that are almost all zeros without exploding your computer’s memory or waiting forever for computations.
First — What does “sparse data” actually mean? (very simple)
A matrix/array is sparse when most elements are zero (or “empty”).
Examples from real life:
- Adjacency matrix of a social network graph → millions of users, but each person follows/connects to only ~100–500 others → 99.99% zeros
- Term-document matrix in text mining / NLP → vocabulary of 50,000 words × 1 million documents → almost every word appears in only a tiny fraction of documents
- Finite element stiffness matrix in engineering simulations → huge grid, but each node only interacts with its few nearest neighbors
- Recommender systems (Netflix, Amazon) → users × items matrix → most users have rated only a handful of movies/products
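The adjacency-matrix case is the easiest to sketch. Here is a minimal, purely illustrative example (the edge list and user count are made up) showing how only the existing connections get stored:

```python
import scipy.sparse as sp

# Hypothetical edge list for a tiny "social network": (follower, followed) pairs
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
n_users = 4

rows, cols = zip(*edges)
data = [1] * len(edges)          # 1 = "edge exists"

# Only the 4 edges are stored; the other 12 cells of the 4x4 grid cost nothing
A = sp.coo_array((data, (rows, cols)), shape=(n_users, n_users))
print(A.nnz)                     # 4 stored entries out of 16 possible
```

At a million users with ~100 connections each, the same idea stores ~10⁸ entries instead of 10¹².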
If you store these as normal NumPy arrays (ndarray), you waste gigabytes of RAM on zeros that do nothing.
SciPy sparse stores only the non-zero values + their positions → massive memory savings + often faster math on sparse structure.
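To make "non-zero values + their positions" concrete, here is what CSR actually keeps internally. The `data`, `indices`, and `indptr` attributes are the real CSR storage arrays:

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[4, 0, 7],
                  [0, 5, 0],
                  [9, 0, 3]])
A = sp.csr_array(dense)

# CSR keeps three small arrays instead of the full 3x3 grid:
print(A.data)     # non-zero values, row by row: [4 7 5 9 3]
print(A.indices)  # column index of each value:  [0 2 1 0 2]
print(A.indptr)   # where each row starts/ends in data: [0 2 3 5]
```

Row `i` lives in `data[indptr[i]:indptr[i+1]]`, which is exactly why row slicing is fast in CSR.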
Important change in recent SciPy (2024–2026 era)
SciPy used to call these sparse matrices (csr_matrix, coo_matrix, etc.). Now (SciPy 1.13 → 1.17+) the recommended types are sparse arrays (csr_array, coo_array, etc.):
- They behave more like NumPy arrays (better broadcasting, @ for matrix multiply, etc.)
- Old *_matrix classes still exist for backward compatibility
- New code → always prefer coo_array, csr_array, csc_array, etc.
(As of Feb 2026 → latest stable is SciPy 1.17.0 released Jan 2026)
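The behavioral difference matters most for multiplication. A quick sketch of the array semantics (NumPy-style `*` vs `@`):

```python
import numpy as np
import scipy.sparse as sp

A = sp.csr_array(np.array([[1, 0], [0, 2]]))
B = sp.csr_array(np.array([[3, 4], [5, 6]]))

print((A * B).toarray())   # element-wise (NumPy semantics)
# [[ 3  0]
#  [ 0 12]]

print((A @ B).toarray())   # matrix multiplication
# [[ 3  4]
#  [10 12]]
```

With the old csr_matrix classes, `*` meant matrix multiply, which is exactly the kind of silent difference the array types were introduced to remove.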
The seven main sparse formats in scipy.sparse (2026)
| Format | Class name | Best for / strengths | Weaknesses / avoid when | Construction style |
|---|---|---|---|---|
| COO | coo_array | Easy & fast construction from lists of (row, col, value) | Arithmetic & repeated access (slow) | Triplet lists — most flexible start |
| CSR | csr_array | Fast row slicing, matrix-vector multiply (Ax), arithmetic | Slow column slicing | Most common for final computations |
| CSC | csc_array | Fast column slicing, matrix-vector (xᵀA) | Slow row slicing | Good when working column-wise |
| LIL | lil_array | Fast incremental building / editing via indexing | Very slow arithmetic & conversion | Good for slowly filling a matrix |
| DOK | dok_array | Dictionary-like → convenient random access/insert | Slow arithmetic | Like a dict[(i,j)] = value |
| DIA | dia_array | Band-diagonal / tridiagonal matrices | Only useful for banded structure | Store offsets + diagonals |
| BSR | bsr_array | Block-structured (e.g. small dense blocks) | Overhead if blocks are tiny | Advanced – finite elements, etc. |
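Of these, DOK is the quickest to demonstrate, since it really does behave like a dict keyed by (row, col):

```python
import scipy.sparse as sp

# DOK: dict-like random access and insertion
D = sp.dok_array((3, 3))
D[0, 0] = 1.5
D[2, 1] = -2.0
D[0, 0] += 1.0       # updating an existing entry is cheap

print(D.nnz)         # 2 stored entries
print(D[0, 0])       # 2.5
A = D.tocsr()        # convert before doing any real math
```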
Golden rule most people follow in 2026:
- Build with COO, LIL, or DOK (easiest/fastest to construct)
- Convert to CSR or CSC for actual math/solving/linear algebra
- CSR → best for row-wise operations & most sparse.linalg solvers
- CSC → best for column-wise
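One practical reason the COO → CSR route works so well: COO allows duplicate (row, col) entries, and they are summed during conversion to CSR. This matches accumulation-style assembly (finite elements, counting co-occurrences, etc.):

```python
import scipy.sparse as sp

# Duplicate (row, col) pairs are allowed in COO...
rows = [0, 0, 1]
cols = [1, 1, 0]
vals = [2.0, 3.0, 4.0]
A = sp.coo_array((vals, (rows, cols)), shape=(2, 2))

# ...and get summed when converting to CSR
print(A.tocsr().toarray())
# [[0. 5.]
#  [4. 0.]]
```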
Let’s do real examples — copy-paste these into Jupyter
Always start like this:
```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as sla   # ← solvers live here
import matplotlib.pyplot as plt
```
Example 1 — Create a tiny sparse matrix three different ways
```python
# Way 1: COO from triplets (most common starting point)
rows = [0, 0, 1, 2, 2]     # i indices
cols = [0, 2, 1, 0, 2]     # j indices
values = [4, 7, 5, 9, 3]

A_coo = sp.coo_array((values, (rows, cols)), shape=(3, 3))

print(A_coo)
# <3x3 sparse array of type '<class 'numpy.int64'>'
#     with 5 stored elements in COOrdinate format>

print(A_coo.toarray())   # convert to dense view (only for small matrices!)
# [[4 0 7]
#  [0 5 0]
#  [9 0 3]]
```
Way 2: LIL — incremental filling (good when you build gradually)
```python
A_lil = sp.lil_array((3, 3), dtype=float)
A_lil[0, 0] = 4.2
A_lil[0, 2] = 7.1
A_lil[1, 1] = 5.0
A_lil[2, 0] = 9.0
A_lil[2, 2] = 3.0

A_csr = A_lil.tocsr()   # almost always convert to CSR after building
```
Way 3: From dense (only do this for small or testing!)
```python
dense = np.array([[0, 0, 1],
                  [2, 0, 0],
                  [0, 3, 0]])

A = sp.csr_array(dense)   # or sp.csc_array(dense), sp.coo_array(dense)
```
Example 2 — Memory savings (the wow moment)
```python
n = 5000
density = 0.005   # 0.5% non-zeros → very sparse

# Dense version (wasteful): 5000 × 5000 × 8 bytes = 200 MB of float64
dense = sp.random_array((n, n), density=density, format='coo').toarray()
print(f"Dense memory: {dense.nbytes / 1e6:.1f} MB")   # 200.0 MB

# Sparse CSR version (tiny!): only ~125,000 non-zeros are stored
A = sp.random_array((n, n), density=density, format='csr')
sparse_mb = (A.data.nbytes + A.indices.nbytes + A.indptr.nbytes) / 1e6
print(f"Sparse CSR memory: {sparse_mb:.2f} MB")

# → ~200 MB dense vs a couple of MB sparse
```
Example 3 — Solving Ax = b with sparse solver (real power)
```python
# Create a sparse 2-D Poisson matrix (common in physics/engineering)
from scipy.sparse import diags_array
from scipy.sparse.linalg import spsolve

N = 100                                 # grid is N × N → system is N² × N²
main_diag = 4 * np.ones(N * N)
off_diag = -1 * np.ones(N * N - 1)
off_diag[N-1::N] = 0                    # break connections at grid-row boundaries
far_diag = -1 * np.ones(N * N - N)      # couplings to the grid row above/below

A = diags_array([main_diag, off_diag, off_diag, far_diag, far_diag],
                offsets=[0, -1, 1, -N, N],
                shape=(N * N, N * N), format='csr')

b = np.ones(N * N)    # right-hand side
x = spsolve(A, b)     # sparse direct solver
print(x.shape)        # (10000,)
```
→ This solves a 10,000 × 10,000 system in seconds, using a few MB instead of the 10,000² × 8 bytes = 800 MB a dense version of A alone would need.
Quick decision table — which format when?
| Situation | Recommended start → final format |
|---|---|
| Building from lists of coordinates | COO → CSR |
| Adding/changing entries one by one | LIL or DOK → CSR |
| Need fast row access & most solvers | CSR |
| Need fast column access | CSC |
| Tridiagonal / banded matrix | DIA |
| Doing real linear algebra / eigenvalues | Convert to CSR + use sparse.linalg |
| Very large & never changing | COO (if just storing) or CSR |
Final teacher reminders (2026 style)
- Never do heavy math on COO/LIL/DOK — convert to CSR/CSC first
- Use @ for matrix multiplication: with the new sparse arrays, * is element-wise (the old *_matrix classes treated * as matrix multiply)
- For huge problems → look at scipy.sparse.linalg (cg, gmres, minres, lobpcg, eigsh, etc.)
- Check memory with A.data.nbytes + A.indices.nbytes + A.indptr.nbytes
- Official docs (excellent): https://docs.scipy.org/doc/scipy/reference/sparse.html and tutorial bits: https://docs.scipy.org/doc/scipy/tutorial/sparse.html
Now tell me — what kind of sparse problem are you dealing with (or curious about)?
- Building from edge list (graph)?
- Solving huge linear system?
- Text data / recommender matrix?
- Finite differences / PDE matrix?
- Converting from dense?
Say the word and we’ll do a more targeted, realistic 20–40 line example together. 🚀
