Dimensionality and distance traps (step 1/7) · unsupervised learning, embeddings, and recommenders

Dimensionality and distance

To group, recommend, or find "similar" records, a model needs a number for how far apart two feature vectors are. That number is a distance metric.

The two you will meet most:

Euclidean distance: straight-line distance. Square the difference in each dimension, add up the squares, take the square root. sqrt(sum((a_i - b_i)**2)).
Manhattan distance: grid distance. Add up the absolute difference in each dimension. sum(abs(a_i - b_i)).

Both reduce two vectors to one comparable number. Smaller distance = more similar.
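
To make the two formulas concrete, here is a minimal sketch in plain Python. The function names euclidean and manhattan and the sample vectors are made up for illustration; in real code you would more likely reach for a library routine.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Grid distance: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Illustrative vectors: per-dimension differences are 3, 4, and 0.
a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 3.0]

print(euclidean(a, b))  # 5.0  (sqrt(9 + 16 + 0))
print(manhattan(a, b))  # 7.0  (3 + 4 + 0)
```

Same pair of vectors, two different numbers. Which metric you pick changes who counts as "close", so it is worth choosing deliberately.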

Then comes the trap: the curse of dimensionality. As you add more dimensions, randomly placed points drift toward being equally far from each other. From any given point, the distance to its nearest neighbor and the distance to its farthest neighbor converge toward the same value, so the gap between them becomes small relative to the distances themselves. So "find the nearest neighbor", the whole idea behind kNN, clustering, and similarity search, quietly stops meaning much once you have hundreds of mostly-noise features. The builder fix is to cut dimensions down to the ones that carry signal (feature selection, embeddings, PCA) before trusting distances.
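
You can watch this effect in a small simulation. The sketch below is an illustration under stated assumptions, not a benchmark: points are drawn uniformly at random (pure noise, no signal), and the helper relative_contrast, the point count, and the seed are all made-up choices. It measures how much farther the farthest point is than the nearest one, relative to the nearest distance.

```python
import math
import random

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relative_contrast(n_dims, n_points=200, seed=0):
    """(farthest - nearest) / nearest, measured from one random query point.

    Assumption: every coordinate is uniform noise in [0, 1).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = [
        euclidean(query, [rng.random() for _ in range(n_dims)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

# A large value means "nearest" is clearly distinguishable from "farthest".
# A value near zero means every point is about equally far away.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(d), 3))
```

With noise-only features, the printed contrast should fall sharply as the dimension count grows. Real data with genuine structure usually degrades more slowly, which is the reason for keeping only the dimensions that carry signal.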