Geometry
In \( \mathbb{R}^n \) where \( n \) is large,
- The volume of the unit ball goes to 0 as the dimension goes to infinity (see the first sketch after this list).
- Most of the volume of an n-ball is concentrated in a thin shell near its boundary.
- Pick two vectors independently and uniformly at random on the surface of the unit ball. With high probability they are nearly orthogonal: their inner product concentrates around 0 at rate roughly \( 1/\sqrt{n} \) (second sketch below).
- Johnson-Lindenstrauss lemma: a set of \( m \) points can be embedded almost isometrically w.r.t. \( L_2 \) into a space of dimension \( O(\varepsilon^{-2} \log m) \), preserving all pairwise distances up to a factor \( 1 \pm \varepsilon \). The \( L_1 \) analogue does not hold. Look up the fast JL transform (a plain random-projection version is sketched below).
- Distances between random points concentrate: for points drawn i.i.d. from a fixed distribution, \( \frac{ \max_{x, y} d(x,y) - \min_{x, y} d(x,y) } { \min_{x, y} d(x,y) } \to 0 \) as \( n \to \infty \), where \( d \) is the \( L_p \) distance with \( p \geq 1 \) (last sketch below).
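A quick numerical check of the first two bullets (a minimal sketch; it uses the closed form \( V_n = \pi^{n/2} / \Gamma(n/2 + 1) \) for the unit-ball volume and the fact that volume scales as \( r^n \)):

```python
import math

def unit_ball_volume(n):
    # Closed form: V_n = pi^(n/2) / Gamma(n/2 + 1).
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in [1, 2, 5, 10, 20, 50, 100]:
    volume = unit_ball_volume(n)
    # Volume scales as r^n, so the fraction of the ball outside radius 0.99
    # (a 1%-thick shell at the boundary) is 1 - 0.99^n, which tends to 1.
    shell_fraction = 1 - 0.99 ** n
    print(f"n={n:3d}  volume={volume:.3e}  outer-1%-shell fraction={shell_fraction:.3f}")
```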
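A sketch of the near-orthogonality claim, assuming points on the sphere are sampled by normalizing i.i.d. Gaussian vectors (a standard construction); the sample sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(dim, count):
    # Normalizing i.i.d. Gaussian vectors gives points uniform on the unit sphere.
    x = rng.standard_normal((count, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

for dim in [3, 30, 300, 3000]:
    u = random_unit_vectors(dim, 1000)
    v = random_unit_vectors(dim, 1000)
    dots = np.abs(np.sum(u * v, axis=1))   # |<u_i, v_i>| for 1000 independent pairs
    # The inner product concentrates around 0 at rate roughly 1/sqrt(dim).
    print(f"dim={dim:5d}  mean |<u, v>| = {dots.mean():.3f}")
```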
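A sketch of the JL idea with a plain dense Gaussian projection (not the fast JL transform, which uses structured matrices); the point count, ambient dimension, and target dimension below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

m, d, k = 200, 10_000, 500        # m points in R^d, projected down to R^k
X = rng.standard_normal((m, d))

# Dense Gaussian JL map: project onto k random directions, rescaled by 1/sqrt(k)
# so that squared L2 distances are preserved in expectation.
R = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ R

def pairwise_sq_dists(Z):
    sq = np.sum(Z ** 2, axis=1)
    return sq[:, None] + sq[None, :] - 2 * Z @ Z.T

mask = ~np.eye(m, dtype=bool)
ratio = pairwise_sq_dists(Y)[mask] / pairwise_sq_dists(X)[mask]
print(f"squared-distance distortion: min={ratio.min():.3f}, max={ratio.max():.3f}")
```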
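A sketch of distance concentration for points drawn uniformly from \( [0,1]^n \) (sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def pairwise_dists(X):
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.sqrt(np.maximum(d2, 0))      # clip tiny negatives from rounding

for dim in [2, 20, 200, 2000]:
    X = rng.random((500, dim))             # 500 points drawn uniformly from [0,1]^dim
    d = pairwise_dists(X)[~np.eye(500, dtype=bool)]
    print(f"dim={dim:5d}  (max - min) / min = {(d.max() - d.min()) / d.min():.3f}")
```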
ML
Suppose the data has \( p \) features,
- Suppose that the features are binary categorical. The feature space then has \( 2^p \) points, so its cardinality grows exponentially with \( p \). If the size of the training set is fixed, only a vanishing fraction of the feature space is ever observed when \( p \) is large, so the model is highly likely to overfit (first sketch after this list).
- Suppose that the features are continuous and uniformly distributed in \( [0,1] \). The ball inscribed in the cube \( [0,1]^p \) occupies a vanishing fraction of the cube's volume as \( p \) grows, so most points are located outside it (second sketch after this list).
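A sketch of the coverage argument for binary features, with a hypothetical fixed training-set size and features drawn uniformly just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n_train = 1_000                                # hypothetical fixed training-set size
for p in [5, 10, 20, 40]:
    X = rng.integers(0, 2, size=(n_train, p))  # binary feature vectors, drawn uniformly
    distinct = len({tuple(row) for row in X})
    coverage = distinct / 2.0 ** p             # fraction of the 2^p feature space observed
    print(f"p={p:2d}  |feature space|=2^{p}  fraction observed={coverage:.2e}")
```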
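A Monte Carlo check of the inscribed-ball claim (sample sizes are arbitrary; the exact ratio, the volume of a ball of radius 0.5 over the unit cube's volume, is shown for comparison):

```python
import math
import numpy as np

rng = np.random.default_rng(0)

for p in [2, 5, 10, 20]:
    X = rng.random((100_000, p))                   # uniform in [0,1]^p
    # Inscribed ball: center (0.5, ..., 0.5), radius 0.5.
    inside = np.linalg.norm(X - 0.5, axis=1) <= 0.5
    exact = math.pi ** (p / 2) / math.gamma(p / 2 + 1) * 0.5 ** p
    print(f"p={p:2d}  P(inside inscribed ball): sample={inside.mean():.4f}  exact={exact:.2e}")
```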
References
- Princeton COS 521
- Why is Euclidean distance not a good metric in high dimensions?
- The Curse of Dimensionality in classification
- Domingos, “A Few Useful Things to Know about Machine Learning,” Communications of the ACM, 2012