Exploring the Topology, Geometry, and Linguistics of Large Language Models
As people increasingly rely on large language models (LLMs) to assist in crafting text for important tasks, it is critical that users and consumers of generative output understand the use cases for which LLMs are appropriate. Despite continuing advances in model development, there is little mechanistic understanding of when and why LLMs are performant, or the conditions under which they are not.
It has been hypothesized that the local topology of the input embedding of an LLM could reflect semantic properties of the words, an idea that appears to be consistent with the data. Words that have small local dimension or lie near singularities in the token subspace (both topological and geometric properties) are expected to play linguistically significant roles. Existing papers do not address differences between possible embedding strategies, nor the fact that different LLMs will use different embeddings. In short, while they acknowledge that topological properties can be estimated once the token subspace is found, they do not address how to find the subspace in the first place.
This talk demonstrates that when the subspace of tokens used by an LLM is constructed using the geometry induced by its latent space, the dimension and curvature of this subspace can be estimated reliably. The dimension varies within a connected component of this space, so the space of tokens cannot be a manifold but is instead a stratified manifold. Moreover, we find that within a stratum, the Ricci curvature is uniformly negative. These two findings explain why there appear to be instabilities in the training and use of large language models for query response, and moreover show that these instabilities are unavoidable in most cases.
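To make the idea of a local dimension that varies within one connected space concrete, the following is a minimal sketch on synthetic data, not the method used in the talk. It assumes plain Euclidean distance rather than the geometry induced by an LLM's latent space, and it uses a toy space (a square patch of plane joined to a line segment) in place of real token embeddings. The estimator relies on the standard observation that if the number of neighbors within radius r grows like r^d, then d is the slope of log(count) against log(radius). The names `local_dimension`, `mean_dimension_near`, and `toy_space` are illustrative only.

```python
import numpy as np


def local_dimension(points, index, k_min=20, k_max=200):
    """Estimate the local dimension at points[index].

    Fits the slope of log(k) against log(radius of the k-th nearest
    neighbor). Assumes Euclidean distance, which is a simplification.
    """
    dists = np.sort(np.linalg.norm(points - points[index], axis=1))[1:]  # drop self
    ks = np.arange(k_min, k_max + 1)
    radii = dists[ks - 1]  # radius that encloses exactly k neighbors
    slope, _ = np.polyfit(np.log(radii), np.log(ks), 1)
    return slope


def mean_dimension_near(points, location, n_centers=25):
    """Average the estimate over the sample points closest to `location`."""
    nearest = np.argsort(np.linalg.norm(points - location, axis=1))[:n_centers]
    return float(np.mean([local_dimension(points, i) for i in nearest]))


# Toy stratified space in 3D: a 2D square in the plane z = 0, joined at the
# origin to a 1D segment along the z-axis. The union is connected.
rng = np.random.default_rng(0)
plane = np.column_stack([rng.uniform(-1, 1, (5000, 2)), np.zeros(5000)])
line = np.column_stack([np.zeros((1000, 2)), rng.uniform(0, 2, 1000)])
toy_space = np.vstack([plane, line])

# Sample the estimate well inside each stratum, away from the junction.
dim_on_plane = mean_dimension_near(toy_space, np.array([0.5, 0.5, 0.0]))
dim_on_line = mean_dimension_near(toy_space, np.array([0.0, 0.0, 1.5]))
print(f"estimate on the plane stratum: {dim_on_plane:.2f}")
print(f"estimate on the line stratum:  {dim_on_line:.2f}")
```

In this toy case the estimate should come out close to 2 on the plane and close to 1 on the line, even though both pieces belong to one connected set, which is the kind of behavior that rules out a single manifold. Estimating curvature is more delicate and is not attempted here.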