202508291325
Status: #idea
Tags: #llm #machine_learning #knowledge_representation #clustering #ai
# Hierarchical clustering in LLM embedding space allows extraction of the LLM's knowledge ontology
Large Language Models encode semantic relationships in their embedding spaces, with semantically related concepts positioned closer together. By applying hierarchical clustering algorithms to these embeddings, we can extract the implicit knowledge ontology that the LLM has learned during training.
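As a minimal sketch of the idea (the model name, concept list, and library choices here are illustrative assumptions, not prescriptions): embed a set of concept terms with an embedding model, then run agglomerative clustering on the pairwise distances.

```python
# Minimal sketch: embed concept terms, then cluster them hierarchically.
# Assumes sentence-transformers and scipy; model and concepts are illustrative.
from scipy.cluster.hierarchy import linkage
from sentence_transformers import SentenceTransformer

concepts = ["dog", "cat", "sparrow", "eagle", "oak", "fern", "granite", "quartz"]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works
embeddings = model.encode(concepts)              # shape: (n_concepts, dim)

# Agglomerative clustering on cosine distance. Average linkage is used
# here because Ward linkage assumes Euclidean distances.
Z = linkage(embeddings, method="average", metric="cosine")
```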
This process reveals how the model organizes knowledge: which concepts it treats as related, how it nests hierarchical relationships (e.g., animal -> mammal -> dog), and which semantic distinctions it draws. The resulting cluster tree is, in effect, a map of the model's internal knowledge representation.
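Continuing the sketch above (reusing `Z` and `concepts` from it), scipy's `to_tree` converts the linkage matrix into a binary tree that can be walked to print the recovered hierarchy as an indented outline:

```python
from scipy.cluster.hierarchy import to_tree

def print_ontology(node, labels, depth=0):
    """Recursively print the cluster tree as an indented outline."""
    if node.is_leaf():
        print("  " * depth + labels[node.id])
    else:
        # Internal nodes are unnamed merges; the merge distance hints
        # at how coarse the implied category is.
        print("  " * depth + f"[cluster @ {node.dist:.3f}]")
        print_ontology(node.get_left(), labels, depth + 1)
        print_ontology(node.get_right(), labels, depth + 1)

print_ontology(to_tree(Z), concepts)
```

Cutting the tree at a few distance thresholds (scipy's `fcluster`) yields flat clusterings at different granularities, which is often closer to a usable ontology than the raw binary tree.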
This approach can be used for:
1. Understanding model behavior and biases
2. Knowledge graph construction
3. Curriculum design based on semantic similarity
4. Model interpretation and debugging
5. Transfer learning strategies
The quality of the extracted ontology depends on the model's training data, the layer and pooling used to obtain the embeddings, the distance metric, and the clustering algorithm.
## Related Ideas
- [[Pairwise distance matrices can be hierarchically clustered to form trees or ontologies]]
- [[Expert knowledge is vector-retrieval + graph search]]
- [[Learning is compression and compression is connection]]
- [[Encoder-Only Transformers with MLM Objective as Continuous Fuzzy String Matching]]
- [[Science is compression]]
---
# References