One Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces

Yandong Sun,¹ Qiang Huang,² Ziwei Xu,¹ Yiqun Sun,¹ Yixuan Tang,¹ Anthony K. H. Tung¹

Abstract

Embedding spaces are fundamental to modern AI, translating raw data into high-dimensional vectors that encode rich semantic relationships. Yet, their internal structures remain opaque, with existing approaches often sacrificing semantic coherence for structural regularity or incurring high computational overhead to improve interpretability. To address these challenges, we introduce the Semantic Field Subspace (SFS), a geometry-preserving, context-aware representation that captures local semantic neighborhoods within the embedding space. We also propose SAFARI (SemAntic Field subspAce deteRmInation), an unsupervised, modality-agnostic algorithm that uncovers hierarchical semantic structures using a novel metric called Semantic Shift, which quantifies how semantics evolve as SFSes evolve. To ensure scalability, we develop an efficient approximation of Semantic Shift that replaces costly SVD computations, achieving a 15 $\sim$ 30 $\times$ speedup with average errors below 0.01. Extensive evaluations across six real-world text and image datasets show that SFSes outperform standard classifiers not only in classification but also in nuanced tasks such as political bias detection, while SAFARI consistently reveals interpretable and generalizable semantic hierarchies. This work presents a unified framework for structuring, analyzing, and scaling semantic understanding in embedding spaces.

1 Introduction

Embedding spaces are foundational to modern AI systems, converting unstructured inputs–such as text, images, and time series–into dense vectors that capture semantic properties in a tractable format. By translating semantic similarity into geometric proximity, these spaces enable efficient comparison, retrieval, and manipulation across modalities. This property supports a wide range of applications, including knowledge retrieval (lewis2020retrieval; guu2020realm; ram2023context; asai2024self), personalized and diverse recommendations (gan2020enhancing; hirata2022solving; huang2024diversity; sun2024diversinews), and multimodal understanding (yu2019multimodal; luo2023semantic; zhang2024learnability; yu2023self). Yet, despite their centrality, embedding spaces are often treated as black boxes, limiting interpretability and constraining targeted adaptation for downstream tasks.

Research on understanding embedding spaces generally falls into two directions. The first analyzes geometric properties to enhance both representational quality and interpretability. In Natural Language Processing (NLP), post-processing methods have tackled issues such as anisotropy and instability (mu2018all; liu2019unsupervised), while studies on contextual embeddings reveal expressivity limits (ethayarajh2019contextual). Techniques like rotation-based alignment and probing further link embedding dimensions to interpretable concepts (park2017rotated; dufter2019analytical; dalvi2019one; clark2019does). Similar geometric challenges, e.g., feature collapse and variance concentration, are found in visual (chen2020simple; he2020momentum; grill2020bootstrap) and multimodal embeddings (radford2021learning; jia2021scaling).

The second direction explores latent semantic and hierarchical structures within embedding spaces. In NLP, embeddings have been mapped to external conceptual systems for flexible semantic interpretations (simhi2023interpreting), while in vision, researchers have identified neurons encoding abstract, multimodal concepts (goh2021multimodal). Unsupervised clustering (van2020scan; caron2020unsupervised) uncovers semantic groupings, and hierarchical classification approaches (deng2012hedging; dhall2020hierarchical) organize semantics into multi-level taxonomies.

Despite substantial progress, embedding spaces remain intrinsically opaque due to three fundamental challenges:

• Abstract Semantics: Embeddings inhabit high-dimensional spaces where complex, abstract relationships emerge, defying straightforward interpretation. Many existing methods enhance interpretability by restructuring embeddings, but in doing so, they often distort the native geometry, thereby undermining their practical utility.

• Lack of Explicit Structure: While semantics are inherently structured, real-world embeddings often exhibit diffuse and irregular patterns. Existing methods either overlook this latent structure or impose rigid taxonomies, limiting flexibility across tasks and domains.

• Limited Modality Generalization: Semantic meaning transcends text to include visual, auditory, and other modalities. However, most techniques are modality-specific and lack a unified framework for revealing semantic structures across diverse embedding spaces.

This paper investigates semantic structures directly within native embedding spaces, without re-embedding, restructuring, or imposing external constraints that alter their original geometry. To address the abstract and opaque nature of high-dimensional embeddings, we introduce a new semantic representation, Semantic Field Subspaces (SFSes), along with SAFARI (SemAntic Field subspAce deteRmInation), an unsupervised, modality-agnostic algorithm that discovers and organizes semantic structures hierarchically. Our key contributions are as follows:

• Interpretable Semantic Representation: We introduce SFSes, a context-aware, geometry-preserving representation that captures semantic meaning through local neighborhoods, offering interpretability without distorting the embedding space.

• Unsupervised Hierarchical Structure Discovery: We propose SAFARI, which leverages a novel Semantic Shift metric to uncover hierarchical structures. A scalable approximation of Semantic Shift enables SAFARI to process large datasets with minimal accuracy loss.

• Modality-Agnostic Generalization: SFSes and SAFARI are inherently modality-agnostic, uncovering hierarchical semantic structures across both text and image modalities without supervision or external ontologies.

We validate our framework on six real-world datasets across text and image modalities. SAFARI uncovers how local neighborhoods form meaningful global hierarchies, while SFSes outperform standard classifiers on text classification, deliver competitive image classification with far lower computational cost, and capture subtle semantics (e.g., political bias) often missed by conventional methods. Our Semantic Shift approximation achieves a 15 $\sim$ 30 $\times$ speedup over full SVD, with average errors below 0.01, ensuring both efficiency and accuracy. Together, SFSes and SAFARI form a unified, interpretable, and scalable framework for semantic understanding within embedding spaces.

2 Related Work

Research on understanding embedding spaces generally follows two directions: (1) structural analysis of geometric properties and (2) discovery of latent semantic hierarchies.

Structural Analysis of Embedding Spaces

Early work focused on the geometric properties of embeddings, particularly in NLP. mu2018all mitigated anisotropy by removing dominant principal components, while liu2019unsupervised stabilized embedding distributions by suppressing high-variance dimensions. ethayarajh2019contextual found that contextualized embeddings often cluster in narrow cones, limiting expressiveness. To improve interpretability, rotation-based alignment (park2017rotated; dufter2019analytical) and probing techniques (dalvi2019one; clark2019does) linked embedding dimensions to human-understandable concepts.

Similar geometric issues, such as feature collapse and variance concentration, also arise in visual representations from self-supervised models like SimCLR (chen2020simple), MoCo (he2020momentum), and BYOL (grill2020bootstrap). wang2020understanding analyzed such effects in contrastive learning, while feature visualization methods (olah2017feature; zhou2016learning) offered neuron-level insights. Recent advances in multimodal embeddings, e.g., CLIP (radford2021learning), ALIGN (jia2021scaling), and DeCLIP (li2022supervision), explore joint semantic alignment across modalities. Unlike these approaches, which often modify embedding spaces, SAFARI preserves native geometry while enhancing interpretability.

Semantic and Hierarchical Structure Discovery

A complementary line of research seeks to uncover semantic groupings and hierarchies within embedding spaces. In NLP, simhi2023interpreting maps embeddings into conceptual spaces grounded in knowledge bases, enabling flexible interpretations. In vision, goh2021multimodal found that certain CLIP neurons respond to abstract concepts shared across modalities. Unsupervised methods (van2020scan; caron2020unsupervised) discover coherent visual clusters but lack hierarchical organization. Conversely, hierarchical classification methods (deng2012hedging; dhall2020hierarchical) build multi-level structures using taxonomies like WordNet, yet depend on external supervision and predefined label trees. SAFARI bridges these gaps with a unified, unsupervised, and modality-agnostic framework that identifies semantic hierarchies directly from embedding spaces–without structural transformation or supervision.

3 Problem Formulation

Vector Space Foundation

We model embedding spaces as a vector space $\mathbb{R}^{d}$ . For ease of explaining the core concepts in this paper, we use natural language terms as illustrative examples, though all concepts introduced are modality-agnostic. Let $h:{\mathcal{T}}\rightarrow{\mathcal{E}}$ be a deep model that maps real-world terms ${\mathcal{T}}$ to embedding vectors ${\mathcal{E}}$ . A central assumption is that geometric distances reflect semantic similarities, a principle validated in various tasks (karpukhin2020dense; lewis2020retrieval; ram2023context). In this work, we adopt cosine distance as a proxy for semantic dissimilarity.

Definition 3.1 (Semantic Distance):

The semantic distance $d_{sem}(\cdot,\cdot)$ between two embedding vectors ${\bm{u}},{\bm{v}}\in{\mathbb{R}}^{d}$ is defined as $d_{sem}({\bm{u}},{\bm{v}})=1-\tfrac{\langle{\bm{u}},{\bm{v}}\rangle}{\left\lVert{\bm{u}}\right\rVert\left\lVert{\bm{v}}\right\rVert}$ .
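As a minimal sketch of Definition 3.1, the cosine-based semantic distance can be written with NumPy as follows. The function name `semantic_distance` is illustrative and not taken from the paper; zero vectors are rejected because the distance is undefined for them.

```python
import numpy as np


def semantic_distance(u, v):
    """Cosine distance d_sem(u, v) = 1 - <u, v> / (||u|| ||v||)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        # The definition divides by both norms, so zero vectors are excluded.
        raise ValueError("d_sem is undefined for zero vectors")
    return 1.0 - float(np.dot(u, v) / norm_product)
```

The value ranges from 0 (same direction) through 1 (orthogonal) to 2 (opposite direction).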

Challenge of Context-Dependent Meaning

Despite precise embeddings, semantic meaning depends on context. Linguistic theories such as semantic field theory and componential analysis (ullmann1957principles; nida2015componential) argue that meaning arises only through contextual associations.

Proposition 3.1 (Context-Dependent Meaning):

An embedding vector cannot be semantically interpreted in isolation.

Figure 1: Contextual interpretation of Apple: Meaning refines as more related terms are introduced.

Example 3.1:

The term Apple is semantically ambiguous and relies on context for disambiguation. As shown in Figure 1, it refers to a tech company when grouped with Mac, IBM, and Windows, but to a fruit when grouped with Apple Tree, Juice, and Banana. More contextual cues yield more precise interpretations, highlighting the context-dependent nature of semantics in embedding spaces. $\triangle$

Semantic Fields in Embedding Spaces

To model context-dependent semantics, we define Semantic Fields as sets of neighboring vectors that contextualize a target embedding vector. For instance, as shown in Example 3.1, the meaning of Apple becomes clearer when surrounded by neighbors like Mac, IBM, and Windows, forming a Semantic Field of Apple. This concept aligns with foundational embedding models such as Word2Vec (Mikolov2013Word2Vec) and BERT (devlin2019bert), where semantics arise from context. To formalize this, we distinguish general embedding vectors ( ${\bm{v}}$ ) from those representing real-world terms ( ${\bm{v}}_{t}$ ), where the latter excludes purely mathematical constructs (e.g., zero vectors).

Definition 3.2 (Semantic Field):

A set ${\mathcal{F}}$ of embedding vectors forms a Semantic Field of radius $\epsilon>0$ if there exists a central vector ${\bm{v}}_{t}\in{\mathcal{F}}$ such that for all ${\bm{u}}_{t}\in{\mathcal{F}}$ , we have $d_{sem}({\bm{u}}_{t},{\bm{v}}_{t})<\epsilon$ .
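One way to read Definition 3.2 operationally is as a radius query around a chosen central vector. The sketch below assumes the embeddings are stored as rows of an array and contain no zero vectors; the name `semantic_field` and its arguments are hypothetical.

```python
import numpy as np


def semantic_field(embeddings, center_index, epsilon):
    """Indices of all rows whose cosine distance to the central row is < epsilon."""
    E = np.asarray(embeddings, dtype=float)
    # Normalize rows so that cosine similarity reduces to a dot product.
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    distances = 1.0 - unit @ unit[center_index]
    return np.flatnonzero(distances < epsilon)
```

The central vector is always a member of its own field, since its distance to itself is zero.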

While Definition 3.2 allows us to examine local structures in an embedding space, it is insufficient for understanding the global organization of semantics.

Research Objective: From Local to Global Semantics

Our goal is to investigate how Semantic Fields collectively shape the global semantic structure of embedding spaces, uncovering how these local Semantic Fields relate, interact, and form coherent semantic hierarchies. To this end, we propose SAFARI, a principled framework that detects and analyzes hierarchical semantic structures by identifying boundaries between Semantic Fields. This method offers an interpretable lens for understanding how semantics are organized in high-dimensional embedding spaces.

4 Methodology

4.1 Semantic Field Representation

By Definition 3.2, a Semantic Field is the neighborhood of a target embedding vector. SAFARI identifies the structure of these neighborhoods. However, representing such structures is non-trivial due to the following two key challenges:

• Handling Expression Variants: The closest neighbors are often variants that are overly similar to the original, offering limited insight into its interpretation (mimno2017strange; ethayarajh2019contextual). For example, word synonyms or images of the same object under different lighting conditions contribute little to understanding the underlying concept. A robust representation should exhibit invariance to such variations.

• Delineating Semantic Field Boundaries: As a Semantic Field expands, it includes more embedding vectors in larger neighborhoods and thus provides richer context for more refined interpretations of the target embedding. However, since semantics naturally form hierarchies, there exists a boundary beyond which the Semantic Field contains embeddings diverse enough for it to shift from a concrete concept to a more abstract one. It is necessary to determine such boundaries to tell when the Semantic Field starts to represent a new concept.

As illustrated in Figure 2, starting from Coca-Cola, the nearest neighbor is Coke (a lexical variant), while broader semantic context only occurs with terms like Sprite and Pepsi, underscoring the difficulty of managing expression variants and identifying natural semantic boundaries.

Geometric Representation: Semantic Field Subspace (SFS)

To address these challenges, we introduce the Semantic Field Subspace (SFS), a low-dimensional subspace spanned by semantically related vectors. This geometric representation naturally absorbs expression variants via linear dependence and provides a compact, geometry-preserving abstraction of semantic content.

Definition 4.1 (Semantic Field Subspace (SFS)):

Let ${\mathcal{F}}=\{{\bm{v}}_{1},\cdots,{\bm{v}}_{n}\}$ be a Semantic Field. Its SFS is defined as: ${\mathbb{S}}_{{\mathcal{F}}}=\text{span}({\mathcal{F}})=\{\textstyle\sum_{i=1}^{n}\alpha_{i}{\bm{v}}_{i}\mid\alpha_{i}\in{\mathbb{R}}\}$ .

We compute the basis of ${\mathbb{S}}_{{\mathcal{F}}}$ via SVD on matrix ${\bm{M}}_{{\mathcal{F}}}=[{\bm{v}}_{1},\cdots,{\bm{v}}_{n}]$ , i.e., ${\bm{M}}_{{\mathcal{F}}}={\bm{U}}{\bm{\Sigma}}{\bm{V}}^{\top}$ . This yields a continuous, low-rank representation that captures semantic structure while remaining invariant to redundancy.
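The basis computation can be sketched as below. As an assumption, embeddings are stored as rows (an $n\times d$ matrix, the layout used in Section 4.3), so the right singular vectors give an orthonormal basis of the span. The name `sfs_basis` and the relative tolerance `rel_tol` used to discard numerically zero singular values are illustrative choices, not details from the paper.

```python
import numpy as np


def sfs_basis(field_vectors, rel_tol=1e-10):
    """Return (singular values, orthonormal basis rows) of span(field_vectors)."""
    M = np.asarray(field_vectors, dtype=float)  # shape (n, d), one embedding per row
    _, S, Vt = np.linalg.svd(M, full_matrices=False)
    if S[0] == 0.0:
        # All inputs are zero vectors, so the span is trivial.
        return S[:0], Vt[:0]
    # Linearly dependent inputs (e.g. rescaled variants) add no new direction.
    rank = int(np.sum(S > rel_tol * S[0]))
    return S[:rank], Vt[:rank]
```

For instance, adding a rescaled copy of an existing vector to the field leaves the number of returned basis vectors unchanged, which is the redundancy invariance described above.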

SFS Boundary Delineation via Semantic Shift

To delineate boundaries between Semantic Fields, we leverage the hierarchical nature of language semantics, where expanding fields lead to broader, more abstract concepts.

Proposition 4.1 (Hierarchical Semantic Structure):

Semantic hierarchies in natural language are reflected in the geometric structure of embedding spaces.

As a Semantic Field expands, a significant shift in meaning often indicates an evolution to a new field. We quantify this evolution via the Semantic Shift between two subspaces ${\mathbb{S}}_{{\mathcal{F}}_{x}}$ and ${\mathbb{S}}_{{\mathcal{F}}_{new}}$ :

Definition 4.2 (Semantic Shift):

The Semantic Shift between ${\mathbb{S}}_{{\mathcal{F}}_{x}}$ and ${\mathbb{S}}_{{\mathcal{F}}_{new}}$ is defined as:

$$\Delta F_{sem}({\mathbb{S}}_{{\mathcal{F}}_{x}},{\mathbb{S}}_{{\mathcal{F}}_{new}})=\textstyle\sum_{i}\Delta\sigma_{i}\cdot d_{sem}({\bm{v}}_{i},\tilde{{\bm{v}}}_{i}^{*}),\tag{1}$$

where $\Delta\sigma_{i}=|\sigma_{i}-\tilde{\sigma}_{i}|$ captures the dimensional importance shift in singular values $\sigma_{i}\in{\bm{\Sigma}}_{x}$ and $\tilde{\sigma}_{i}\in{\bm{\Sigma}}_{new}$ ; $d_{sem}({\bm{v}}_{i},\tilde{{\bm{v}}}_{i}^{*})$ captures directional change between basis vectors ${\bm{v}}_{i}\in{\bm{V}}^{\top}_{x}$ and their nearest counterparts $\tilde{{\bm{v}}}_{i}^{*}\in{\bm{V}}^{\top}_{new}$ .

Semantic Shift acts as a boundary criterion: A large shift suggests that the new subspace ${\mathbb{S}}_{{\mathcal{F}}_{new}}$ represents a more abstract concept that subsumes ${\mathbb{S}}_{{\mathcal{F}}_{x}}$ , whereas small values reflect refinements within the same semantic field.
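A direct, unoptimized reading of Equation 1 is sketched below. It assumes embeddings are stored as rows, pairs singular values by index, and takes the nearest counterpart of each basis vector to be the new basis vector with the smallest $d_{sem}$. The text does not say how the sign ambiguity of singular vectors is handled, so this sketch applies $d_{sem}$ exactly as defined; the function name is hypothetical.

```python
import numpy as np


def semantic_shift(field_x, field_new):
    """Sum over i of |sigma_i - sigma~_i| * d_sem(v_i, nearest v~*), as in Eq. 1."""
    _, s_x, vt_x = np.linalg.svd(np.asarray(field_x, dtype=float), full_matrices=False)
    _, s_new, vt_new = np.linalg.svd(np.asarray(field_new, dtype=float), full_matrices=False)
    k = min(len(s_x), len(s_new))
    # Rows of vt_* are unit vectors, so cosine similarity is a plain dot product.
    cosine = vt_x[:k] @ vt_new.T
    # The nearest counterpart has the largest cosine, i.e. the smallest d_sem.
    directional_change = 1.0 - cosine.max(axis=1)
    importance_shift = np.abs(s_x[:k] - s_new[:k])
    return float(np.sum(importance_shift * directional_change))
```

Comparing a field with itself gives zero, because every singular value is unchanged.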

4.2 The SAFARI Algorithm

Algorithm Overview

Building on Definitions 4.1 and 4.2, we propose SAFARI, an algorithm to uncover SFSes by monitoring Semantic Shifts during iterative clustering. At each step, SAFARI merges the nearest clusters, resulting in a new subspace. The algorithm then evaluates the Semantic Shift between the new subspace and the previous subspace and checks whether such a shift is significant. If so, the new subspace is identified as a new SFS that subsumes the previous subspaces.

Detailed Procedure

The pseudo-code is provided in Algorithm 1. SAFARI initializes by assigning each vector as a singleton cluster in a set $\Omega$ , and maintains a set $\Phi$ to store the discovered SFSes. It proceeds iteratively with the steps below until only one cluster remains ( $|\Omega|\leq 1$ ):

• Step 1: Cluster Merging. The two nearest clusters ${\mathcal{C}}_{x}$ and ${\mathcal{C}}_{y}$ are identified using Semantic Distance $d_{sem}({\mathcal{C}}_{x},{\mathcal{C}}_{y})$ , with centroids representing each cluster. They are then merged into a new cluster ${\mathcal{C}}_{new}$ , after which ${\mathcal{C}}_{x}$ and ${\mathcal{C}}_{y}$ are removed, and ${\mathcal{C}}_{new}$ is added to $\Omega$ .

• Step 2: SFS Delineation. SAFARI constructs the subspaces ${\mathbb{S}}_{new}$ and ${\mathbb{S}}_{x}$ for ${\mathcal{C}}_{new}$ and the larger cluster ${\mathcal{C}}_{x}$ , and computes Semantic Shift $\Delta F_{sem}({\mathbb{S}}_{x},{\mathbb{S}}_{new})$ . A sliding window of size $w$ tracks the recent $w$ values, computing mean $\mu$ and standard deviation $\tau$ . If the current $\Delta F_{sem}({\mathbb{S}}_{x},{\mathbb{S}}_{new})$ exceeds the dynamic threshold ( $\mu+3\tau$ ), ${\mathbb{S}}_{new}$ is added to $\Phi$ as a new SFS.

Input: Embedding set ${\mathcal{E}}\subset{\mathbb{R}}^{d}$ , window size $w$ ;
Output: Set $\Phi$ of Semantic Field Subspaces (SFSes);

$\Omega\leftarrow$ Initialize each ${\bm{v}}_{t}\in{\mathcal{E}}$ as its own cluster; $\mu\leftarrow 0$ ; $\tau\leftarrow 0$ ; $\Phi\leftarrow\varnothing$ ;
while $\left|\Omega\right|>1$ do
    $\triangleright$ Step 1: Cluster Merging
    $\{{\mathcal{C}}_{x},{\mathcal{C}}_{y}\}\leftarrow\operatorname*{arg\,min}_{{\mathcal{C}}_{i},{\mathcal{C}}_{j}\in\Omega}d_{sem}({\mathcal{C}}_{i},{\mathcal{C}}_{j})$ ;
    ${\mathcal{C}}_{new}\leftarrow{\mathcal{C}}_{x}\cup{\mathcal{C}}_{y}$ ;
    $\Omega\leftarrow\Omega\cup\{{\mathcal{C}}_{new}\}\setminus\{{\mathcal{C}}_{x},{\mathcal{C}}_{y}\}$ ;
    $\triangleright$ Step 2: SFS Delineation
    ${\mathcal{C}}_{x}\leftarrow\left|{\mathcal{C}}_{x}\right|>\left|{\mathcal{C}}_{y}\right|~?~{\mathcal{C}}_{x}~:~{\mathcal{C}}_{y}$ ;
    ${\mathbb{S}}_{x},{\mathbb{S}}_{new}\leftarrow\text{span}({\mathcal{C}}_{x}),\text{span}({\mathcal{C}}_{new})$ ;
    Compute $\Delta F_{sem}({\mathbb{S}}_{x},{\mathbb{S}}_{new})$ using Eq. 1;
    if $\Delta F_{sem}({\mathbb{S}}_{x},{\mathbb{S}}_{new})>\mu+3\tau$ then
        $\Phi\leftarrow\Phi\cup\{{\mathbb{S}}_{new}\}$ ;
    Update $\mu$ and $\tau$ using the last $w$ values of $\Delta F_{sem}$ ;
return $\Phi$ ;

Algorithm 1 SAFARI
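The procedure of Algorithm 1 can be sketched in Python as follows. This is a naive illustration under several assumptions rather than the authors' implementation: embeddings are rows of an array, the nearest pair is found by recomputing all centroid distances at every iteration, and discovered SFSes are returned as lists of member indices instead of subspace objects. The names `safari`, `_shift`, and `window` are illustrative. As in the pseudo-code, $\mu$ and $\tau$ start at zero, so the first positive shift passes the threshold test.

```python
from collections import deque

import numpy as np


def _shift(X_old, X_new):
    """Semantic Shift of Eq. 1 with embeddings as rows and d_sem applied as defined."""
    _, s_old, vt_old = np.linalg.svd(X_old, full_matrices=False)
    _, s_new, vt_new = np.linalg.svd(X_new, full_matrices=False)
    k = min(len(s_old), len(s_new))
    nearest_cos = (vt_old[:k] @ vt_new.T).max(axis=1)
    return float(np.sum(np.abs(s_old[:k] - s_new[:k]) * (1.0 - nearest_cos)))


def safari(embeddings, window=5):
    """Return the member-index lists of clusters flagged as new SFSes."""
    E = np.asarray(embeddings, dtype=float)
    clusters = [[i] for i in range(len(E))]  # Omega: one singleton per vector
    discovered = []                          # Phi, stored as sorted index lists
    recent = deque(maxlen=window)            # last w Semantic Shift values
    mu, tau = 0.0, 0.0
    while len(clusters) > 1:
        # Step 1: merge the two clusters whose centroids are nearest in d_sem.
        centroids = np.stack([E[c].mean(axis=0) for c in clusters])
        centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
        dist = 1.0 - centroids @ centroids.T
        np.fill_diagonal(dist, np.inf)       # never pair a cluster with itself
        x, y = np.unravel_index(np.argmin(dist), dist.shape)
        c_x, c_y = clusters[x], clusters[y]
        merged = c_x + c_y
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]

        # Step 2: compare the larger parent's subspace with the merged subspace.
        larger = c_x if len(c_x) > len(c_y) else c_y
        shift = _shift(E[larger], E[merged])
        if shift > mu + 3.0 * tau:
            discovered.append(sorted(merged))

        # Update the dynamic threshold statistics from the sliding window.
        recent.append(shift)
        mu, tau = float(np.mean(recent)), float(np.std(recent))
    return discovered
```

The quadratic pair search keeps the sketch short; a practical implementation would maintain the distance structure incrementally.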

Remarks on Design

The use of a sliding window ( $w$ ) and the dynamic threshold ( $\mu+3\tau$ ), which is obtained via parameter study, is essential to accurately identify SFSes because we empirically observe that the baseline values of Semantic Shifts grow gradually as the algorithm proceeds. The dynamic threshold allows SAFARI to adapt to such gradual change and effectively detect local spikes of real significance. Moreover, SAFARI adopts the hierarchical process to reflect the layered nature of semantic relationships: (1) It dynamically determines SFSes based on semantic structures, without relying on pre-defined cluster counts; (2) It reveals natural semantic hierarchies through the resulting dendrogram.

Example 4.1:

Consider a toy dataset of 11 textual terms. Figure 3 shows how SAFARI identifies SFSes. In the first three iterations, semantically close pairs (e.g., Macbook Air & Macbook Pro, PowerPoint & Excel, and Michael Jordan & Chicago Bulls) are merged without forming SFSes. In the 4th iteration, merging Apple with the Macbook cluster triggers a significant Semantic Shift, forming a new SFS. By the 8th iteration, a hierarchical structure emerges with an IT Companies subspace encompassing nested SFSes for Apple (IT Company) and Microsoft (IT Company). In the 9th iteration, the dynamic threshold prevents the merge between IT Companies and NBA, preserving semantic boundaries. $\triangle$

4.3 Efficient Approximation of Semantic Shift

Each iteration of SAFARI requires computing the Semantic Shift via full SVD on matrices of size $n\times d$ ( $d\leq n$ ), incurring a time complexity of $O(nd^{2})$ (halko2009finding; trefethen2022numerical). This becomes a major bottleneck in large-scale applications. To improve scalability, we propose a practical approximation. Let ${\bm{A}}_{x}$ and ${\bm{A}}_{y}$ be the matrices of a large cluster ${\mathcal{C}}_{x}$ and a small one ${\mathcal{C}}_{y}$ . Instead of computing full SVDs, we approximate the Semantic Shift between ${\mathbb{S}}_{x}$ and ${\mathbb{S}}_{new}$ as:

$$\Delta\tilde{F}_{sem}({\mathbb{S}}_{x},{\mathbb{S}}_{new})=\left\lVert{\bm{A}}_{y}\right\rVert_{2}\sigma_{max}({\bm{A}}_{x}),\tag{2}$$

where $\left\lVert{\bm{A}}_{y}\right\rVert_{2}$ is the spectral norm of ${\bm{A}}_{y}$ , and $\sigma_{max}({\bm{A}}_{x})$ is the largest singular value of ${\bm{A}}_{x}$ . This yields substantial speedups with negligible loss of accuracy (see Section 5.4).
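Equation 2 needs only two spectral norms. The sketch below uses `np.linalg.norm(..., 2)`, which returns the largest singular value of a 2-D array, purely for brevity. How the authors compute these quantities is not stated in this section, and the function name is hypothetical.

```python
import numpy as np


def approx_semantic_shift(A_x, A_y):
    """Approximate Semantic Shift of Eq. 2: ||A_y||_2 * sigma_max(A_x)."""
    A_x = np.atleast_2d(np.asarray(A_x, dtype=float))
    A_y = np.atleast_2d(np.asarray(A_y, dtype=float))
    # For a 2-D array, ord=2 yields the spectral norm (largest singular value).
    return float(np.linalg.norm(A_y, 2) * np.linalg.norm(A_x, 2))
```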

Theoretical Justification

Let ${\bm{A}}_{new}=[{\bm{A}}_{x}|{\bm{A}}_{y}]$ be the matrix representing the newly merged cluster ${\mathcal{C}}_{new}$ . We now justify the approximation by establishing two key theoretical results:

• An upper bound on the dimensional importance shift in singular values;

• A connection between directional change and the largest singular value of ${\bm{A}}_{x}$ .

Bounding Dimensional Importance Shift

We begin with the following result:

Theorem 4.1 (Bound on Dimensional Importance Shift):

Given matrices ${\bm{A}}_{x}$ and ${\bm{A}}_{y}$ with the same number of columns and assuming ${\bm{A}}_{x}$ has more rows than ${\bm{A}}_{y}$ , the shift in the $i$ -th singular value satisfies:

$$\Delta\sigma_{i}=|\sigma_{i}({\bm{A}}_{x})-\sigma_{i}({\bm{A}}_{new})|\leq\left\lVert{\bm{A}}_{y}\right\rVert_{2}.$$

Proof: The result follows from Weyl’s Theorem (weyl1912asymptotische), which bounds the change in singular values under additive perturbations. Consider the larger cluster represented by matrix ${\bm{A}}\in{\mathbb{R}}^{m\times d}$ . When merging with another cluster, the resulting matrix ${\bm{\tilde{A}}}$ can be viewed as a perturbed version of ${\bm{A}}$ , i.e., ${\bm{\tilde{A}}}={\bm{A}}+{\bm{E}}$ , where ${\bm{E}}$ is the perturbation matrix.

Theorem 4.2 (Weyl’s Theorem (weyl1912asymptotische)):

For any perturbation matrix ${\bm{E}}$ , the singular values satisfy: $|\sigma_{i}({\bm{A}})-\sigma_{i}({\bm{\tilde{A}}})|=|\sigma_{i}({\bm{A}})-\sigma_{i}({\bm{A}}+{\bm{E}})|\leq\left\lVert\bm{E}\right\rVert_{2}$ .

This result implies that the change in any singular value is at most the spectral norm of the perturbation matrix, regardless of its dimension (stewart1998perturbation). To apply this, we rewrite ${\bm{A}}_{new}=[{\bm{A}}_{x}|{\bm{A}}_{y}]=[{\bm{A}}_{x}|{\bm{O}}]+[{\bm{O}}|{\bm{A}}_{y}]$ , where ${\bm{O}}$ is a zero matrix. According to Theorem 4.2, we have:

$$\begin{aligned}|\sigma_{i}([{\bm{A}}_{x}|{\bm{O}}])-\sigma_{i}({\bm{A}}_{new})|&=|\sigma_{i}([{\bm{A}}_{x}|{\bm{O}}])-\sigma_{i}([{\bm{A}}_{x}|{\bm{A}}_{y}])|\\&\leq\left\lVert[{\bm{O}}|{\bm{A}}_{y}]\right\rVert_{2}=\left\lVert{\bm{A}}_{y}\right\rVert_{2}.\end{aligned}$$

Since $\sigma_{i}([{\bm{A}}_{x}|{\bm{O}}])=\sigma_{i}({\bm{A}}_{x})$ , Theorem 4.1 is proved. ∎
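The bound of Theorem 4.1 can be checked numerically. The sketch below assumes embeddings are stored as columns so that $[{\bm{A}}_{x}|{\bm{A}}_{y}]$ is a horizontal concatenation; singular values are unaffected by transposition, so the row layout gives the same numbers. The helper name `weyl_check` is hypothetical.

```python
import numpy as np


def weyl_check(A_x, A_y):
    """Return (max_i |sigma_i(A_x) - sigma_i([A_x | A_y])|, ||A_y||_2)."""
    A_new = np.hstack([A_x, A_y])
    sv_x = np.linalg.svd(A_x, compute_uv=False)
    sv_new = np.linalg.svd(A_new, compute_uv=False)
    # Pad with zeros: [A_x | O] has the singular values of A_x plus extra zeros.
    k = max(len(sv_x), len(sv_new))
    sv_x = np.pad(sv_x, (0, k - len(sv_x)))
    sv_new = np.pad(sv_new, (0, k - len(sv_new)))
    max_shift = float(np.max(np.abs(sv_x - sv_new)))
    bound = float(np.linalg.norm(A_y, 2))
    return max_shift, bound


rng = np.random.default_rng(0)
A_x = rng.normal(size=(8, 20))  # larger cluster: 20 embeddings of dimension 8
A_y = rng.normal(size=(8, 5))   # smaller cluster: 5 embeddings of dimension 8
max_shift, bound = weyl_check(A_x, A_y)
print(max_shift <= bound)
```

By Weyl's Theorem the comparison holds for any pair of matrices with matching row counts, not only for this random draw.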

Approximating Directional Change

Intuitively, $\sum_{i}d_{sem}({\bm{v}}_{i},{\tilde{\bm{v}}}_{i}^{*})$ captures the perturbation in basis directions during cluster merging. Following (meyer2000matrix; belsley2005regression), it is expected that a higher sensitivity in ${\bm{A}}_{x}$ , typically characterized by its condition number $\kappa({\bm{A}}_{x})={\sigma_{max}({\bm{A}}_{x})}/{\sigma_{min}({\bm{A}}_{x})}$ , will result in larger directional changes when ${\bm{A}}_{x}$ and ${\bm{A}}_{y}$ are merged, i.e., $\sum_{i}d_{sem}({\bm{v}}_{i},{\tilde{\bm{v}}}_{i}^{*})=\mathcal{O}(\kappa({\bm{A}}_{x}))$ .

Since ${\bm{A}}_{x}$ and ${\bm{A}}_{y}$ represent embedding vectors sampled similarly from the embedding space, the impact of noise on these vectors is similar. Therefore, their minimum singular values can be assumed comparable, i.e., $\sigma_{min}({\bm{A}}_{x})\approx\sigma_{min}({\bm{A}}_{y})\approx\sigma_{min}({\bm{A}}_{new})$ . This allows us to discard the effects of minimum singular values, and the directional change can be approximated by $\sigma_{max}({\bm{A}}_{x})$ , i.e., $\sigma_{max}({\bm{A}}_{x})\approx\mathcal{O}(\kappa({\bm{A}}_{x}))\approx\sum_{i}d_{sem}({\bm{v}}_{i},{\tilde{\bm{v}}}_{i}^{*})$ .

While this approximation is necessarily coarse, it offers an efficient alternative for real-time clustering. Empirical results (Section 5.5) confirm its efficacy with minimal impact on performance.

5 Experiments

5.1 Datasets and Experiment Environment

Datasets

We evaluate SAFARI and SFSes on six public datasets across text and image modalities. For the text modality, we use five diverse datasets: AG-News (zhang2015character) (with 4 topic classes: Business, Sci/Tech, Sports, and World), AAPD (yang2018sgm), IMDB (maas2011learning), Yelp (https://www.yelp.com/dataset), and NewsSpectrum (sun2024diversinews). For the image modality, we employ MIT-States (isola2015discovering), which enables the evaluation of object-attribute composition in visual embeddings. Further dataset details are available in Appendix A.1.

Experiment Environment

All methods were implemented in Python 3.8 and evaluated on an Ubuntu 20.04 machine with an Intel® Xeon® Platinum 8480C CPU and an NVIDIA H100 GPU.

5.2 Hierarchical Semantic Structure Discovery

Experimental Setup

We evaluate SAFARI and the resulting SFSes on two modality-diverse datasets: AG-News for text and MIT-States for images. Embedding spaces are constructed using BLINK (wu2020scalable) for AG-News and CLIP (radford2021learning) for MIT-States. To benchmark hierarchical discovery, we generate 4-level semantic label hierarchies (Lv0 to Lv3) for both datasets using Claude 3.7 Sonnet (anthropic2025claude), with details provided in Appendix A.2. We assess how well the SFSes align with the reference hierarchy using the impurity metric:

$$\text{Impurity}=\textstyle\frac{1}{l}\sum_{i=1}^{l}\left(1-\frac{1}{|L_{i}|}\max_{1\leq j\leq c}|L_{i}\cap C_{j}|\right),$$

where $L_{i}$ is label class $i$ ; $C_{j}$ is cluster $j$ used to construct SFSes; $l$ and $c$ are the number of label classes and clusters. Lower impurity indicates greater semantic coherence within clusters, with 0 denoting perfect label concentration within clusters. As SAFARI progresses, it is expected that impurity grows due to the merging of semantically broader categories, with a consistent ordering across levels: Lv0 (most specific) $>$ Lv1 $>$ Lv2 $>$ Lv3 (most abstract).
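The impurity formula can be evaluated from two parallel arrays of per-sample label and cluster assignments. This is a minimal sketch of the formula as written; the function and argument names are illustrative.

```python
import numpy as np


def impurity(labels, clusters):
    """Average over label classes of 1 - (largest overlap with any cluster) / |L_i|."""
    labels = np.asarray(labels)
    clusters = np.asarray(clusters)
    classes = np.unique(labels)
    total = 0.0
    for label in classes:
        in_class = labels == label
        # Count how the members of this label class spread over the clusters.
        _, counts = np.unique(clusters[in_class], return_counts=True)
        total += 1.0 - counts.max() / in_class.sum()
    return total / len(classes)
```

When every label class falls entirely inside a single cluster the score is 0, and a class split evenly over two clusters contributes 0.5.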

Method	Text: Precision (%) $\uparrow$	Text: Recall (%) $\uparrow$	Text: F1-score (%) $\uparrow$	Text: Time (s) $\downarrow$	Image: Precision (%) $\uparrow$	Image: Recall (%) $\uparrow$	Image: F1-score (%) $\uparrow$	Image: Time (s) $\downarrow$
SAFARI	48.3	49.3	48.5	46.37	61.4	61.1	60.5	18.64
SVM	47.5	47.8	47.6	91.87	63.5	62.7	62.1	69.58
KNN	41.9	43.1	42.1	1.167	58.5	57.5	57.0	2.855
MLP	43.4	41.6	42.1	111.5	54.7	54.4	54.3	92.55
RF	35.7	40.3	37.8	35.64	56.0	55.3	54.0	105.6

Table 1: Classification results for text and image modalities. Bold and underlined denote the best and second-best scores, respectively. SAFARI leads on text classification and ranks second on image classification, balancing accuracy and efficiency.

Results and Analysis

Figure 4 shows a consistent decrease in impurity from Lv0 to Lv3 across iterations for both modalities. This trend reflects a hierarchical shift from specific to abstract semantics, confirming that SFSes capture coherent semantic groupings at multiple granularities.

The consistent patterns across text and image confirm the modality-agnostic nature of SAFARI. Without supervision, it uncovers increasingly abstract semantic relationships by tracking Semantic Shifts. By revealing how local neighborhoods compose global hierarchies, SAFARI offers a robust and generalizable framework for identifying hierarchical structures in embedding spaces, advancing our understanding of their inherent semantic organization.

5.3 Classification Across Modalities

Experimental Setup

We measure whether SFSes preserve meaningful semantics by testing their performance on classification tasks across text and image modalities. For text classification, we use four datasets: AG-News (4 topics), AAPD, IMDB, and Yelp, covering seven distinct classes. For image classification, we use MIT-States, filtering to 97 object classes with at least 240 samples each. We compare SAFARI against four standard classifiers: Support Vector Machine (SVM) (platt1999probabilistic; chang2011libsvm), K-Nearest Neighbors (KNN) (cover1967nearest; fix1985discriminatory), Random Forest (RF) (breiman2001random), and Multi-Layer Perceptron (MLP) (he2015delving; hinton1990connectionist). For classification using SFSes, we compute the distance between each test embedding and all identified SFSes, assigning the label of the nearest one. Results are presented in Table 1.

Results and Analysis

For text classification, SAFARI outperforms all baselines, surpassing SVM, the second-best, while using only around 50% of the computation time. In image classification, it ranks second to SVM in accuracy but runs 3.7 $\times$ faster, demonstrating a strong accuracy-efficiency trade-off. Overall, SAFARI delivers competitive performance across modalities with notable computational savings, confirming that SFSes offer effective and efficient semantic representations.

We further evaluate political bias detection on the NewsSpectrum dataset (details in Appendix B). Results show that SAFARI successfully captures nuanced ideological distinctions, where standard classifiers often fail, particularly on underrepresented political leanings, highlighting its robustness in modeling subtle, real-world semantics beyond surface-level topics.

5.4 Efficient Semantic Shift Approximation

Experimental Setup

To evaluate the efficiency of SAFARI’s approximate Semantic Shift computation (Equation 2), we compare it against the full SVD-based method (Equation 1), in terms of runtime. Following the setup in Section 5.3, we sample the top 2,000 entities from each dataset and perform hierarchical clustering. Figure 5 depicts the results, with runtime averaged over 10 independent runs.

Results and Analysis

As shown in Figure 5, the approximate method achieves a 15 $\sim$ 30 $\times$ speedup over full SVD across all classes, with consistently low variance (as indicated by the error bars). Despite the substantial acceleration, the average error between exact and approximate Semantic Shifts remains below 0.01 (on the order of $10^{-3}$), so the speedup comes at little cost in accuracy. These results confirm that our approximation is a fast, stable, and reliable alternative to full SVD, making SAFARI scalable for large datasets. A detailed analysis of the two Semantic Shift components, i.e., Dimensional Importance Shift and Directional Change, is provided in Appendix C.

5.5 Case Study: Discovering Semantic Hierarchies in AG-News

Experimental Setup

To illustrate SAFARI’s ability to uncover hierarchical semantics, we conduct a case study on the Sports category of the AG-News dataset, chosen for its structured, event-driven content. We apply SAFARI to the top 2,000 entity embeddings, computing Semantic Shifts at each iteration using both the exact (Equation 1) and approximate (Equation 2) methods. Figure 6 plots Semantic Shift curves between iterations 11,000 and 16,000, with notable evolutions at iterations 11,352 and 15,856. Figures 7 and 8 visualize the corresponding hierarchical groupings.

Results and Analysis

Figure 6 shows that the approximate Semantic Shift curve closely mirrors the exact one, achieving a Pearson correlation of 0.92. This strong correlation confirms the reliability of our efficient approximate method introduced in Section 4.3. Moreover, SAFARI’s sliding-window-based dynamic thresholding effectively adapts to the non-uniform fluctuation of shifts across iterations, enabling the discovery of subtle yet meaningful SFSes that would be missed by static thresholds. A further parameter study of the dynamic threshold mechanism in SAFARI is detailed in Appendix D.
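As a minimal illustration of the agreement check reported above, the Pearson correlation between two shift curves can be computed as sketched below. The arrays `exact` and `approx` are synthetic placeholders introduced here for illustration; they are not the paper's Semantic Shift values, and the helper name `pearson` is an assumption.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equally long 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both sequences
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Synthetic stand-ins: the "approximate" curve tracks the "exact" one with small noise.
rng = np.random.default_rng(0)
exact = np.abs(rng.normal(size=500)).cumsum()
approx = exact + rng.normal(scale=0.05, size=500)
r = pearson(exact, approx)
```

The same value can be obtained with `np.corrcoef(exact, approx)[0, 1]`; the explicit form is shown only to make the definition visible.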

SAFARI also reveals hierarchical semantic relationships with fine-to-coarse granularity. In Figure 7, early clusters capture specific U.S. university basketball and football teams, which gradually merge into broader categories like university sports teams. Figure 8 illustrates cross-national grouping: teams initially cluster by country, then merge into regional groupings. Notably, European teams (blue) consolidate more tightly than non-European teams (red), indicating structure-aware semantic abstraction. These results showcase SAFARI’s ability to track evolving semantic organization and uncover latent hierarchies in embedding spaces. Additional analysis is provided in Appendix E.

6 Conclusions

In this paper, we tackle the fundamental challenge of understanding the abstract and intricate structure of embedding spaces. We introduce SFSes as a structured representation that explicitly links embedding spaces to their underlying semantics. Leveraging hierarchical clustering and the concept of Semantic Shift, we develop SAFARI, an effective algorithm that uncovers hierarchical semantic structures while maintaining computational scalability through an efficient approximation of Semantic Shift. Through comprehensive experiments on six real-world datasets spanning text and image modalities, we show that SFSes improve performance on both standard classification tasks and subtle semantic challenges like political bias detection. SAFARI consistently reveals meaningful, modality-agnostic semantic hierarchies with minimal computational overhead. By bridging the gap between geometric embedding representations and their underlying semantics, this work opens new avenues for future research, such as semantic-aware embedding analysis and knowledge discovery.

Reproducibility Checklist

1. General Paper Structure

1.1.

Includes a conceptual outline and/or pseudocode description of AI methods introduced (yes/partial/no/NA) yes
1.2.

Clearly delineates statements that are opinions, hypothesis, and speculation from objective facts and results (yes/no) yes
1.3.

Provides well-marked pedagogical references for less-familiar readers to gain background necessary to replicate the paper (yes/no) yes

2. Theoretical Contributions

2.1.

Does this paper make theoretical contributions? (yes/no) yes
If yes, please address the following points:
- 2.2.
  
  All assumptions and restrictions are stated clearly and formally (yes/partial/no) yes
- 2.3.
  
  All novel claims are stated formally (e.g., in theorem statements) (yes/partial/no) yes
- 2.4.
  
  Proofs of all novel claims are included (yes/partial/no) yes
- 2.5.
  
  Proof sketches or intuitions are given for complex and/or novel results (yes/partial/no) yes
- 2.6.
  
  Appropriate citations to theoretical tools used are given (yes/partial/no) yes
- 2.7.
  
  All theoretical claims are demonstrated empirically to hold (yes/partial/no/NA) yes
- 2.8.
  
  All experimental code used to eliminate or disprove claims is included (yes/no/NA) yes

3. Dataset Usage

3.1.

Does this paper rely on one or more datasets? (yes/no) yes
If yes, please address the following points:
- 3.2.
  
  A motivation is given for why the experiments are conducted on the selected datasets (yes/partial/no/NA) yes
- 3.3.
  
  All novel datasets introduced in this paper are included in a data appendix (yes/partial/no/NA) NA
- 3.4.
  
  All novel datasets introduced in this paper will be made publicly available upon publication of the paper with a license that allows free usage for research purposes (yes/partial/no/NA) NA
- 3.5.
  
  All datasets drawn from the existing literature (potentially including authors’ own previously published work) are accompanied by appropriate citations (yes/no/NA) yes
- 3.6.
  
  All datasets drawn from the existing literature (potentially including authors’ own previously published work) are publicly available (yes/partial/no/NA) yes
- 3.7.
  
  All datasets that are not publicly available are described in detail, with explanation why publicly available alternatives are not scientifically satisficing (yes/partial/no/NA) NA

4. Computational Experiments

4.1.

Does this paper include computational experiments? (yes/no) yes
If yes, please address the following points:
- 4.2.
  
  This paper states the number and range of values tried per (hyper-) parameter during development of the paper, along with the criterion used for selecting the final parameter setting (yes/partial/no/NA) yes
- 4.3.
  
  Any code required for pre-processing data is included in the appendix (yes/partial/no) yes
- 4.4.
  
  All source code required for conducting and analyzing the experiments is included in a code appendix (yes/partial/no) yes
- 4.5.
  
  All source code required for conducting and analyzing the experiments will be made publicly available upon publication of the paper with a license that allows free usage for research purposes (yes/partial/no) yes
- 4.6.
  
  All source code implementing new methods have comments detailing the implementation, with references to the paper where each step comes from (yes/partial/no) yes
- 4.7.
  
  If an algorithm depends on randomness, then the method used for setting seeds is described in a way sufficient to allow replication of results (yes/partial/no/NA) NA
- 4.8.
  
  This paper specifies the computing infrastructure used for running experiments (hardware and software), including GPU/CPU models; amount of memory; operating system; names and versions of relevant software libraries and frameworks (yes/partial/no) yes
- 4.9.
  
  This paper formally describes evaluation metrics used and explains the motivation for choosing these metrics (yes/partial/no) yes
- 4.10.
  
  This paper states the number of algorithm runs used to compute each reported result (yes/no) yes
- 4.11.
  
  Analysis of experiments goes beyond single-dimensional summaries of performance (e.g., average; median) to include measures of variation, confidence, or other distributional information (yes/no) no
- 4.12.
  
  The significance of any improvement or decrease in performance is judged using appropriate statistical tests (e.g., Wilcoxon signed-rank) (yes/partial/no) no
- 4.13.
  
  This paper lists all final (hyper-)parameters used for each model/algorithm in the paper’s experiments (yes/partial/no/NA) yes

Appendix A Experiment Details

Dataset Details

We evaluate SAFARI on six widely used real-world datasets spanning both text and image modalities, demonstrating its effectiveness and generalizability across domains. For the text modality, we use five diverse datasets:

•

AG-News (zhang2015character) consists of over 1 million articles from 2,000+ sources, categorized into four topics: Business, Sci/Tech, Sports, and World. This serves as our primary dataset due to its scale and well-defined semantic structures.
•

AAPD (yang2018sgm) contains 55,840 arXiv abstracts from computer science, each labeled with one or more subject areas, supporting multi-label classification tasks.
•

IMDB (maas2011learning) comprises 50,000 movie reviews split evenly into training and testing sets, designed for binary sentiment classification (https://www.imdb.com/).
•

Yelp includes 6.99 million user reviews with 150,000+ business attributes, enabling fine-grained semantic analysis (https://www.yelp.com/dataset).
•

NewsSpectrum (sun2024diversinews) contains 250,000 politically diverse news articles from Reddit. It offers a balanced distribution across the ideological spectrum, making it well-suited for studying abstract semantic phenomena such as political bias.

For the image modality, we use the MIT-States dataset (isola2015discovering), which comprises $\sim$ 53,000 images labeled with 245 object classes and 115 attribute states. It is specifically designed to evaluate models on object-attribute compositions and compositional generalization.

Hierarchical Label Structures For AG-News and MIT-States

To evaluate semantic coherence across multiple granularities, we construct four-level hierarchies for both datasets:

•

AG-News: Category $\rightarrow$ Subcategory $\rightarrow$ Semantic Group $\rightarrow$ Entity;
•

MIT-States: Category $\rightarrow$ Subcategory $\rightarrow$ Object $\rightarrow$ Attribute.

These hierarchies capture increasingly specific relationships, ranging from broad domains to specific entities in AG-News, as well as conceptual relationships such as Materials & Substances $\rightarrow$ Metals $\rightarrow$ Steel $\rightarrow$ Unpainted in MIT-States.

Prompts for AG-News

We employ a two-stage prompting strategy using Claude 3.7 Sonnet to construct the AG-News hierarchy. Directly prompting with the full entity list often causes omission due to context length limitations. To mitigate this, we begin with a small subset of entities and prompt Claude to generate a hierarchy with fixed top-level categories (Figure A1). Then, we iteratively expand the hierarchy by combining previous outputs with new subsets of entities (Figure A2), until all entities are processed.

Prompts for MIT-States

For MIT-States, we filter out images and categories that lack hierarchical depth, either due to having only one image (e.g., dog, car) or missing parent categories. The prompt used to construct valid hierarchical relationships is shown in Figure A3.

Hierarchical Labels for AG-News and MIT-States

The complete hierarchical label mappings for AG-News and MIT-States are provided in Tables A1 and A2, respectively. These serve as ground truth for evaluating semantic coherence at different levels of abstraction.

Table A1: Hierarchical labels for AG-News dataset.

Category	Subcategory	Semantic Group
Sports	Olympic Sports	Teams & Organizations, Events & Competitions, Athletes, Venues
	Basketball	Teams & Seasons, Players & Personnel, Venues, Organizations & Events
	American Football	Teams & Seasons, Players & Personnel, Venues & Concepts
	Baseball	Teams & Seasons, Players & Personnel, Events & Concepts
	Other Sports	Golf & Tennis, Racing & Motorsports, Combat Sports, Rugby & Cricket, Horse Racing, Soccer & Football, Swimming & Water Sports, Other Sporting Events, Other Sports Personnel, Winter Sports, Teams & Organizations, Players & Personnel, Venues & Events, Cricket & Rugby, Other Sports Events, Other Sports
	Soccer & Football	Teams & Organizations, Players & Personnel, Venues & Events
Business	Financial Services	Banking, Investment & Asset Management, Insurance & Risk Management, Consulting & Advisory
	Corporations & Industries	Manufacturing & Industrial, Retail & Consumer Goods, Automotive, Transportation & Logistics, Energy & Resources, Technology & Telecommunications
	Technology Companies	Software & IT, Security & Cybersecurity, Media & Entertainment, Hardware & Computing, Technology Services
	Retail & Consumer Goods	Retail Companies, Food & Beverage, Marketing & Advertising
	Corporate Entities	Corporations & Conglomerates, Executives & Entrepreneurs
World	Politics & Government	Political Figures, Government Organizations, Political Events & Issues, International Relations, Political Movements
	Cultural & Social	Literature & Writers, Social Groups, Media & Entertainment Figures, Arts & Culture, Social Issues
	Law & Justice	Legal Cases & Legislation, Legal Professionals, Crime & Legal Issues
	Military & Security	Military Conflicts, Organizations, Personnel, Security & Intelligence
Science-Tech	Space & Astronomy	Space Exploration, Astronomical Research, Astronomical Objects
	Computing & Technology	Software & Development, Hardware & Devices, Internet & Telecom, Digital Media, IT Infrastructure
	Medical & Health	Medical Technology, Research, Healthcare Organizations
	Environmental Science	Climate & Earth Sciences, Environmental Events

Table A2: Hierarchical labels for the MIT-States dataset.

Category	Subcategory	Object
Materials and Substances	Metals	Aluminum, Brass, Bronze, Copper, Metal, Steel
	Natural Materials	Clay, Cotton, Fabric, Foam, Paper, Paste, Plastic, Silk, Velvet, Wool
	Earth Elements	Concrete, Dirt, Granite, Ground, Mud, Rock, Sand, Stone
Food and Consumables	Proteins	Beef, Chicken, Fish, Meat, Salmon, Seafood
	Produce	Apple, Fruit, Pear, Tomato, Vegetable, Potato
	Prepared Foods	Bread, Cheese, Cookie, Eggs, Pie, Pizza, Soup
Built Environment	Structures	Building, Castle, Church, House, Wall
	Spaces	Bathroom, Kitchen, Room
	Furniture and Fixtures	Cabinet, Chair, Lightbulb, Tile
	Transportation Infrastructure	Highway, Road, Street
Nature	Bodies of Water	Lake, Pond, Pool
	Landforms	Canyon, Valley
	Flora	Forest, Plant, Redwood, Tree
	Sky Elements	Cloud, Sky
	Agricultural	Farm
Consumer Goods	Clothing and Accessories	Bracelet, Clothes, Coat, Dress, Necklace, Pants, Ribbon, Ring, Shirt, Shorts
	Household Items	Bag, Blade, Bottle, Camera, Carpet, Clock, Glass, Knife, Rope, Toy

Table A3: Detailed experiment settings.

Experiment	Datasets	Models / Embeddings	Parameters	Metrics	Baselines
Hierarchical Structure Discovery in Text	AG-News	BLINK	Dynamic thresholding with sliding window	Impurity	N/A
Hierarchical Structure Discovery in Image	MIT-States (53K images)	CLIP	Dynamic thresholding with sliding window	Impurity	N/A
Topic Classification	AG-News, AAPD, IMDB, Yelp	BLINK	Top-n entities	Precision, Recall, F1-score	SVM, KNN, RF, MLP
Image Classification	MIT-States	CLIP	Top-n dimensions (5%)	Precision, Recall, F1-score	SVM, KNN, RF, MLP
Computational Efficiency	AG-News Sampled	BLINK	Dynamic thresholding with sliding window	Runtime, Average error	SAFARI (Full SVD)
Political Bias Detection	NewsSpectrum	AnglE	Top-n entities	Runtime, F1-score	SVM, KNN, RF, MLP
Component Analysis of Semantic Shift	AG-News Sampled	AnglE	N/A	DIS/DC ratio, Mean, Std, Median	N/A
Parameter Study	AG-News Sampled	AnglE	Min window: 50-200, Std mul: 0.5-3.0	CV, P90/P10, Max/Min ratio	N/A

Detailed Experiment Settings

To comprehensively evaluate SAFARI and the effectiveness of SFSes, we conduct a series of experiments across diverse datasets and tasks, summarized in Table A3.

Hierarchical Structure Discovery

We apply SAFARI in a fully unsupervised setting across both text and image modalities. For the text modality, we use the AG-News dataset with four categories. To reduce noise from common entities (e.g., Reuters) that appear across all categories, we retain only those unique to each class. For the image modality, we use MIT-States, comprising 53,000 images across 245 object and 115 attribute classes. Four-level semantic hierarchies (Lv0 to Lv3) are generated for both datasets using Claude 3.7 Sonnet, as detailed in Appendix A.

We evaluate the discovered semantic structures using the impurity metric, which measures label heterogeneity within clusters, ranging from 0 (perfect homogeneity) to higher values (increased mixing). Successful hierarchy discovery is indicated by both impurity values that increase with iterations and a consistent Lv0 $>$ Lv1 $>$ Lv2 $>$ Lv3 ordering that mirrors the progression from specific to abstract concepts.
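To make the notion of label heterogeneity concrete, the sketch below computes one common impurity measure: the Shannon entropy of the labels in each cluster, averaged with cluster-size weights. This is an assumption made for illustration; the exact impurity definition used in the experiments may differ, and the function names and toy clusters are hypothetical.

```python
from collections import Counter
import math

def cluster_entropy(labels):
    """Shannon entropy (in bits) of the label distribution in one cluster."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def weighted_impurity(clusters):
    """Size-weighted mean entropy over clusters (each a list of labels).

    Assumed form of impurity: 0 when every cluster is label-pure,
    larger when clusters mix labels.
    """
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# Hypothetical clusterings labeled at the top (category) level.
pure = [["Sports"] * 4, ["Business"] * 3]
mixed = [["Sports", "Business", "Sports", "Business"], ["World"] * 2]
```

Under this definition, `pure` scores 0, while `mixed` scores above 0 because its first cluster is split evenly between two labels. Evaluating the same clusters against labels from each hierarchy level gives one impurity curve per level.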

Classification

SFSes are constructed using class-labeled training data. Test samples are classified by computing cosine distances to all SFSes and assigning the label of the nearest subspace. For text classification, we use all embedding dimensions weighted by singular values; for image classification, we use only the top 5% of dimensions without weighting. This distinction reflects the higher variability and noise (e.g., background features) in image embeddings, which make full-dimensional or weighted comparisons less effective. We report precision, recall, and F1 score, and compare against four standard baselines: Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and Multi-Layer Perceptron (MLP).

•

Topic Classification: We use four text datasets: AG-News, AAPD, IMDB, and Yelp. Entity embeddings are derived using BLINK (wu2020scalable) for entity linking, followed by TF-IDF-based filtering (schutze2008introduction; leskovec2020mining) to retain the top 2,100 entities.
•

Image Classification: We use the MIT-States dataset with CLIP (radford2021learning) to extract image embeddings. We retain 97 object classes with at least 240 samples each to ensure class balance.

For both modalities, embeddings are split into 80% training and 20% testing.
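The nearest-subspace rule described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each class subspace is taken to be the span of the top-k right-singular vectors of that class's training matrix, the cosine to a subspace is taken as the norm of the projection divided by the norm of the vector, and the optional weighting rescales projection coordinates by normalized singular values. The function names, `k`, and the toy data are hypothetical.

```python
import numpy as np

def fit_subspace(X, k):
    """Top-k right-singular vectors (rows) and singular values of a class matrix X."""
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k], s[:k]

def subspace_distance(x, basis, weights=None):
    """One minus the cosine of the angle between x and the span of `basis`.

    With weights, projection coordinates are first rescaled by normalized
    singular values, giving a weighted variant rather than a true cosine.
    """
    coords = basis @ x
    if weights is not None:
        coords = coords * (weights / weights.max())
    return 1.0 - float(np.linalg.norm(coords) / np.linalg.norm(x))

def classify(x, subspaces, weighted=False):
    """Return the label whose subspace is nearest to x."""
    def dist(label):
        basis, s = subspaces[label]
        return subspace_distance(x, basis, s if weighted else None)
    return min(subspaces, key=dist)

# Toy data: each class varies mainly along two coordinate axes of a 20-d space.
rng = np.random.default_rng(0)

def make_class(axes, n=50, d=20):
    X = 0.05 * rng.normal(size=(n, d))
    X[:, axes] += rng.normal(size=(n, len(axes)))
    return X

subspaces = {
    "A": fit_subspace(make_class([0, 1]), k=2),
    "B": fit_subspace(make_class([2, 3]), k=2),
}
query = np.zeros(20)
query[0], query[1] = 1.0, 0.5
predicted = classify(query, subspaces)
```

In this sketch, the unweighted call with a small `k` loosely corresponds to the image setting (a small fraction of directions, no weighting), while `weighted=True` with a larger `k` loosely corresponds to the text setting.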

Computational Efficiency

We apply SAFARI to the sampled AG-News dataset and compare the runtime of our approximate Semantic Shift computation against the full SVD-based variant. We report runtime and error rates to quantify the trade-off between efficiency and accuracy.

Political Bias Detection

To assess SAFARI’s capability in capturing nuanced ideological distinctions beyond surface-level topics, we conduct a political bias detection task using the NewsSpectrum dataset. The dataset contains articles classified into five political categories (Left, Lean Left, Center, Lean Right, and Right), with labels derived from media source affiliations rather than individual content analysis. Articles are embedded using the AnglE (li2024aoe) sentence transformer, and the model is evaluated in a supervised classification setup with an 80%/20% train-test split. Evaluation metrics include F1 score and training time, and SAFARI is again compared with the same four baselines as in the classification task.

Component Analysis of Semantic Shift

To understand the internal mechanisms of Semantic Shift, we conduct an experiment to assess the effectiveness of its two components: Dimensional Importance Shift (DIS) and Directional Change (DC) on the sampled AG-News dataset. To quantify their relative contributions, we report their mean, median, and standard deviation values.

Parameter Study

Finally, we perform a parameter study on SAFARI’s dynamic thresholding mechanism. We vary the minimum window size and test a range of standard deviation multipliers (0.5 to 3.0, with a step of 0.5). To evaluate the uniformity of detected semantic shifts, we report two metrics: Coefficient of Variation (CV) and Max/Min ratio.

Appendix B Political Bias Detection

Beyond conventional classification in Section 5.3, we explore the capability of SAFARI in detecting more abstract semantic patterns through political bias detection. Unlike topic classification, where semantic differences are often explicit and content-driven, political bias manifests in subtle linguistic choices and ideological framing that transcend specific topics. This presents a more challenging test for our method: can SFSes identify and represent the nuanced semantics that shape political orientation?

Experimental Setup

We employ NewsSpectrum to assess whether SFSes capture abstract semantic patterns that reflect ideological perspectives. Articles are categorized into five bias groups: Left, Lean Left, Center, Lean Right, and Right, though these labels are inherently imprecise as they are assigned based on media sources rather than content. We use AnglE embeddings (li2024aoe), which achieve state-of-the-art performance on the MTEB benchmark (muennighoff2023mteb). Since AnglE is a sentence-transformer model, these vectors represent articles rather than entities.

Following the setup in Section 5.3, we compare SAFARI against four standard classifiers: SVM, KNN, RF, and MLP, using 80% of the data for training and 20% for testing. The results are displayed in Figures B4 and B5.

Results and Analysis

The results in Figure B4 highlight that SAFARI achieves the best balance between classification performance and computational efficiency for political bias detection. SAFARI achieves the highest F1 score (0.45) while maintaining a low training time (21.9 seconds), making it the most practical choice. In comparison, MLP reaches a slightly lower F1 score (0.39) but incurs a significantly higher training time (1,033.28 seconds). Other classifiers exhibit varying efficiency-performance trade-offs: SVM and KNN both have an F1 score of 0.36, yet SVM demands extensive training time (27.29 hours), whereas KNN is the fastest (1.64 seconds) but offers weaker classification accuracy. RF performs the worst, with an F1 score of 0.29 and a substantial training time of 1,684.05 seconds. These results underscore the effectiveness of SAFARI, making it scalable and practical for political bias detection tasks.

Figure B5 further validates SAFARI’s superior and well-balanced performance in political bias detection compared to standard classifiers. SAFARI maintains stable F1 scores across all five categories (ranging from 0.37 to 0.51), whereas standard classifiers exhibit substantial fluctuations and consistently poor performance in lean positions. Notably, SVM and RF achieve only 0.01–0.04 F1 scores for Lean Left and Lean Right categories, with only KNN performing slightly better at 0.21. Furthermore, while some standard classifiers, such as SVM and MLP, show relatively higher F1 scores for center classification, they struggle significantly with Lean Left and Lean Right classifications. In contrast, SAFARI delivers robust and consistent performance across the entire political spectrum. This advantage stems from its unique approach of using SFSes to represent different political categories.

Unlike standard classifiers with rigid decision boundaries, SAFARI uses subspaces whose basis vectors naturally encode both shared and category-specific semantic patterns. Common linguistic structures or ideological overlaps between categories emerge as shared directions within the subspaces, while category-specific traits are preserved in distinct dimensions. This enables SAFARI to recognize when an article aligns semantically with multiple political categories, supporting nuanced, context-aware classification across the political spectrum.

Appendix C Component Analysis of Semantic Shift

Experimental Setup

To dissect the inner workings of Semantic Shift, we conduct a controlled experiment that quantifies the individual contributions of its two components: Dimensional Importance Shift (DIS) and Directional Change (DC). Our analysis is based on 9,778 valid samples drawn from SAFARI’s clustering process, which reveals distinct patterns in their respective impacts. The results are presented in Table C4.

Table C4: Statistics of the contributions of the DIS and DC components.

Metric	DIS	DC
Mean	5.60	3.19
Median	3.97	2.36
Standard Deviation	7.41	3.25
Log Contribution	59.3%	40.7%

Results and Analysis

Our findings indicate that DIS is the dominant component of Semantic Shift: its mean (5.60) is approximately 1.8 $\times$ that of DC (3.19). Both distributions are right-skewed, as evidenced by medians lower than means, which indicates the presence of significant outliers where DIS is dramatically more influential. In terms of log contribution, DIS accounts for 59.3% of the effect, while DC remains a substantial contributor at 40.7%. This confirms that both mechanisms play significant roles in Semantic Shift, with DIS exerting the stronger influence in most scenarios.

Appendix D Parameter Study about the Dynamic Threshold Mechanism in SAFARI

To robustly detect semantic transitions throughout clustering, SAFARI employs a dynamic thresholding mechanism based on a sliding window. Rather than using a fixed window size across all iterations, it applies a recursive divide-and-conquer strategy: the sequence of semantic shifts is iteratively split into smaller segments based on distributional imbalance between halves. This process continues until each segment either reaches a predefined minimum size or exhibits sufficiently balanced distribution. This adaptive strategy prevents extreme Semantic Shifts in later iterations from overshadowing earlier, subtler transitions, ensuring more balanced detection across the entire clustering process.

We study how two parameters affect the behavior of this dynamic threshold mechanism: (1) the Standard Deviation Multiplier (SDM) and (2) the Minimum Window Size (MWS) used to define the dynamic threshold. We evaluate the uniformity of detected semantic shifts across different settings to identify configurations that balance sensitivity to meaningful changes with temporal consistency.
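The recursive windowing described above can be sketched as follows. This is a hedged illustration, not SAFARI's actual procedure: the imbalance test (ratio of the two half-window means against a hypothetical `imbalance` parameter) and the per-segment threshold (mean plus SDM times standard deviation) are assumptions chosen to match the prose, and all function and parameter names are introduced here. Shift values are assumed to be non-negative.

```python
import numpy as np

def split_segments(shifts, lo, hi, min_window=50, imbalance=2.0):
    """Recursively split [lo, hi) while its two halves have imbalanced means."""
    n = hi - lo
    if n < 2 * min_window:  # halves would fall below the minimum window size
        return [(lo, hi)]
    mid = lo + n // 2
    left, right = shifts[lo:mid].mean(), shifts[mid:hi].mean()
    small, large = min(left, right), max(left, right)
    if large <= imbalance * max(small, 1e-12):  # halves are balanced enough
        return [(lo, hi)]
    return (split_segments(shifts, lo, mid, min_window, imbalance)
            + split_segments(shifts, mid, hi, min_window, imbalance))

def detect_shifts(shifts, sdm=3.0, min_window=50, imbalance=2.0):
    """Flag indices whose shift exceeds mean + sdm * std of their own segment."""
    shifts = np.asarray(shifts, dtype=float)
    flagged = []
    for lo, hi in split_segments(shifts, 0, len(shifts), min_window, imbalance):
        seg = shifts[lo:hi]
        threshold = seg.mean() + sdm * seg.std()
        flagged.extend(int(i) for i in lo + np.flatnonzero(seg > threshold))
    return flagged

# Synthetic sequence: small early shifts, large late shifts, one spike in each phase.
rng = np.random.default_rng(0)
early = 0.01 + 0.001 * rng.random(200)
late = 1.0 + 0.1 * rng.random(200)
shifts = np.concatenate([early, late])
shifts[50], shifts[300] = 0.1, 10.0

adaptive = detect_shifts(shifts)                    # per-segment thresholds
single_window = detect_shifts(shifts, imbalance=1e9)  # effectively one global threshold
```

On this synthetic sequence, the adaptive variant flags both spikes, whereas the single global threshold flags only the late one, which is the overshadowing effect the adaptive strategy is meant to prevent.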

Experimental Setup

To evaluate the uniformity of Semantic Shifts, we use two complementary metrics: (1) Coefficient of Variation (CV): measures dispersion relative to the mean; lower values imply greater uniformity. (2) Max/Min Ratio: captures the full range of detected shift magnitudes; smaller values indicate better balance across iterations. Results are shown in Table D5.
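The two uniformity metrics can be computed as sketched below. The sketch assumes strictly positive shift magnitudes and the population standard deviation; the input values are hypothetical and unrelated to Table D5.

```python
import numpy as np

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population std, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return float(values.std() / values.mean())

def max_min_ratio(values):
    """Largest value divided by the smallest; assumes all values are positive."""
    values = np.asarray(values, dtype=float)
    return float(values.max() / values.min())

# Hypothetical magnitudes of detected shifts.
detected = [0.2, 0.4, 0.4, 1.0]
cv = coefficient_of_variation(detected)   # 0.3 / 0.5 = 0.6
ratio = max_min_ratio(detected)           # 1.0 / 0.2 = 5.0
```

Lower values of both quantities indicate that the detected shifts are of comparable magnitude throughout the clustering process.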

Table D5: Performance metrics across standard deviation multipliers.

SDM	MWS	CV $\downarrow$	Max/Min Ratio $\downarrow$
3.0	50-200	3.75	1,540.6
2.5	50-200	4.42	1,811.3
2.0	50-200	5.10	2,193.1
1.5	50-200	5.84	2,778.6
1.0	50-200	6.83	3,599.7
0.5	50-200	6.78	4,384.8

Results and Analysis

The configuration with a standard deviation multiplier of 3.0 achieves the most uniform distribution of Semantic Shifts (CV = 3.75, Max/Min ratio = 1,540.6) across the tested minimum window sizes. This setting ensures that no single phase of the clustering process dominates the detection of SFSes. Overall, our parameter study reveals that achieving a uniform Semantic Shift distribution depends primarily on the standard deviation multiplier rather than on the minimum window size. The multiplier of 3.0 effectively prevents later iterations from dominating the analysis while maintaining sensitivity to meaningful semantic changes throughout the clustering process.

Appendix E More Analysis on Hierarchical Structure

We explore the differences between the hierarchical semantic structures identified by SAFARI in embedding spaces and the more intuitive hierarchies found in natural human language. In human language, semantics typically follow a logical hierarchy, progressing from specific, concrete entities to more abstract concepts, much like an ontology. However, in embedding spaces, this progression is not always intuitive. The distinction between specific and abstract depends more on the data and model than on human reasoning, often leading to groupings that diverge from what we would expect based on natural language understanding.

For example, as illustrated in Figure 7, U.S. basketball teams are first grouped with U.S. football teams, and later, sports teams from various locations are merged, as shown in Figure 8. This follows a logical hierarchical structure, from more specific categories to broader ones. Yet, as shown in Figure B6 (at iteration 19,790), entities such as horse racing clubs, companies, and events (e.g., Jockey Club) are merged with famous racehorses. This horse racing merge happens thousands of iterations after the U.S. football and basketball teams are merged. Under an ontology-like progression, a merge this late should correspond to a more abstract concept, yet horse racing is no more abstract than other sports.

These examples illustrate that the hierarchical structures emerging from embedding spaces are governed by the model’s learned representations rather than human-designed logic. While some align intuitively with natural semantic categories, others can be surprising–revealing how models encode relationships that reflect statistical regularities in the data rather than explicit reasoning. This underscores the need for cautious interpretation: semantic hierarchies derived from embeddings may not faithfully mirror human conceptual structures and should be analyzed with an awareness of the model’s inductive biases.