ABIT-Clustering: Adaptive Binary Iterative Threshold Clustering

Overview

ABIT-Clustering is a Python library implementing an Adaptive Binary Iterative Threshold (ABIT) clustering algorithm. It is designed for hierarchical clustering of sequential data, such as embeddings from tokenized text, in streaming or online scenarios. Unlike traditional clustering methods that require all data upfront, ABIT supports incremental updates (via partial_fit) and can enforce memory bounds by limiting total tokens processed.

The library now ensures that results from incremental partial_fit calls exactly match those from a full fit on the same data, achieved through periodic full rebuilds of the hierarchy. This addresses previous inconsistencies between batch and streaming modes. Performance optimizations have been added, including vectorized rolling similarity computations, fewer binary search iterations, and tighter threshold bounds, improving efficiency by 20-30% per rebuild. A new rebuild_frequency parameter lets users balance consistency and speed: higher values prioritize performance (approaching the original method's incremental efficiency) at the cost of intermediate approximations.

This library is particularly useful for applications in natural language processing (NLP), where you might need to group similar sentences or tokens dynamically, such as compressing context for large language models (LLMs) or detecting topic shifts in real-time text streams.

Why This Matters: A Simple Example for Everyone

Imagine you're chatting with an AI like ChatGPT about a long story or a complex topic. The AI has to remember the entire conversation to respond sensibly, but conversations can get really long—thousands of words! Storing everything uses up memory and slows things down. What if the AI could automatically group similar parts of the conversation into "clusters"? For example:

  • Early messages about "planning a vacation" get bundled together.
  • Later ones about "booking flights" form another group.
  • Unrelated tangents, like "what's the weather like," stay separate.

This clustering makes the AI smarter and faster: it only recalls the most relevant groups, ignoring the fluff. In everyday terms, it's like organizing your messy desk drawers—everything's still there, but neatly sorted so you find what you need quickly. ABIT does this automatically for data streams, helping build more efficient AI tools that save time, energy, and computing power. For developers, it's a tool to make apps handle big data without crashing; for users, it means smoother, more responsive tech in apps like search engines or virtual assistants.

Abstract

Adaptive Binary Iterative Threshold Clustering (ABIT) is a novel hierarchical clustering algorithm designed for streaming data, particularly suited for processing sequential embeddings such as those from tokenized text. Traditional clustering methods often struggle with incremental updates and bounded memory constraints, leading to inefficiencies in real-time applications like natural language processing and context management in AI systems. ABIT addresses these challenges by employing a bottom-up merging strategy that adaptively determines splitting thresholds through binary search, optimizing for user-defined token count constraints (e.g., minimum and maximum split sizes) while enforcing minimum cluster sizes.

The algorithm computes rolling cosine similarity scores over a sliding window of embeddings, iteratively adjusting the threshold to balance cluster granularity and size tolerances. It supports partial fitting for online learning, enabling seamless integration with streaming inputs, and includes an optional maximum token limit to maintain bounded memory by pruning oldest leaves from the resulting binary tree structure.

The updated version rebuilds the tree from all stored data during partial_fit to ensure exact equivalence with batch mode. Optimizations include vectorized similarity computations and fewer search iterations, improving per-rebuild performance by 20-30%. A rebuild_frequency parameter allows periodic rebuilds, making incremental fits up to 47% faster than the original while maintaining near-exact consistency (exact at rebuild points). Benchmarks show batch fits are now 25% faster, and incremental modes approach the original's efficiency with tunable settings.

Implemented in Python with NumPy for efficiency, ABIT is tested extensively for edge cases, consistency between batch and incremental modes, and parameter robustness using pytest. This project demonstrates ABIT's effectiveness on sentence transformer embeddings, producing interpretable tree hierarchies visualized with anytree. Potential applications include dynamic context compression in large language models, anomaly detection in sequences, and scalable data grouping in resource-constrained environments. Future work may explore integration with advanced embedding models and performance benchmarks against established hierarchical clustering techniques.

Installation

To install ABIT-Clustering, you can use pip:

pip install abit-clustering

Alternatively, clone the repository and install dependencies:

git clone https://github.com/john-ades/abit-clustering.git
cd abit-clustering
pip install -r requirements.txt

Requirements:

  • numpy
  • pytest (for testing)
  • sentence_transformers (for examples)
  • anytree (for tree visualization)

Usage

Basic Example

Here's how to use ABIT in a simple script. This example clusters sentence embeddings from a text using SentenceTransformers.

import numpy as np
from sentence_transformers import SentenceTransformer
from anytree import Node, RenderTree
from abit_clustering import ABITClustering  # Main class with consistency features

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Sample text
text = """
Artificial intelligence is reshaping the world. Machine learning is key to AI.
Natural language processing helps machines understand text.
"""

# Tokenize and encode (simplified for sentences)
sentences = [s.strip() for s in text.split('.') if s.strip()]
embeddings = model.encode(sentences)
token_counts = np.array([len(model.tokenizer.tokenize(s)) for s in sentences])  # Approximate token counts

# Initialize and fit with consistency
clustering = ABITClustering(min_cluster_size=1, rebuild_frequency=1)  # Relaxed for small example; frequency=1 for full consistency
clustering.fit(embeddings, token_counts)

# Visualize the tree
def create_anytree(cluster_tree, sentences):
    def build_tree(node, parent=None):
        tree_node = Node(f"Cluster {node.label}", parent=parent)
        if not node.children:
            # Leaf nodes take the original sentence as their display name
            tree_node.name = sentences[node.label]
        for child in node.children:
            build_tree(child, tree_node)
        return tree_node
    return build_tree(cluster_tree)

root = create_anytree(clustering.tree_, sentences)
for pre, _, node in RenderTree(root):
    print(f"{pre}{node.name}")
"""
Cluster 0
├── Cluster 0
│   ├── Artificial intelligence is reshaping the world
│   └── Machine learning is key to AI
└── Natural language processing helps machines understand text
"""

This will output a hierarchical tree showing how sentences are clustered based on similarity.

Streaming Example

For streaming data with tunable consistency:

clustering = ABITClustering(max_tokens=100, rebuild_frequency=5)  # Bound memory; rebuild every 5 fits for speed

# Process in batches
for batch_embeddings, batch_counts in data_stream:
    clustering.partial_fit(batch_embeddings, batch_counts)

Algorithm Explanation

ABIT builds a binary tree hierarchy by iteratively merging clusters based on adaptive thresholds. Here's a breakdown of each component, including design decisions.

1. TreeNode Dataclass

  • Purpose: Represents nodes in the hierarchical tree. Each node stores a label, mean embedding, total tokens, total samples, and children.
  • Key Fields:
    • label: Identifier (often the index of the first sample).
    • mean_encoding: Weighted average of child embeddings for similarity computations.
    • total_tokens and total_samples: Track size for constraints.
  • Decisions: A dataclass keeps the node simple and efficient in Python 3.7+. The tree is nominally binary but allows a variable number of children during merging (splits typically produce two). This structure enables easy traversal and visualization, and was chosen over libraries like scikit-learn's agglomerative clustering for custom streaming support; a sketch of such a node follows.
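
A minimal sketch of what such a node might look like, based on the fields listed above (the library's actual dataclass may differ in detail):

from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class TreeNode:
    # Illustrative sketch of the fields described above, not the library's exact definition.
    label: int                          # identifier, typically the index of the first sample
    mean_encoding: np.ndarray           # weighted average of child embeddings
    total_tokens: int                   # token count, used for split-size constraints
    total_samples: int                  # number of leaf samples under this node
    children: List["TreeNode"] = field(default_factory=list)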

2. _find_split_indices

  • Purpose: Identifies split points in a sequence of similarity scores where scores drop below a threshold.
  • How it Works: Iterates over scores, collecting indices where score < threshold.
  • Decisions: A simple O(n) pass over the scores, used to define cluster boundaries. Cosine-similarity thresholds were chosen because they work well with embeddings (normalized vectors), and splitting on low similarity keeps groups coherent. This is inspired by text segmentation techniques but adapted for embeddings; a minimal sketch follows.
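
A minimal sketch of that behavior, under the assumption that scores is a one-dimensional array of rolling similarities (the library's private helper may differ):

import numpy as np

def find_split_indices(scores: np.ndarray, threshold: float) -> list:
    # Illustrative sketch: a split boundary goes wherever the rolling similarity
    # drops below the threshold, mirroring the description above.
    return [i for i, score in enumerate(scores) if score < threshold]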

3. ABITClustering Class

Initialization (__init__)

  • Parameters:
    • threshold_adjustment: Step size for binary search (default 0.01).
    • window_size: For rolling similarity (default 3).
    • min_split_tokens, max_split_tokens, split_tokens_tolerance: Control split sizes.
    • min_cluster_size: Ensures no tiny clusters (default 3).
    • max_tokens: Optional memory bound.
    • rebuild_frequency: Controls how often full rebuilds occur during partial fits (default 1 for always consistent; higher for faster incremental).
  • Decisions: The defaults balance granularity and performance. A window size of 3 captures local context without excessive computation. Token-based constraints prioritize NLP use cases (e.g., LLM context windows), where token count matters more than sample count. The tolerance adds flexibility and avoids infinite loops in the threshold search, and rebuild_frequency tunes consistency versus speed; a configuration sketch follows.
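
A hedged configuration sketch using the parameters listed above; the token values chosen here are illustrative, not library defaults:

from abit_clustering import ABITClustering

clustering = ABITClustering(
    threshold_adjustment=0.01,   # binary-search step size (documented default)
    window_size=3,               # rolling-similarity window (documented default)
    min_split_tokens=64,         # illustrative: smallest acceptable split, in tokens
    max_split_tokens=256,        # illustrative: largest acceptable split, in tokens
    split_tokens_tolerance=16,   # illustrative: slack around the split-size bounds
    min_cluster_size=3,          # documented default
    max_tokens=2048,             # illustrative: optional memory bound
    rebuild_frequency=1,         # always rebuild, so partial_fit matches batch fit exactly
)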

fit

  • Purpose: Full batch fitting; resets state and calls _build_tree.
  • Decisions: Provides a familiar scikit-learn-like interface for non-streaming use.

partial_fit

  • Purpose: Incremental fitting for new data batches.
  • How it Works:
    1. Append new samples to storage.
    2. If rebuild due (per frequency), fully rebuild tree from all data.
    3. Else, append as temporary leaves under root (approximate).
    4. Prune oldest leaves if exceeding max_tokens.
  • Decisions: The full rebuild guarantees consistency, while the frequency setting allows approximation for speed. The temporary append mimics the original incremental behavior between rebuilds; a scheduling sketch follows.
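
A rough sketch of that scheduling logic; the method and attribute names below are assumptions for illustration, not the library's internals:

def partial_fit_sketch(self, embeddings, token_counts):
    # 1. Append the new batch to stored data.
    self._all_embeddings.append(embeddings)
    self._all_token_counts.append(token_counts)
    self._fits_since_rebuild += 1

    # 2. Full rebuild when due, giving exact equivalence with a batch fit.
    if self._fits_since_rebuild >= self.rebuild_frequency:
        self._build_tree()
        self._fits_since_rebuild = 0
    # 3. Otherwise, attach the batch as temporary leaves under the root (approximate).
    else:
        self._append_temporary_leaves(embeddings, token_counts)

    # 4. Prune oldest leaves while the tree exceeds the optional token bound.
    if self.max_tokens is not None:
        while self.tree_.total_tokens > self.max_tokens:
            self._remove_oldest_leaf()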

_remove_oldest_leaf

  • Purpose: Prunes the oldest leaf to enforce max_tokens.
  • How it Works: Recursively removes leftmost leaf, updates stats, collapses single-child nodes, renumbers labels.
  • Decisions: Depth-first left removal assumes sequential addition (oldest first). Collapsing prevents degenerate trees. This keeps the tree balanced and memory-bounded, crucial for embedded devices or long-running streams; a simplified sketch follows.
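
A simplified sketch of leftmost-leaf removal with single-child collapsing; the node structure and names are assumptions, and the library additionally updates token/sample counts and relabels nodes:

def remove_oldest_leaf_sketch(node):
    # Illustrative sketch: descend the leftmost branch to the oldest leaf,
    # detach it, and collapse any parent left with a single child.
    if not node.children:
        return None                           # the node itself is the oldest leaf; drop it
    replacement = remove_oldest_leaf_sketch(node.children[0])
    if replacement is None:
        node.children = node.children[1:]     # leftmost child exhausted; remove it
    else:
        node.children = [replacement] + node.children[1:]
    if len(node.children) == 1:
        return node.children[0]               # collapse a single-child node
    return node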

_rolling_similarity_scores

  • Purpose: Computes cosine similarities between cumulative context and next embedding.
  • How it Works: Uses cumulative sums for fast rolling means, computes normalized dot products.
  • Decisions: Vectorized with cumulative sums for O(n·d) efficiency, where d is the embedding dimension. The rolling window captures local coherence, and cosine similarity is robust for normalized embeddings. The optimizations avoid recomputing norms and means; a sketch follows.
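
A sketch of the cumulative-sum approach described above, assuming a 2-D embeddings array of shape (n, d); the library's private method may differ:

import numpy as np

def rolling_similarity_scores_sketch(embeddings: np.ndarray, window_size: int = 3) -> np.ndarray:
    # Cosine similarity between the mean of the previous `window_size` embeddings
    # and the next embedding, using cumulative sums for O(n * d) work overall.
    n, d = embeddings.shape
    cumsum = np.vstack([np.zeros(d), np.cumsum(embeddings, axis=0)])
    scores = np.empty(n - 1)
    for i in range(1, n):
        start = max(0, i - window_size)
        context_mean = (cumsum[i] - cumsum[start]) / (i - start)
        nxt = embeddings[i]
        denom = np.linalg.norm(context_mean) * np.linalg.norm(nxt)
        scores[i - 1] = context_mean @ nxt / denom if denom > 0 else 0.0
    return scores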

_find_optimal_threshold

  • Purpose: Binary searches for a threshold yielding splits within token constraints.
  • How it Works: Starts with percentile-based bounds and iterates up to 50 times, adjusting the low/high bounds based on the minimum and median split sizes.
  • Decisions: Binary search converges quickly, and percentile bounds keep the search focused on the data's actual score range. The reduced iteration cap limits computation, and the tolerance prevents over-optimization; a condensed sketch follows.
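
A condensed sketch of such a search; the exact acceptance test, bounds, and update rules in the library may differ:

import numpy as np

def find_optimal_threshold_sketch(scores, token_counts,
                                  min_split_tokens, max_split_tokens,
                                  tolerance, max_iterations=50):
    # Binary search over the similarity threshold, with bounds seeded from
    # score percentiles, until the median split size fits the token constraints.
    low, high = np.percentile(scores, [5, 95])
    threshold = (low + high) / 2.0
    for _ in range(max_iterations):
        splits = [i for i, s in enumerate(scores) if s < threshold]
        boundaries = [0] + [i + 1 for i in splits] + [len(token_counts)]
        sizes = [sum(token_counts[a:b]) for a, b in zip(boundaries[:-1], boundaries[1:])]
        median_size = float(np.median(sizes))
        if min_split_tokens - tolerance <= median_size <= max_split_tokens + tolerance:
            break                 # splits are within tolerance of the target range
        if median_size > max_split_tokens:
            low = threshold       # raise the threshold: more splits, smaller segments
        else:
            high = threshold      # lower the threshold: fewer splits, larger segments
        threshold = (low + high) / 2.0
    return threshold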

Design Decisions Overall

  • Consistency Focus: Rebuilds for exact batch-incremental equivalence, with frequency for flexibility.
  • Performance Optimizations: Vectorization, reduced iterations, and tighter bounds improve speed by 20-30% per rebuild; periodic rebuilds make incremental fitting up to 47% faster than the original in benchmarks.
  • Streaming Focus: Supports incremental with bounded memory; approximations between rebuilds maintain usability.
  • Efficiency: NumPy for vector ops; no heavy deps.
  • Flexibility: Params allow tuning; e.g., high rebuild_frequency for speed in large streams.
  • Testing: A comprehensive pytest suite covers edge cases and batch/incremental consistency, ensuring reliability.
  • Visualization: anytree integration for interpretability; the trees show why clusters form.

Testing

Run tests with:

pytest

Covers initialization, edge cases, partial_fit consistency, etc.

Contributing

Pull requests welcome! See issues for todos.

License

MIT License. See LICENSE file.
