#k-means #machine-learning #science #hdbscan

avila-clustering

State-of-the-art clustering algorithms for Rust - surpassing scikit-learn, HDBSCAN, and RAPIDS cuML

1 unstable release

new 0.1.0 Nov 22, 2025

#360 in Machine learning


Used in avila

MIT/Apache

84KB
1.5K SLoC

"Crie as bibliotecas Rust de clustering e redução dimensional mais avançadas do ecossistema:


1. avila-clustering - Clustering de Última Geração

Objetivo: Superar scikit-learn, HDBSCAN, e RAPIDS cuML em algoritmos, performance e capacidades científicas.

Algoritmos Completos:

Partitional Clustering:

  • KMeans (Lloyd, Elkan, Mini-batch, K-means++)
  • KMeans++ initialization otimizado
  • KMedoids (PAM, CLARA, CLARANS)
  • Fuzzy C-Means (clustering probabilístico)
  • Mean Shift (mode seeking)
  • Affinity Propagation (message passing)

Density-Based:

  • DBSCAN (clássico + variantes)
  • HDBSCAN (hierarchical density-based)
  • OPTICS (ordering points)
  • DENCLUE (density estimation)

Hierarchical:

  • Agglomerative (single, complete, average, ward, centroid)
  • Divisive (top-down)
  • BIRCH (incremental hierarchical)

Model-Based:

  • Gaussian Mixture Models (GMM) (EM algorithm)
  • Bayesian GMM (variational inference)
  • Dirichlet Process GMM (infinite mixtures)

Graph-Based:

  • Spectral Clustering (normalized cuts)
  • Louvain (community detection)
  • Leiden (improved Louvain)
  • Label Propagation

Avançados:

  • CURE (hierarchical + representatives)
  • CHAMELEON (dynamic modeling)
  • ROCK (categorical data)
  • SUBCLU (subspace clustering)
  • CLIQUE (grid-based subspace)

API Avançada:

use avila_clustering::prelude::*;

// KMeans com todas as features
let kmeans = KMeans::builder()
    .n_clusters(5)
    .init_method(InitMethod::KMeansPlusPlus)
    .algorithm(Algorithm::Elkan)  // Mais rápido
    .max_iter(300)
    .tolerance(1e-4)
    .n_init(10)  // Múltiplas inicializações
    .random_state(42)
    .parallel(true)  // Rayon automático
    .gpu(true)  // CUDA/ROCm se disponível
    .build();

let result = kmeans.fit(&data)?;
let labels = result.labels();
let centroids = result.centroids();
let inertia = result.inertia();
let iterations = result.n_iter();

// Predição incremental
let new_labels = kmeans.predict(&new_data)?;

// HDBSCAN para dados astronômicos
let hdbscan = HDBSCAN::builder()
    .min_cluster_size(50)
    .min_samples(5)
    .cluster_selection_epsilon(0.5)
    .metric(Metric::Euclidean)
    .algorithm(Algorithm::BoruvkaKdTree)  // O(n log n)
    .prediction_data(true)  // Soft clustering
    .build();

let result = hdbscan.fit(&astronomical_events)?;
let labels = result.labels();  // -1 = noise
let probabilities = result.probabilities()?;
let outlier_scores = result.outlier_scores()?;
let condensed_tree = result.condensed_tree()?;

// GMM com seleção automática de componentes
let gmm = GaussianMixture::builder()
    .n_components(1..=10)  // Tenta de 1 a 10
    .covariance_type(CovarianceType::Full)
    .criterion(Criterion::BIC)  // BIC ou AIC
    .em_tolerance(1e-3)
    .max_iter(100)
    .n_init(5)
    .build();

let result = gmm.fit(&data)?;
let best_n = result.n_components();  // Escolhido por BIC
let responsibilities = result.predict_proba(&data)?;
let densities = result.score_samples(&data)?;  // log-likelihood

// Spectral Clustering para grafos
let spectral = SpectralClustering::builder()
    .n_clusters(3)
    .affinity(AffinityType::NearestNeighbors { n_neighbors: 10 })
    .eigen_solver(EigenSolver::Arpack)  // ou Lobpcg
    .assign_labels(AssignMethod::KMeans)
    .build();

let labels = spectral.fit_predict(&similarity_matrix)?;

// Hierarchical com dendrogram
let hierarchical = AgglomerativeClustering::builder()
    .n_clusters(Some(5))  // None = não corta
    .linkage(Linkage::Ward)
    .distance_threshold(None)
    .compute_full_tree(true)
    .build();

let result = hierarchical.fit(&data)?;
let labels = result.labels();
let dendrogram = result.dendrogram()?;
dendrogram.plot().save("tree.png")?;

Métricas Científicas Avançadas:

// Distâncias clássicas
pub enum Metric {
    Euclidean,
    Manhattan,
    Chebyshev,
    Minkowski(f64),
    Cosine,
    Correlation,

    // Científicas avançadas
    Mahalanobis(Array2<f64>),  // Com covariância
    Geodesic(Manifold),         // Em variedades curvas (GR)
    SpectralAngle,              // Astronomy (spectra comparison)
    DynamicTimeWarping,         // Time series alignment
    Haversine,                  // Great-circle (Earth/celestial sphere)

    // Strings/Sequences
    Hamming,
    Levenshtein,
    JaroWinkler,

    // Probabilísticas
    KullbackLeibler,
    JensenShannon,
    Wasserstein,
    Hellinger,

    // Custom
    Custom(Box<dyn Fn(&[f64], &[f64]) -> f64>),
}

// Uso
let dbscan = DBSCAN::new(0.5, 5)
    .metric(Metric::Geodesic(Manifold::Schwarzschild { mass: 1e6 }));

let labels = dbscan.fit(&black_hole_coordinates)?;

Validação e Métricas de Qualidade:

use avila_clustering::metrics::*;

// Internal metrics (sem ground truth)
let silhouette = silhouette_score(&data, &labels, metric)?;
let davies_bouldin = davies_bouldin_score(&data, &labels)?;
let calinski_harabasz = calinski_harabasz_score(&data, &labels)?;
let dunn_index = dunn_index(&data, &labels, metric)?;

// External metrics (com ground truth)
let ari = adjusted_rand_index(&true_labels, &pred_labels)?;
let ami = adjusted_mutual_info(&true_labels, &pred_labels)?;
let nmi = normalized_mutual_info(&true_labels, &pred_labels)?;
let fowlkes_mallows = fowlkes_mallows_score(&true_labels, &pred_labels)?;
let homogeneity = homogeneity_score(&true_labels, &pred_labels)?;
let completeness = completeness_score(&true_labels, &pred_labels)?;
let v_measure = v_measure_score(&true_labels, &pred_labels)?;

// Stability analysis
let stability = clustering_stability(&data, &algorithm, n_bootstraps=100)?;

Features Científicas Únicas:

1. Clustering 4D Nativo:

// Dados espaço-temporais do LISA
let events_4d = Tensor4D::from_shape_vec((1000, 3, 64, 64), raw_data)?;

let clustering = SpaceTimeKMeans::new(10)
    .temporal_weight(0.3)  // Peso da dimensão temporal
    .spatial_metric(Metric::Geodesic)
    .fit(&events_4d)?;

2. Clustering em Variedades Curvas:

// Eventos em espaço-tempo curvo (perto de buracos negros)
let manifold = Manifold::Kerr { mass: 1e6, spin: 0.9 };

let clustering = ManifoldDBSCAN::new(0.1, 5)
    .manifold(manifold)
    .fit(&curved_space_events)?;

3. Streaming Clustering:

// Clustering incremental para dados LISA em tempo real
let mut stream_kmeans = StreamingKMeans::new(5)
    .decay_factor(0.95)
    .time_window(Duration::minutes(60));

for batch in data_stream {
    stream_kmeans.partial_fit(&batch)?;
    let current_centroids = stream_kmeans.centroids();
}

4. Multi-View Clustering:

// Múltiplas representações dos mesmos dados
let views = vec![spectral_data, spatial_data, temporal_data];

let mvc = MultiViewClustering::new(3)
    .method(MVCMethod::CoRegularized)
    .fit(&views)?;

5. GPU Acceleration Completo:

// Transparente CUDA/ROCm
let kmeans = KMeans::new(1000)
    .gpu(true)
    .gpu_batch_size(10_000)
    .fit(&billion_points)?;  // Bilhões de pontos

2. avila-reduction - Redução Dimensional de Última Geração

Objetivo: Superar scikit-learn, UMAP-learn, e OpenTSNE em algoritmos, velocidade e capacidades científicas.

Algoritmos Completos:

Linear:

  • PCA (SVD, Randomized, Incremental, Kernel, Sparse)
  • LDA (Linear Discriminant Analysis)
  • ICA (Independent Component Analysis)
  • Factor Analysis
  • NMF (Non-negative Matrix Factorization)
  • Truncated SVD (LSA)

Manifold Learning:

  • t-SNE (Barnes-Hut, FIt-SNE, FFT-accelerated)
  • UMAP (Uniform Manifold Approximation)
  • Isomap (Geodesic distances)
  • LLE (Locally Linear Embedding)
  • Laplacian Eigenmaps
  • MDS (Multidimensional Scaling)
  • LTSA (Local Tangent Space Alignment)
  • Hessian LLE

Autoencoders:

  • Variational Autoencoder (VAE)
  • Contractive Autoencoder
  • Sparse Autoencoder
  • Denoising Autoencoder

Outros:

  • Dictionary Learning
  • Random Projection (Johnson-Lindenstrauss)
  • Feature Agglomeration

API Revolucionária:

use avila_reduction::prelude::*;

// PCA com todas as features
let pca = PCA::builder()
    .n_components(ComponentSpec::Variance(0.95))  // 95% variância
    .svd_solver(SVDSolver::Randomized)  // Mais rápido
    .whiten(true)
    .iterated_power(3)
    .random_state(42)
    .build();

let result = pca.fit(&data)?;
let reduced = result.transform(&data)?;
let components = result.components();  // Autovetores
let explained_var = result.explained_variance_ratio();
let singular_values = result.singular_values();

// Reconstrução
let reconstructed = result.inverse_transform(&reduced)?;
let reconstruction_error = (data - reconstructed).mapv(|x| x.powi(2)).mean();

// Incremental PCA para big data
let mut ipca = IncrementalPCA::new(50)
    .batch_size(1000);

for batch in data.axis_chunks_iter(Axis(0), 1000) {
    ipca.partial_fit(&batch)?;
}
let reduced = ipca.transform(&data)?;

// t-SNE otimizado (Barnes-Hut O(n log n))
let tsne = TSNE::builder()
    .n_components(2)
    .perplexity(30.0)
    .learning_rate(200.0)
    .n_iter(1000)
    .early_exaggeration(12.0)
    .metric(Metric::Euclidean)
    .method(TSNEMethod::BarnesHut { theta: 0.5 })
    .n_jobs(-1)  // Todos os cores
    .random_state(42)
    .build();

let embedding = tsne.fit_transform(&data)?;

// Monitorar convergência
tsne.fit_transform_with_callback(&data, |iter, kl_div| {
    println!("Iteration {}: KL divergence = {}", iter, kl_div);
})?;

// UMAP (mais rápido e escalável que t-SNE)
let umap = UMAP::builder()
    .n_components(2)
    .n_neighbors(15)
    .min_dist(0.1)
    .metric(Metric::Cosine)
    .spread(1.0)
    .set_op_mix_ratio(1.0)
    .local_connectivity(1.0)
    .repulsion_strength(1.0)
    .negative_sample_rate(5)
    .transform_queue_size(4.0)
    .random_state(42)
    .build();

let embedding = umap.fit_transform(&data)?;

// UMAP incremental
let new_embedding = umap.transform(&new_data)?;

// Kernel PCA (não-linear)
let kpca = KernelPCA::builder()
    .n_components(10)
    .kernel(Kernel::RBF { gamma: 0.1 })
    .eigen_solver(EigenSolver::Dense)
    .remove_zero_eig(true)
    .build();

let reduced = kpca.fit_transform(&data)?;

// Autoencoders (deep learning)
let autoencoder = Autoencoder::builder()
    .encoder_layers(&[784, 512, 256, 128, 64])
    .decoder_layers(&[64, 128, 256, 512, 784])
    .activation(Activation::ReLU)
    .latent_dim(64)
    .optimizer(Optimizer::Adam { lr: 0.001 })
    .epochs(50)
    .batch_size(128)
    .build();

let model = autoencoder.fit(&data)?;
let encoded = model.encode(&data)?;
let decoded = model.decode(&encoded)?;

Features Científicas Revolucionárias:

1. PCA em Tensores 4D Nativos:

// Redução 4D → 3D preservando estrutura espaço-temporal
let pca_4d = PCA4D::new(3)
    .preserve_temporal_coherence(true)
    .fit(&tensor_4d)?;

let reduced_3d = pca_4d.transform(&tensor_4d)?;  // Ainda é Tensor3D

2. Manifold Learning em Variedades Curvas:

// t-SNE em espaço-tempo curvo
let manifold = Manifold::Schwarzschild { mass: 1e6 };

let tsne_manifold = ManifoldTSNE::new(2)
    .manifold(manifold)
    .geodesic_distance(true)
    .fit_transform(&curved_data)?;

3. Redução com Preservação de Física:

// Preservar conservação de energia/momento
let physics_pca = PhysicsAwarePCA::new(10)
    .conserve(ConservationLaw::Energy)
    .conserve(ConservationLaw::Momentum)
    .fit(&particle_trajectories)?;

4. Time Series Dimensionality Reduction:

// Redução temporal com DTW
let tsr = TimeSeriesReduction::new(5)
    .alignment_method(AlignmentMethod::DTW)
    .preserve_periodicity(true)
    .fit(&time_series_matrix)?;

5. Hierarchical Reduction:

// Redução em múltiplos níveis
let hr = HierarchicalReduction::builder()
    .levels(vec![1000, 100, 10, 3])  // 10000D → 1000D → 100D → 10D → 3D
    .methods(vec![
        ReductionMethod::RandomProjection,
        ReductionMethod::PCA,
        ReductionMethod::UMAP,
        ReductionMethod::TSNE,
    ])
    .build();

let final_embedding = hr.fit_transform(&ultra_high_dim_data)?;

6. Streaming Reduction:

// PCA incremental para dados LISA em tempo real
let mut streaming_pca = StreamingPCA::new(50)
    .decay_factor(0.99)
    .update_frequency(1000);  // A cada 1000 amostras

for batch in data_stream {
    let reduced = streaming_pca.partial_fit_transform(&batch)?;
    // Use reduced...
}

7. Multi-Modal Reduction:

// Múltiplas modalidades (imagem + espectro + texto)
let modalities = vec![
    (image_data, ModalityType::Image),
    (spectral_data, ModalityType::Spectral),
    (text_embeddings, ModalityType::Text),
];

let mmr = MultiModalReduction::new(3)
    .fusion_method(FusionMethod::CCA)  // Canonical Correlation Analysis
    .fit_transform(&modalities)?;

8. Interpretable Reduction:

// PCA com interpretação física
let ipca = InterpretablePCA::new(10)
    .feature_names(&["mass1", "mass2", "spin1", "spin2", "distance"])
    .fit(&data)?;

// Ver contribuição de cada feature
let contributions = ipca.feature_contributions();
ipca.plot_loadings().save("loadings.png")?;

Visualização Integrada:

// Plot automático de embeddings
embedding.plot()
    .color_by(&labels)
    .size_by(&masses)
    .title("Black Hole Merger Events")
    .interactive(true)  // Plotly-like
    .save("embedding.html")?;

// Animação temporal
embedding_over_time.animate()
    .fps(30)
    .save("evolution.mp4")?;

GPU Acceleration Total:

// CUDA/ROCm transparente
let pca = PCA::new(100)
    .gpu(true)
    .gpu_memory_limit(GBytes(8))
    .fit(&huge_data)?;  // 100GB+ em GPU

let umap = UMAP::new(2)
    .gpu(true)
    .rapids_compat(true)  // Compatível com cuML
    .fit_transform(&data)?;

Arquitetura Unificada:

avila-clustering/
├── src/
│   ├── lib.rs
│   ├── algorithms/
│   │   ├── kmeans.rs
│   │   ├── hdbscan.rs
│   │   ├── gmm.rs
│   │   ├── spectral.rs
│   │   ├── hierarchical.rs
│   │   └── streaming.rs
│   ├── metrics/
│   │   ├── distance.rs       // 20+ métricas
│   │   ├── validation.rs     // Silhouette, etc
│   │   └── manifold.rs       // Geodésicas
│   ├── scientific/
│   │   ├── spacetime.rs      // 4D clustering
│   │   ├── curved.rs         // Variedades
│   │   └── physics.rs        // Leis de conservação
│   ├── gpu/
│   │   ├── cuda.rs
│   │   └── rocm.rs
│   └── prelude.rs

avila-reduction/
├── src/
│   ├── lib.rs
│   ├── linear/
│   │   ├── pca.rs
│   │   ├── ica.rs
│   │   ├── nmf.rs
│   │   └── lda.rs
│   ├── manifold/
│   │   ├── tsne.rs
│   │   ├── umap.rs
│   │   ├── isomap.rs
│   │   └── lle.rs
│   ├── neural/
│   │   ├── autoencoder.rs
│   │   ├── vae.rs
│   │   └── contractive.rs
│   ├── scientific/
│   │   ├── tensor4d.rs
│   │   ├── timeseries.rs
│   │   ├── physics_aware.rs
│   │   └── multimodal.rs
│   ├── streaming/
│   │   └── incremental.rs
│   ├── gpu/
│   │   ├── cuda.rs
│   │   └── rocm.rs
│   └── prelude.rs

Benchmarks Obrigatórios:

| Algorithm           | scikit-learn | RAPIDS | avila  |
| ------------------- | ------------ | ------ | ------ |
| KMeans (10M, 100D)  | 45s          | 2.1s   | < 1.5s |
| HDBSCAN (1M, 50D)   | 180s         | 8s     | < 5s   |
| PCA (100K, 10K→100) | 25s          | 1.2s   | < 1.0s |
| t-SNE (50K, 7842)  | 450s         | 12s    | < 8s   |
| UMAP (100K, 10002) | 35s          | 3s     | < 2s   |

Essas duas bibliotecas vão revolucionar clustering e dimensionality reduction. Crie TUDO isso!" 🚀💥

Dependencies

~14–50MB
~778K SLoC