1 unstable release
| new 0.1.0 | Nov 22, 2025 |
|---|
#360 in Machine learning
Used in avila
84KB
1.5K
SLoC
"Crie as bibliotecas Rust de clustering e redução dimensional mais avançadas do ecossistema:
1. avila-clustering - Clustering de Última Geração
Objetivo: Superar scikit-learn, HDBSCAN, e RAPIDS cuML em algoritmos, performance e capacidades científicas.
Algoritmos Completos:
Partitional Clustering:
- KMeans (Lloyd, Elkan, Mini-batch, K-means++)
- KMeans++ initialization otimizado
- KMedoids (PAM, CLARA, CLARANS)
- Fuzzy C-Means (clustering probabilístico)
- Mean Shift (mode seeking)
- Affinity Propagation (message passing)
Density-Based:
- DBSCAN (clássico + variantes)
- HDBSCAN (hierarchical density-based)
- OPTICS (ordering points)
- DENCLUE (density estimation)
Hierarchical:
- Agglomerative (single, complete, average, ward, centroid)
- Divisive (top-down)
- BIRCH (incremental hierarchical)
Model-Based:
- Gaussian Mixture Models (GMM) (EM algorithm)
- Bayesian GMM (variational inference)
- Dirichlet Process GMM (infinite mixtures)
Graph-Based:
- Spectral Clustering (normalized cuts)
- Louvain (community detection)
- Leiden (improved Louvain)
- Label Propagation
Avançados:
- CURE (hierarchical + representatives)
- CHAMELEON (dynamic modeling)
- ROCK (categorical data)
- SUBCLU (subspace clustering)
- CLIQUE (grid-based subspace)
API Avançada:
use avila_clustering::prelude::*;
// KMeans com todas as features
let kmeans = KMeans::builder()
.n_clusters(5)
.init_method(InitMethod::KMeansPlusPlus)
.algorithm(Algorithm::Elkan) // Mais rápido
.max_iter(300)
.tolerance(1e-4)
.n_init(10) // Múltiplas inicializações
.random_state(42)
.parallel(true) // Rayon automático
.gpu(true) // CUDA/ROCm se disponível
.build();
let result = kmeans.fit(&data)?;
let labels = result.labels();
let centroids = result.centroids();
let inertia = result.inertia();
let iterations = result.n_iter();
// Predição incremental
let new_labels = kmeans.predict(&new_data)?;
// HDBSCAN para dados astronômicos
let hdbscan = HDBSCAN::builder()
.min_cluster_size(50)
.min_samples(5)
.cluster_selection_epsilon(0.5)
.metric(Metric::Euclidean)
.algorithm(Algorithm::BoruvkaKdTree) // O(n log n)
.prediction_data(true) // Soft clustering
.build();
let result = hdbscan.fit(&astronomical_events)?;
let labels = result.labels(); // -1 = noise
let probabilities = result.probabilities()?;
let outlier_scores = result.outlier_scores()?;
let condensed_tree = result.condensed_tree()?;
// GMM com seleção automática de componentes
let gmm = GaussianMixture::builder()
.n_components(1..=10) // Tenta de 1 a 10
.covariance_type(CovarianceType::Full)
.criterion(Criterion::BIC) // BIC ou AIC
.em_tolerance(1e-3)
.max_iter(100)
.n_init(5)
.build();
let result = gmm.fit(&data)?;
let best_n = result.n_components(); // Escolhido por BIC
let responsibilities = result.predict_proba(&data)?;
let densities = result.score_samples(&data)?; // log-likelihood
// Spectral Clustering para grafos
let spectral = SpectralClustering::builder()
.n_clusters(3)
.affinity(AffinityType::NearestNeighbors { n_neighbors: 10 })
.eigen_solver(EigenSolver::Arpack) // ou Lobpcg
.assign_labels(AssignMethod::KMeans)
.build();
let labels = spectral.fit_predict(&similarity_matrix)?;
// Hierarchical com dendrogram
let hierarchical = AgglomerativeClustering::builder()
.n_clusters(Some(5)) // None = não corta
.linkage(Linkage::Ward)
.distance_threshold(None)
.compute_full_tree(true)
.build();
let result = hierarchical.fit(&data)?;
let labels = result.labels();
let dendrogram = result.dendrogram()?;
dendrogram.plot().save("tree.png")?;
Métricas Científicas Avançadas:
// Distâncias clássicas
pub enum Metric {
Euclidean,
Manhattan,
Chebyshev,
Minkowski(f64),
Cosine,
Correlation,
// Científicas avançadas
Mahalanobis(Array2<f64>), // Com covariância
Geodesic(Manifold), // Em variedades curvas (GR)
SpectralAngle, // Astronomy (spectra comparison)
DynamicTimeWarping, // Time series alignment
Haversine, // Great-circle (Earth/celestial sphere)
// Strings/Sequences
Hamming,
Levenshtein,
JaroWinkler,
// Probabilísticas
KullbackLeibler,
JensenShannon,
Wasserstein,
Hellinger,
// Custom
Custom(Box<dyn Fn(&[f64], &[f64]) -> f64>),
}
// Uso
let dbscan = DBSCAN::new(0.5, 5)
.metric(Metric::Geodesic(Manifold::Schwarzschild { mass: 1e6 }));
let labels = dbscan.fit(&black_hole_coordinates)?;
Validação e Métricas de Qualidade:
use avila_clustering::metrics::*;
// Internal metrics (sem ground truth)
let silhouette = silhouette_score(&data, &labels, metric)?;
let davies_bouldin = davies_bouldin_score(&data, &labels)?;
let calinski_harabasz = calinski_harabasz_score(&data, &labels)?;
let dunn_index = dunn_index(&data, &labels, metric)?;
// External metrics (com ground truth)
let ari = adjusted_rand_index(&true_labels, &pred_labels)?;
let ami = adjusted_mutual_info(&true_labels, &pred_labels)?;
let nmi = normalized_mutual_info(&true_labels, &pred_labels)?;
let fowlkes_mallows = fowlkes_mallows_score(&true_labels, &pred_labels)?;
let homogeneity = homogeneity_score(&true_labels, &pred_labels)?;
let completeness = completeness_score(&true_labels, &pred_labels)?;
let v_measure = v_measure_score(&true_labels, &pred_labels)?;
// Stability analysis
let stability = clustering_stability(&data, &algorithm, n_bootstraps=100)?;
Features Científicas Únicas:
1. Clustering 4D Nativo:
// Dados espaço-temporais do LISA
let events_4d = Tensor4D::from_shape_vec((1000, 3, 64, 64), raw_data)?;
let clustering = SpaceTimeKMeans::new(10)
.temporal_weight(0.3) // Peso da dimensão temporal
.spatial_metric(Metric::Geodesic)
.fit(&events_4d)?;
2. Clustering em Variedades Curvas:
// Eventos em espaço-tempo curvo (perto de buracos negros)
let manifold = Manifold::Kerr { mass: 1e6, spin: 0.9 };
let clustering = ManifoldDBSCAN::new(0.1, 5)
.manifold(manifold)
.fit(&curved_space_events)?;
3. Streaming Clustering:
// Clustering incremental para dados LISA em tempo real
let mut stream_kmeans = StreamingKMeans::new(5)
.decay_factor(0.95)
.time_window(Duration::minutes(60));
for batch in data_stream {
stream_kmeans.partial_fit(&batch)?;
let current_centroids = stream_kmeans.centroids();
}
4. Multi-View Clustering:
// Múltiplas representações dos mesmos dados
let views = vec![spectral_data, spatial_data, temporal_data];
let mvc = MultiViewClustering::new(3)
.method(MVCMethod::CoRegularized)
.fit(&views)?;
5. GPU Acceleration Completo:
// Transparente CUDA/ROCm
let kmeans = KMeans::new(1000)
.gpu(true)
.gpu_batch_size(10_000)
.fit(&billion_points)?; // Bilhões de pontos
2. avila-reduction - Redução Dimensional de Última Geração
Objetivo: Superar scikit-learn, UMAP-learn, e OpenTSNE em algoritmos, velocidade e capacidades científicas.
Algoritmos Completos:
Linear:
- PCA (SVD, Randomized, Incremental, Kernel, Sparse)
- LDA (Linear Discriminant Analysis)
- ICA (Independent Component Analysis)
- Factor Analysis
- NMF (Non-negative Matrix Factorization)
- Truncated SVD (LSA)
Manifold Learning:
- t-SNE (Barnes-Hut, FIt-SNE, FFT-accelerated)
- UMAP (Uniform Manifold Approximation)
- Isomap (Geodesic distances)
- LLE (Locally Linear Embedding)
- Laplacian Eigenmaps
- MDS (Multidimensional Scaling)
- LTSA (Local Tangent Space Alignment)
- Hessian LLE
Autoencoders:
- Variational Autoencoder (VAE)
- Contractive Autoencoder
- Sparse Autoencoder
- Denoising Autoencoder
Outros:
- Dictionary Learning
- Random Projection (Johnson-Lindenstrauss)
- Feature Agglomeration
API Revolucionária:
use avila_reduction::prelude::*;
// PCA com todas as features
let pca = PCA::builder()
.n_components(ComponentSpec::Variance(0.95)) // 95% variância
.svd_solver(SVDSolver::Randomized) // Mais rápido
.whiten(true)
.iterated_power(3)
.random_state(42)
.build();
let result = pca.fit(&data)?;
let reduced = result.transform(&data)?;
let components = result.components(); // Autovetores
let explained_var = result.explained_variance_ratio();
let singular_values = result.singular_values();
// Reconstrução
let reconstructed = result.inverse_transform(&reduced)?;
let reconstruction_error = (data - reconstructed).mapv(|x| x.powi(2)).mean();
// Incremental PCA para big data
let mut ipca = IncrementalPCA::new(50)
.batch_size(1000);
for batch in data.axis_chunks_iter(Axis(0), 1000) {
ipca.partial_fit(&batch)?;
}
let reduced = ipca.transform(&data)?;
// t-SNE otimizado (Barnes-Hut O(n log n))
let tsne = TSNE::builder()
.n_components(2)
.perplexity(30.0)
.learning_rate(200.0)
.n_iter(1000)
.early_exaggeration(12.0)
.metric(Metric::Euclidean)
.method(TSNEMethod::BarnesHut { theta: 0.5 })
.n_jobs(-1) // Todos os cores
.random_state(42)
.build();
let embedding = tsne.fit_transform(&data)?;
// Monitorar convergência
tsne.fit_transform_with_callback(&data, |iter, kl_div| {
println!("Iteration {}: KL divergence = {}", iter, kl_div);
})?;
// UMAP (mais rápido e escalável que t-SNE)
let umap = UMAP::builder()
.n_components(2)
.n_neighbors(15)
.min_dist(0.1)
.metric(Metric::Cosine)
.spread(1.0)
.set_op_mix_ratio(1.0)
.local_connectivity(1.0)
.repulsion_strength(1.0)
.negative_sample_rate(5)
.transform_queue_size(4.0)
.random_state(42)
.build();
let embedding = umap.fit_transform(&data)?;
// UMAP incremental
let new_embedding = umap.transform(&new_data)?;
// Kernel PCA (não-linear)
let kpca = KernelPCA::builder()
.n_components(10)
.kernel(Kernel::RBF { gamma: 0.1 })
.eigen_solver(EigenSolver::Dense)
.remove_zero_eig(true)
.build();
let reduced = kpca.fit_transform(&data)?;
// Autoencoders (deep learning)
let autoencoder = Autoencoder::builder()
.encoder_layers(&[784, 512, 256, 128, 64])
.decoder_layers(&[64, 128, 256, 512, 784])
.activation(Activation::ReLU)
.latent_dim(64)
.optimizer(Optimizer::Adam { lr: 0.001 })
.epochs(50)
.batch_size(128)
.build();
let model = autoencoder.fit(&data)?;
let encoded = model.encode(&data)?;
let decoded = model.decode(&encoded)?;
Features Científicas Revolucionárias:
1. PCA em Tensores 4D Nativos:
// Redução 4D → 3D preservando estrutura espaço-temporal
let pca_4d = PCA4D::new(3)
.preserve_temporal_coherence(true)
.fit(&tensor_4d)?;
let reduced_3d = pca_4d.transform(&tensor_4d)?; // Ainda é Tensor3D
2. Manifold Learning em Variedades Curvas:
// t-SNE em espaço-tempo curvo
let manifold = Manifold::Schwarzschild { mass: 1e6 };
let tsne_manifold = ManifoldTSNE::new(2)
.manifold(manifold)
.geodesic_distance(true)
.fit_transform(&curved_data)?;
3. Redução com Preservação de Física:
// Preservar conservação de energia/momento
let physics_pca = PhysicsAwarePCA::new(10)
.conserve(ConservationLaw::Energy)
.conserve(ConservationLaw::Momentum)
.fit(&particle_trajectories)?;
4. Time Series Dimensionality Reduction:
// Redução temporal com DTW
let tsr = TimeSeriesReduction::new(5)
.alignment_method(AlignmentMethod::DTW)
.preserve_periodicity(true)
.fit(&time_series_matrix)?;
5. Hierarchical Reduction:
// Redução em múltiplos níveis
let hr = HierarchicalReduction::builder()
.levels(vec![1000, 100, 10, 3]) // 10000D → 1000D → 100D → 10D → 3D
.methods(vec![
ReductionMethod::RandomProjection,
ReductionMethod::PCA,
ReductionMethod::UMAP,
ReductionMethod::TSNE,
])
.build();
let final_embedding = hr.fit_transform(&ultra_high_dim_data)?;
6. Streaming Reduction:
// PCA incremental para dados LISA em tempo real
let mut streaming_pca = StreamingPCA::new(50)
.decay_factor(0.99)
.update_frequency(1000); // A cada 1000 amostras
for batch in data_stream {
let reduced = streaming_pca.partial_fit_transform(&batch)?;
// Use reduced...
}
7. Multi-Modal Reduction:
// Múltiplas modalidades (imagem + espectro + texto)
let modalities = vec![
(image_data, ModalityType::Image),
(spectral_data, ModalityType::Spectral),
(text_embeddings, ModalityType::Text),
];
let mmr = MultiModalReduction::new(3)
.fusion_method(FusionMethod::CCA) // Canonical Correlation Analysis
.fit_transform(&modalities)?;
8. Interpretable Reduction:
// PCA com interpretação física
let ipca = InterpretablePCA::new(10)
.feature_names(&["mass1", "mass2", "spin1", "spin2", "distance"])
.fit(&data)?;
// Ver contribuição de cada feature
let contributions = ipca.feature_contributions();
ipca.plot_loadings().save("loadings.png")?;
Visualização Integrada:
// Plot automático de embeddings
embedding.plot()
.color_by(&labels)
.size_by(&masses)
.title("Black Hole Merger Events")
.interactive(true) // Plotly-like
.save("embedding.html")?;
// Animação temporal
embedding_over_time.animate()
.fps(30)
.save("evolution.mp4")?;
GPU Acceleration Total:
// CUDA/ROCm transparente
let pca = PCA::new(100)
.gpu(true)
.gpu_memory_limit(GBytes(8))
.fit(&huge_data)?; // 100GB+ em GPU
let umap = UMAP::new(2)
.gpu(true)
.rapids_compat(true) // Compatível com cuML
.fit_transform(&data)?;
Arquitetura Unificada:
avila-clustering/
├── src/
│ ├── lib.rs
│ ├── algorithms/
│ │ ├── kmeans.rs
│ │ ├── hdbscan.rs
│ │ ├── gmm.rs
│ │ ├── spectral.rs
│ │ ├── hierarchical.rs
│ │ └── streaming.rs
│ ├── metrics/
│ │ ├── distance.rs // 20+ métricas
│ │ ├── validation.rs // Silhouette, etc
│ │ └── manifold.rs // Geodésicas
│ ├── scientific/
│ │ ├── spacetime.rs // 4D clustering
│ │ ├── curved.rs // Variedades
│ │ └── physics.rs // Leis de conservação
│ ├── gpu/
│ │ ├── cuda.rs
│ │ └── rocm.rs
│ └── prelude.rs
avila-reduction/
├── src/
│ ├── lib.rs
│ ├── linear/
│ │ ├── pca.rs
│ │ ├── ica.rs
│ │ ├── nmf.rs
│ │ └── lda.rs
│ ├── manifold/
│ │ ├── tsne.rs
│ │ ├── umap.rs
│ │ ├── isomap.rs
│ │ └── lle.rs
│ ├── neural/
│ │ ├── autoencoder.rs
│ │ ├── vae.rs
│ │ └── contractive.rs
│ ├── scientific/
│ │ ├── tensor4d.rs
│ │ ├── timeseries.rs
│ │ ├── physics_aware.rs
│ │ └── multimodal.rs
│ ├── streaming/
│ │ └── incremental.rs
│ ├── gpu/
│ │ ├── cuda.rs
│ │ └── rocm.rs
│ └── prelude.rs
Benchmarks Obrigatórios:
| Algorithm | scikit-learn | RAPIDS | avila |
| ------------------- | ------------ | ------ | ------ |
| KMeans (10M, 100D) | 45s | 2.1s | < 1.5s |
| HDBSCAN (1M, 50D) | 180s | 8s | < 5s |
| PCA (100K, 10K→100) | 25s | 1.2s | < 1.0s |
| t-SNE (50K, 784→2) | 450s | 12s | < 8s |
| UMAP (100K, 1000→2) | 35s | 3s | < 2s |
Essas duas bibliotecas vão revolucionar clustering e dimensionality reduction. Crie TUDO isso!" 🚀💥
Dependencies
~14–50MB
~778K SLoC