dimension drops to 0 when number of samples is high #6


Closed
EtienneDavid opened this issue Jun 3, 2021 · 4 comments

Comments

@EtienneDavid

Hello, and thanks for the package!

I am currently running a few tests, and when I run an estimator on large datasets, the returned dimension is 0. When I lower the number of samples, it returns a higher dimension. Any idea why this happens?

Best,

Etienne

@j-bac
Collaborator

j-bac commented Jun 3, 2021

Could you share some code to reproduce this? And which estimator has this issue?

@karthikviswanathn
Contributor

I observed a similar issue where outliers decreased the estimated intrinsic dimension. Perhaps this happens because outliers inflate the largest eigenvalues, so the explained-variance threshold is reached with fewer components, leading to a lower dimension estimate.

Here's a minimal reproducible example using swissRoll3Sph, showing that lPCA is sensitive to a single outlier:

import numpy as np
from skdim.datasets import swissRoll3Sph
from skdim.id import lPCA

data2 = swissRoll3Sph(n_swiss=4000, n_sphere=2000, h=2, random_state=0)

# Clean ID estimate
lpca = lPCA()
print("Clean ID:", lpca.fit(data2).dimension_) # Clean ID: 3

# Add single distant outlier
data_corrupt = np.vstack([data2, np.array([[1000, 0, 0, 0]])])
print("Corrupt ID:", lPCA().fit(data_corrupt).dimension_) # Corrupt ID: 1
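The mechanism can be seen without skdim at all: one distant point inflates the top covariance eigenvalue, so a cumulative-explained-variance rule keeps only one component. Below is a numpy-only sketch; the threshold rule (keep components until at least 1 − alpha of the variance is explained) is one common lPCA convention and not necessarily skdim's exact default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: an isotropic 3D Gaussian cloud embedded in 4D (intrinsic dimension 3)
clean = np.zeros((2000, 4))
clean[:, :3] = rng.normal(size=(2000, 3))

def id_by_variance_ratio(X, alpha=0.05):
    """Count components until cumulative explained variance reaches 1 - alpha.

    This is one common lPCA-style convention, used here only to illustrate
    the mechanism; skdim's exact thresholding may differ.
    """
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]  # descending
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, 1 - alpha) + 1)

print("Clean ID:", id_by_variance_ratio(clean))

# A single distant outlier dominates the covariance along one axis,
# so the first component alone crosses the variance threshold.
corrupt = np.vstack([clean, [[1000.0, 0.0, 0.0, 0.0]]])
print("Corrupt ID:", id_by_variance_ratio(corrupt))
```

The outlier contributes roughly (1000)²/n to the variance of the first coordinate, which swamps the unit-scale eigenvalues of the clean cloud.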

@j-bac
Collaborator

j-bac commented Apr 30, 2025

Thanks for reporting. One possibility would be to add an option to use robust PCA.

With that said, lPCA has a bit of a confusing API. I kept it as a GlobalEstimator, which means .fit applies PCA to the entire dataset, because this is what most people do when using PCA.

But the original paper referenced uses it as a local PCA applied to the kNN around each point (.fit_pw). These pointwise estimates can then be averaged to get a global estimate, as is done for the LocalEstimator class, which is likely much more stable to these kinds of issues. See https://scikit-dimension.readthedocs.io/en/latest/basics.html
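A numpy-only sketch of the pointwise idea (this is an illustration of the principle, not skdim's actual .fit_pw implementation; the brute-force distance matrix and the variance-ratio rule are simplifying assumptions): each point's ID is estimated from PCA on its k nearest neighbors, so a distant outlier only distorts its own neighborhood rather than the global estimate.

```python
import numpy as np

def pointwise_pca_id(X, k=25, alpha=0.05):
    """Estimate an ID per point via PCA on its k nearest neighbors.

    Hypothetical sketch: brute-force distances, and components are counted
    until 1 - alpha of the local variance is explained.
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    ids = np.empty(n)
    for i in range(n):
        nbrs = X[np.argsort(d2[i])[:k]]  # k nearest neighbors (incl. the point)
        ev = np.sort(np.linalg.eigvalsh(np.cov(nbrs.T)))[::-1]
        ratios = np.cumsum(ev) / ev.sum()
        ids[i] = np.searchsorted(ratios, 1 - alpha) + 1
    return ids

rng = np.random.default_rng(0)
plane = np.c_[rng.uniform(size=(500, 2)), np.zeros(500)]  # 2D plane in 3D
corrupt = np.vstack([plane, [[1000.0, 0.0, 0.0]]])

ids = pointwise_pca_id(corrupt)
print("mean pointwise ID:", ids.mean())  # stays near 2 despite the outlier
```

Only the outlier's own neighborhood yields a degenerate estimate; averaging (or taking the median of) the pointwise values keeps the global estimate near the true dimension of 2, unlike the global fit in the example above.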

@karthikviswanathn
Contributor

karthikviswanathn commented Apr 30, 2025

Thanks for the clarification, and for pointing to the .fit_pw usage; it is likely to be more robust!

One final thought: when aggregating the pointwise ID estimates, the mean might not always be the best choice. Perhaps using cover sets (Algorithm 2 in [Fan2010]) would yield a more reliable (and computationally efficient) global estimate.

Thanks again for your time and for maintaining this very useful package!

@j-bac j-bac closed this as completed Apr 30, 2025