dimension drops to 0 when number of samples is high #6


Closed
EtienneDavid opened this issue Jun 3, 2021 · 4 comments

Comments

@EtienneDavid

Hello, and thanks for the package!

I am currently running a few tests, and when I run an estimator on large datasets, the returned dimension is 0. When I lower the number of samples, it returns a higher dimension. Any idea why this happens?

Best,

Etienne

@j-bac
Collaborator

j-bac commented Jun 3, 2021

Could you share some code to reproduce this? And which estimator has this issue?

@karthikviswanathn
Contributor

I observed a similar issue where outliers decreased the estimated intrinsic dimension. Perhaps this happens because outliers inflate the largest eigenvalues, so the explained-variance threshold is reached with fewer components, leading to a lower dimension estimate.

Here's a minimal reproducible example using swissRoll3Sph, showing that lPCA is sensitive to a single outlier:

import numpy as np
from skdim.datasets import swissRoll3Sph
from skdim.id import lPCA

data2 = swissRoll3Sph(n_swiss=4000, n_sphere=2000, h=2, random_state=0)

# Clean ID estimate
lpca = lPCA()
print("Clean ID:", lpca.fit(data2).dimension_) # Clean ID: 3

# Add single distant outlier
data_corrupt = np.vstack([data2, np.array([[1000, 0, 0, 0]])])
print("Corrupt ID:", lPCA().fit(data_corrupt).dimension_) # Corrupt ID: 1
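The mechanism can be seen without skdim at all: one distant point inflates the top covariance eigenvalue, so a cumulative-explained-variance rule keeps only one component. Below is a numpy-only sketch; the threshold rule (keep components until at least 1 − alpha of the variance is explained) is one common lPCA convention and not necessarily skdim's exact default.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: an isotropic 3D Gaussian cloud embedded in 4D (intrinsic dimension 3)
clean = np.zeros((2000, 4))
clean[:, :3] = rng.normal(size=(2000, 3))

def id_by_variance_ratio(X, alpha=0.05):
    """Count components until cumulative explained variance reaches 1 - alpha.

    This is one common lPCA-style convention, used here only to illustrate
    the mechanism; skdim's exact thresholding may differ.
    """
    eigvals = np.sort(np.linalg.eigvalsh(np.cov(X.T)))[::-1]  # descending
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, 1 - alpha) + 1)

print("Clean ID:", id_by_variance_ratio(clean))

# A single distant outlier dominates the covariance along one axis,
# so the first component alone crosses the variance threshold.
corrupt = np.vstack([clean, [[1000.0, 0.0, 0.0, 0.0]]])
print("Corrupt ID:", id_by_variance_ratio(corrupt))
```

The outlier contributes roughly (1000)²/n to the variance of the first coordinate, which swamps the unit-scale eigenvalues of the clean cloud.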

@j-bac
Collaborator

j-bac commented Apr 30, 2025

Thanks for reporting. One possibility would be to add an option to use robust PCA.

With that said, lPCA has a bit of a confusing API. I kept it as a GlobalEstimator, which means .fit applies PCA to the entire dataset, because this is what most people do when using PCA.

But the original paper referenced uses it as a local PCA applied to the kNN around each point (.fit_pw). These pointwise estimates can then be averaged to get a global estimate, as is done for the LocalEstimator class, which is likely much more stable to these kinds of issues. See https://scikit-dimension.readthedocs.io/en/latest/basics.html
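A numpy-only sketch of the pointwise idea (this is an illustration of the principle, not skdim's actual .fit_pw implementation; the brute-force distance matrix and the variance-ratio rule are simplifying assumptions): each point's ID is estimated from PCA on its k nearest neighbors, so a distant outlier only distorts its own neighborhood rather than the global estimate.

```python
import numpy as np

def pointwise_pca_id(X, k=25, alpha=0.05):
    """Estimate an ID per point via PCA on its k nearest neighbors.

    Hypothetical sketch: brute-force distances, and components are counted
    until 1 - alpha of the local variance is explained.
    """
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    ids = np.empty(n)
    for i in range(n):
        nbrs = X[np.argsort(d2[i])[:k]]  # k nearest neighbors (incl. the point)
        ev = np.sort(np.linalg.eigvalsh(np.cov(nbrs.T)))[::-1]
        ratios = np.cumsum(ev) / ev.sum()
        ids[i] = np.searchsorted(ratios, 1 - alpha) + 1
    return ids

rng = np.random.default_rng(0)
plane = np.c_[rng.uniform(size=(500, 2)), np.zeros(500)]  # 2D plane in 3D
corrupt = np.vstack([plane, [[1000.0, 0.0, 0.0]]])

ids = pointwise_pca_id(corrupt)
print("mean pointwise ID:", ids.mean())  # stays near 2 despite the outlier
```

Only the outlier's own neighborhood yields a degenerate estimate; averaging (or taking the median of) the pointwise values keeps the global estimate near the true dimension of 2, unlike the global fit in the example above.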

@karthikviswanathn
Contributor

karthikviswanathn commented Apr 30, 2025

Thanks for the clarification, and for pointing to the .fit_pw usage; it is likely to be more robust!

One final thought: when aggregating the pointwise ID estimates, the mean might not always be the best choice. Perhaps using cover sets (Algorithm 2 in [Fan2010]) would yield a more reliable (and computationally efficient) global estimate.

Thanks again for your time and for maintaining this very useful package!

@j-bac j-bac closed this as completed Apr 30, 2025