# # Finding and Removing Mislabels
#
# [Open in Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb)
# [Open in Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/examples/finding-removing-mislabels.ipynb)
# [Documentation](https://visual-layer.readme.io/docs/finding-removing-mislabels)
#
# This notebook shows how to quickly analyze an image dataset for potential image mislabels and export the list of mislabeled images for further inspection.
# ## Installation
# First, let's start with the installation:
#
# > ✅ **Tip** - If you're new to fastdup, we encourage you to run the notebook in [Google Colab](https://colab.research.google.com/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb) or [Kaggle](https://kaggle.com/kernels/welcome?src=https://github.com/visual-layer/fastdup/blob/main/quick-dataset-analysis.ipynb) for the best experience. If you'd like to just view and skim through the notebook, we recommend viewing using [nbviewer](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb).
#
#
# In[ ]:
get_ipython().system('pip install fastdup -Uq')
# Now, test the installation by printing out the version. If there's no error message, we are ready to go!
# In[1]:
import fastdup
fastdup.__version__
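# If the import itself fails, the installed version can also be read from the package metadata without importing fastdup. The cell below is a minimal sketch using only the standard library; the helper name `installed_version` is ours for illustration and is not part of fastdup.
# In[ ]:
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints None when the pip install above did not succeed.
print(installed_version("fastdup"))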
# ## Download Dataset
#
#
# In this notebook we use the widely available and relatively well-curated [Food-101](https://data.vision.ee.ethz.ch/cvl/datasets_extra/food-101/) dataset.
#
# The Food-101 dataset consists of 101 food classes with 1,000 images per class. That is a total of 101,000 images.
#
# > 🗒 **Note** - fastdup works on both unlabeled and labeled images. In this notebook we use the class labels to look for potentially mislabeled images.
# > For a broader analysis of annotation issues, head to:
# > + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb)
# > + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb).
#
#
# Let's download the dataset and extract it into the local directory:
# In[ ]:
get_ipython().system('wget http://data.vision.ee.ethz.ch/cvl/food-101.tar.gz')
get_ipython().system('tar -xf food-101.tar.gz')
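# Before moving on, it can help to confirm that the archive extracted as expected. The sketch below counts files per class folder so the totals can be compared with the 101 classes and 1,000 images per class described above. `count_files_per_class` is a hypothetical helper, and it assumes the archive extracts to `food-101/images/<class_name>/`.
# In[ ]:
from collections import Counter
from pathlib import Path


def count_files_per_class(images_root):
    """Map each class folder name under `images_root` to the number of files it contains."""
    counts = Counter()
    root = Path(images_root)
    if not root.is_dir():
        return counts  # nothing extracted yet
    for class_dir in sorted(root.iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for f in class_dir.iterdir() if f.is_file())
    return counts


class_counts = count_files_per_class("food-101/images/")
print(f"{len(class_counts)} classes, {sum(class_counts.values())} files")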
# ## Create Annotations DataFrame
#
# The Food-101 dataset stores its images in folders named after their class. Let's use the folder names to create a DataFrame with one row per image, holding the filename and its label.
# In[2]:
import os
import pandas as pd
dataset_dir = 'food-101/images/'
filenames = []
labels = []
# Iterate over the directory and subdirectories
for root, dirs, files in os.walk(dataset_dir):
    # Skip the root directory
    if root == dataset_dir:
        continue
    label = os.path.basename(root)
    for filename in files:
        filenames.append(os.path.join(root, filename))
        labels.append(label)
data = {'filename': filenames, 'label': labels}
df = pd.DataFrame(data)
df
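# `os.walk` picks up every file it finds, including stray non-image files such as `.DS_Store`. A minimal sketch of a stricter variant is shown below; the helper name `list_labeled_images` and the extension list are assumptions made for illustration, not part of fastdup.
# In[ ]:
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set; extend as needed


def list_labeled_images(images_root):
    """Return (filename, label) pairs, using each image's parent folder name as its label."""
    pairs = []
    top = os.path.normpath(images_root)
    for root, _dirs, files in os.walk(images_root):
        if os.path.normpath(root) == top:
            continue  # files directly under the root have no class folder
        label = os.path.basename(os.path.normpath(root))
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                pairs.append((os.path.join(root, name), label))
    return pairs


pairs = list_labeled_images("food-101/images/")
print(len(pairs), "labeled images found")
# To build the same DataFrame as above: pd.DataFrame(pairs, columns=["filename", "label"])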
# ## Run fastdup
#
# Once the extraction completes, we can run fastdup on the images.
#
# For that let's initialize fastdup and specify the input directory which points to the folder of images.
#
# > 🗒 **Note** - The `.create` method also has an optional `work_dir` parameter which specifies the directory to store artifacts from the run.
# >
# > For example, run `fastdup.create(input_dir="images/", work_dir="my_work_dir/")` to store the artifacts in `my_work_dir/`.
#
# Now, let's run fastdup.
# In[ ]:
fd = fastdup.create(input_dir="food-101/images/")
fd.run(annotations=df)
# In[44]:
outliers_df = fd.outliers()
# In[45]:
outliers_df
# In[46]:
outliers_df = outliers_df[['filename_outlier', 'filename_nearest', 'distance', 'label_outlier', 'label_nearest']]
outliers_df
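# One hedged way to narrow this table down to likely mislabels is to keep only the rows where the outlier and its nearest neighbor carry different labels. The sketch below demonstrates the filter on a tiny hand-made table that mimics the column names above. The rows are made up for illustration, and this heuristic is our assumption rather than a fastdup feature.
# In[ ]:
import pandas as pd

# Toy stand-in for outliers_df with invented values.
toy_outliers = pd.DataFrame({
    "filename_outlier": ["a.jpg", "b.jpg", "c.jpg"],
    "filename_nearest": ["x.jpg", "y.jpg", "z.jpg"],
    "distance": [0.41, 0.47, 0.52],
    "label_outlier": ["ramen", "pho", "sushi"],
    "label_nearest": ["pho", "pho", "sashimi"],
})

# Keep rows whose two labels disagree; these are candidates for manual review.
label_conflicts = toy_outliers[toy_outliers["label_outlier"] != toy_outliers["label_nearest"]]
print(label_conflicts[["filename_outlier", "label_outlier", "label_nearest"]])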
# Let's select the top 30 outliers and display them.
# In[47]:
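# Copy the slice so that adding preview columns later does not trigger a SettingWithCopyWarning.
outliers_df = outliers_df.head(30).copy()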
# In[48]:
import base64
from io import BytesIO
from PIL import Image
def resize_and_encode_image(image_path, width=100):
    # Resize to the requested width, keeping the aspect ratio, and encode as a base64 PNG.
    with Image.open(image_path) as img:
        wpercent = width / float(img.size[0])
        height = int(float(img.size[1]) * wpercent)
        resized_img = img.resize((width, height))
        buffered = BytesIO()
        resized_img.save(buffered, format="PNG")
        encoded_string = base64.b64encode(buffered.getvalue()).decode('utf-8')
    # Return an inline <img> tag so the styled table renders the thumbnail.
    return f'<img src="data:image/png;base64,{encoded_string}" width="{width}">'
def display_image(image_path, width=100):
    # Non-string entries (e.g. NaN) get an empty preview.
    if isinstance(image_path, str):
        return resize_and_encode_image(image_path, width)
    else:
        return ''
outliers_df['filename_outlier_preview'] = outliers_df['filename_outlier'].apply(lambda x: display_image(x, width=100))
outliers_df['filename_nearest_preview'] = outliers_df['filename_nearest'].apply(lambda x: display_image(x, width=100))
display(outliers_df.style)
# Now we can export the results to a CSV file for further analysis and correction of labels.
# In[49]:
outliers_df.drop(columns=['filename_outlier_preview', 'filename_nearest_preview']).to_csv('outliers.csv', index=False)
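# The exported file can then be loaded by any downstream tool. As a minimal sketch using only the standard library, the hypothetical helper below tallies how many exported outliers fall under each label, which gives a rough idea of where to start reviewing. It assumes the `label_outlier` column written above.
# In[ ]:
import csv
import os
from collections import Counter


def count_outliers_by_label(csv_path):
    """Count rows in the exported CSV per value of the `label_outlier` column."""
    with open(csv_path, newline="") as f:
        return Counter(row["label_outlier"] for row in csv.DictReader(f))


if os.path.exists("outliers.csv"):
    for label, n in count_outliers_by_label("outliers.csv").most_common(5):
        print(label, n)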
# ## Interactive Exploration
# In addition to the static visualizations presented above, fastdup also offers interactive exploration of the dataset.
#
# To explore the dataset and issues interactively in a browser, run:
# In[ ]:
fd.explore()
# > 🗒 **Note** - This currently requires you to sign up (for free) to view the interactive exploration. Alternatively, you can visualize the results in a non-interactive way using fastdup's built-in galleries.
#
# You'll be presented with a web interface that lets you conveniently view, filter, and curate your dataset.
#
# ## Wrap Up
#
# That's a wrap! In this notebook, we showed how to find potential mislabels in a labeled dataset and export them to a CSV file for review.
#
#
# Next, feel free to check out other tutorials:
#
# + ⚡ [**Quickstart**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/quick-dataset-analysis.ipynb): Learn how to install fastdup, load a dataset and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
# + 🧹 [**Clean Image Folder**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/cleaning-image-dataset.ipynb): Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
# + 🖼 [**Analyze Image Classification Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-image-classification-dataset.ipynb): Learn how to load a labeled image classification dataset and analyze for potential issues. If you have labeled ImageNet-style folder structure, have a go!
# + 🎁 [**Analyze Object Detection Dataset**](https://nbviewer.org/github/visual-layer/fastdup/blob/main/examples/analyzing-object-detection-dataset.ipynb): Learn how to load bounding box annotations for object detection and analyze for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.
#
# As usual, feedback is welcome! Questions? Drop by our [Slack channel](https://visualdatabase.slack.com/join/shared_invite/zt-19jaydbjn-lNDEDkgvSI1QwbTXSY6dlA#/shared-invite/email) or open an issue on [GitHub](https://github.com/visual-layer/fastdup/issues).
#
#