Skip to content

LoadImages not warning that input column is empty and ignoring imageFolder parameter in such a case #4429

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
antoniovs1029 opened this issue Nov 1, 2019 · 7 comments · Fixed by #4691
Assignees
Labels
bug Something isn't working P1 Priority of the issue for triage purpose: Needs to be fixed soon.

Comments

@antoniovs1029
Copy link
Member

antoniovs1029 commented Nov 1, 2019

I don't know if this is actually an issue, but it's something that I noticed when working with the LoadImages method, and perhaps it is necessary to warn the user that this could happen, whether at runtime, or in the documentation.

Issue

As I show in the source code I provide below, if all the values of the input column of a LoadImages transform are empty when fitting a pipeline, then the code will still run and not give any warning whatsoever, even though no image is actually loaded to train the model. The transform would also appear to work when transforming an input Data View which uses an empty column as input of the LoadImages, and, in the example I provide, the pipeline would still assign a predicted label to each row of the input data view bein transformed, even if no image was actually loaded.

I show this by exemplifying two main cases in which this could happen:

  1. When the user loads its data through a method such as LoadFromTextFile, with an input file that only has 2 columns, but the user specifies that the ImagePath column (to be used as input column of the LoadImages method) is the 3rd column inside of the file. This kind of scenario could happen if the user makes a typo in the ModelInput class, or if the user (perhaps mistakenly) passes an input file that doesn't contain an image path column.

  2. When the user loads its data through a method such as LoadFromEnumerable from an array where all of the objects provide either a null or an empty string value to the ImagePath column.

Also notice in my code that in both cases the LoadImages transform also ignores whatever is passed as the imageFolder parameter, since because there are actually no ImagePaths, it will never try to load images. If there was at least one ImagePath in the input dataview, then LoadImages would actually try to load that image using the imageFolder parameter, and an exception is correctly thrown if the folder doesn't exist.

Why is this a problem?

  • I would understand this behavior happening if the input data view doesn't provide an image path for some of the rows, specially when working with big datasets. But I think it becomes a problem if there's actually no image loaded, and the whole thing appears to work without a warning, like in the example I provided. If a user unknowingly makes a mistake that leads to this problem, then s/he might believe that the model was actually correctly trained with actual images, or that it actually transformed an input where no imagePath was provided. This problem might be harder to spot in more complex pipelines or input files.

  • Also the fact that the imageFolder parameter gets ignored in this case seems odd to me, as I would have expected an exception to be thrown if a user passes an inexistent folder path to the LoadImages transformer, regardless of the content of the ImagePath column.

Source code and input file

Download solution:
solution.zip

Or look into the code in here:
https://gist.github.com/antoniovs1029/997ca183411f173e81a131f09722b092

@frank-dong-ms-zz frank-dong-ms-zz added bug Something isn't working P1 Priority of the issue for triage purpose: Needs to be fixed soon. labels Jan 9, 2020
@mstfbl mstfbl self-assigned this Jan 13, 2020
@mstfbl
Copy link
Contributor

mstfbl commented Jan 22, 2020

Hi Antonio,

PR #4665 should fix the bug you raise with the inputFilePath parameter in mlContext.Data.LoadFromTextFile being ignored. I'm currently working on another PR that fixes the other points you've raised.

@mstfbl
Copy link
Contributor

mstfbl commented Jan 22, 2020

Hi Antonio,

In addition, the LoadImages parameter here is being ignored when it is empty because of the snippet below:

/// <summary>
/// Create a <see cref="ImageLoadingEstimator"/>, which loads the data from the column specified in <paramref name="inputColumnName"/>
/// as an image to a new column: <paramref name="outputColumnName"/>.
/// </summary>
/// <param name="catalog">The transform's catalog.</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// This column's data type will be <see cref="System.Drawing.Bitmap"/>.</param>
/// <param name="inputColumnName">Name of the column with paths to the images to load.
/// This estimator operates over text data.</param>
/// <param name="imageFolder">Folder where to look for images.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[LoadImages](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/ImageAnalytics/LoadImages.cs)]
/// ]]></format>
/// </example>
public static ImageLoadingEstimator LoadImages(this TransformsCatalog catalog, string outputColumnName, string imageFolder, string inputColumnName = null)
=> new ImageLoadingEstimator(CatalogUtils.GetEnvironment(catalog), imageFolder, true, new[] { (outputColumnName, inputColumnName ?? outputColumnName) });

If inputColumnName is empty/null, then instead of inputColumnName being returned, it is instead outputColumnName that is being returned. The ?? syntax in C# first checks the value on the left, and returns that if it isn't null, and if it is null, it returns the value on the right (whether it is null or not). In this case, as inputColumnName is null, it returns outputColumnName.

@antoniovs1029
Copy link
Member Author

Hi @mstfbl . Thanks for taking a look into this. As I've commented on your PR, I am afraid I didn't explained clearly the issue. My problem isn't with the inputColumnName parameter being null, but rather with the content of the inputColumn being empty.

So if you take the sample I provided in the first post of this issue, and run it, and take transformedDataView2.Preview() the output is as follow for the first row (i.e. row number 0):

image
(notice that in this case the "inputColumn" is called "ImagePath")

As you can see ImagePath_featurized is null (because imagePath was empty in the input dataset), OriginalInput is an empty sparse vector, and then, somehow (for whatever reason I am not aware of) PreprocessedInput become a vector with actual values (I don't know where it get the values, since the original vector was empty) and then, by seeing the other columns, you can see that the PredictedLabel for this row is 'dog' (although, again, there wasn't any image to begin with).

If you look into the other rows (which also had an empty imagePath), you will see that they also get a 'dog' predictedLabel.


As I've mentioned in the first post, I understand this is desired when working with big datasets where some rows have empty values in their inputColumn, and it doesn't matter if they get assigned whatever label to them. But this becomes a problem if all the rows have an empty value on that column (as shown in the sample I provided) because then the trainer wouldn't actually train on any image, and the pipeline would still run without warning the user about this... then the user might take that "trained" model and use it elsewhere without knowing that it didn't actually train.

A solution to this might be to modify the ImageLoader.cs file content in order to count how many empty inputs where there, and then check if the number of rows is the same as the number of empty columns... if they're the same, then throw an exception, or at least show a warning.

@antoniovs1029
Copy link
Member Author

Also, I've just added a solution.zip file inside my first post in this issue, so to make it easier for you to debug this. Thanks!

@antoniovs1029
Copy link
Member Author

PR #4691 did fix the problem of throwing an exception when the imageFolder doesn't exist, but it didn't fix the problem of not warning that the input column had only empty values, so I am reopening the issue.

@mstfbl
Copy link
Contributor

mstfbl commented Feb 13, 2020

Hey @antoniovs1029 ,

I ran the sample you provided, and saw that the Path.GetFullPath("C:\\I-dont-exist") function does not actually throw an exception when it actually should. PR #4831 should fix this issue.

@mstfbl
Copy link
Contributor

mstfbl commented Mar 30, 2020

Closing this issue as #4831 is merged.

@mstfbl mstfbl closed this as completed Mar 30, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working P1 Priority of the issue for triage purpose: Needs to be fixed soon.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants