Problem with ML.NET RobustScaler #5237

Closed
CBrauer opened this issue Jun 12, 2020 · 4 comments

CBrauer commented Jun 12, 2020

System information

  • Windows 10 Enterprise 10.0 18363 Built 18363
  • Visual Studio 2019, build 16.6.2

Program output

Notice that RobustScaler produced an extra column for "vwapGain":

[image: program output]

Source code

My test program looks like:

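using System;
using System.Collections.Generic;
using Microsoft.Data.Analysis;  // DataFrame
using Microsoft.ML;             // MLContext, IDataView, Preview()

namespace Test_RobustScaller {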
  internal class Program {
    #region MyHead
    public static void MyHead(IDataView train, int numRows) {
      var trainPreview = train.Preview(maxRows: numRows);
      var nColumns = trainPreview.ColumnView.Length;
      var maxCharInHeaderName = 0;
      for (var k = 0; k < nColumns; k++) {
        var columnName = trainPreview.Schema[k].Name;
        maxCharInHeaderName = Math.Max(maxCharInHeaderName, columnName.Length);
      }
      var nSpaces = new int[nColumns];
      for (var k = 0; k < nColumns; k++) {
        var columnName = trainPreview.Schema[k].Name;
        for (var j = 0; j < maxCharInHeaderName - columnName.Length + 1; j++) {
          Console.Write(" ");
        }
        Console.Write("{0}", columnName);
        nSpaces[k] = maxCharInHeaderName - columnName.Length + 1;
      }
      Console.Write("\n");

      foreach (var row in trainPreview.RowView) {
        for (var k = 0; k < row.Values.Length; k++) {
          var field = string.Format("{0}", row.Values[k].Value);
          var nSpace = maxCharInHeaderName - field.Length + 1;
          for (var j = 0; j < nSpace; j++) {
            Console.Write(" ");
          }
          Console.Write(row.Values[k].Value);
        }
        Console.Write("\n");
      }

      Console.Write("\n");
    }
    #endregion
    public static void Run() {
      var mlContext = new MLContext(seed: 1);

      var df_full = DataFrame.LoadCsv("../../../data/model.csv");

      var header_names = new List<string> {
        "BoxRatio", "Thrust", "Acceleration", "Velocity",
        "OnBalRun", "vwapGain", "Altitude"
      };
      var nColumns = header_names.Count;
      var df_columns = new DataFrameColumn[nColumns];
      for (var k = 0; k < nColumns; k++) {
        var name = header_names[k];
        df_columns[k] = df_full.Columns[name];
      }

      var df = new DataFrame(df_columns);
      Console.WriteLine("Before transform:");
      Console.WriteLine(df.Head(5));

      var pipeline = mlContext.Transforms.RobustScaler("vwapGain");
      var model = pipeline.Fit(df);
      var transformed = model.Transform(df);
      Console.WriteLine("After Transform:");
      MyHead(transformed, 5);
    }

    static void Main() {
      Run();
      Console.WriteLine("Hit return to exit.");
      Console.ReadKey();
    }
  }
}

Charles

mstfbl self-assigned this Jun 12, 2020

mstfbl (Contributor) commented Jun 13, 2020

Hi @CBrauer ,

Thank you for reporting this issue. I see that your code uses outdated APIs. I recommend using:

  • mlContext.Transforms.NormalizeRobustScaling instead of mlContext.Transforms.RobustScaler
  • mlContext.Data.CreateTextLoader instead of DataFrame.LoadCsv
  • IDataView instead of DataFrame

Please see the current ML.NET API documentation for details, and check whether you still get the extra "vwapGain" column with mlContext.Transforms.NormalizeRobustScaling.
I'm closing this issue for now; feel free to reopen it if you see the same issue after updating your code. Thanks.

mstfbl closed this as completed Jun 13, 2020
mstfbl added the wontfix (This will not be worked on) label Jun 13, 2020

CBrauer (Author) commented Jun 13, 2020

I strongly object to your closing this issue. Your reply did not address my issue, and it is full of errors.
I went to a lot of trouble to build a test app that demonstrates the issue. I would like to make the following four points:

  1. The test app was built with the latest release of ML.NET. I am not using outdated libraries, as the following screen capture shows:
    screen1

  2. There is no such method as mlContext.Transforms.NormalizeRobustScaling. The following screen capture shows this:
    screen2

  3. The arguments for RobustScaler do not include "inplace". Coming from the scikit-learn world, it does not make sense to me to create a new column in my dataset.
    screen3

  4. If you are doing contract work for Microsoft, I would like the name and email address of your manager. I would like to send him/her a complaint.
    Charles

mstfbl (Contributor) commented Jun 14, 2020

Hi @CBrauer ,

To address your point 1: by "outdated libraries" I meant that you are using functions like DataFrame.LoadCsv and classes like DataFrame that we no longer use. I recommend using mlContext.Data.CreateTextLoader and IDataView instead.

To address your point 2: we did add mlContext.Transforms.NormalizeRobustScaling to our NormalizerCatalog in PR #5166. For your reference, here are the public declarations of the two NormalizeRobustScaling overloads in our current codebase; which one to use depends on how you would like to provide your input and output columns:

/// <summary>
/// Create a <see cref="NormalizingEstimator"/>, which normalizes using statistics that are robust to outliers by centering the data around 0 (removing the median) and scales
/// the data according to the quantile range (defaults to the interquartile range).
/// </summary>
/// <param name="catalog">The transform catalog</param>
/// <param name="outputColumnName">Name of the column resulting from the transformation of <paramref name="inputColumnName"/>.
/// The data type on this column is the same as the input column.</param>
/// <param name="inputColumnName">Name of the column to transform. If set to <see langword="null"/>, the value of the <paramref name="outputColumnName"/> will be used as source.
/// The data type on this column should be <see cref="System.Single"/>, <see cref="System.Double"/> or a known-sized vector of those types.</param>
/// <param name="maximumExampleCount">Maximum number of examples used to train the normalizer.</param>
/// <param name="centerData">Whether to center the data around 0 by removing the median. Defaults to true.</param>
/// <param name="quantileMin">Quantile min used to scale the data. Defaults to 25.</param>
/// <param name="quantileMax">Quantile max used to scale the data. Defaults to 75.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[NormalizeRobustScaling](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/NormalizeSupervisedBinning.cs)]
/// ]]>
/// </format>
/// </example>
public static NormalizingEstimator NormalizeRobustScaling(this TransformsCatalog catalog,
    string outputColumnName, string inputColumnName = null,
    long maximumExampleCount = NormalizingEstimator.Defaults.MaximumExampleCount,
    bool centerData = NormalizingEstimator.Defaults.CenterData,
    uint quantileMin = NormalizingEstimator.Defaults.QuantileMin,
    uint quantileMax = NormalizingEstimator.Defaults.QuantileMax)
{
    var columnOptions = new NormalizingEstimator.RobustScalingColumnOptions(outputColumnName, inputColumnName, maximumExampleCount, centerData, quantileMin, quantileMax);
    return new NormalizingEstimator(CatalogUtils.GetEnvironment(catalog), columnOptions);
}

/// <summary>
/// Create a <see cref="NormalizingEstimator"/>, which normalizes using statistics that are robust to outliers by centering the data around 0 (removing the median) and scales
/// the data according to the quantile range (defaults to the interquartile range).
/// </summary>
/// <param name="catalog">The transform catalog</param>
/// <param name="columns">The pairs of input and output columns.
/// The input columns must be of data type <see cref="System.Single"/>, <see cref="System.Double"/> or a known-sized vector of those types.
/// The data type for the output column will be the same as the associated input column.</param>
/// <param name="maximumExampleCount">Maximum number of examples used to train the normalizer.</param>
/// <param name="centerData">Whether to center the data around 0 by removing the median. Defaults to true.</param>
/// <param name="quantileMin">Quantile min used to scale the data. Defaults to 25.</param>
/// <param name="quantileMax">Quantile max used to scale the data. Defaults to 75.</param>
/// <example>
/// <format type="text/markdown">
/// <![CDATA[
/// [!code-csharp[NormalizeBinning](~/../docs/samples/docs/samples/Microsoft.ML.Samples/Dynamic/Transforms/NormalizeBinningMulticolumn.cs)]
/// ]]>
/// </format>
/// </example>
public static NormalizingEstimator NormalizeRobustScaling(this TransformsCatalog catalog, InputOutputColumnPair[] columns,
    long maximumExampleCount = NormalizingEstimator.Defaults.MaximumExampleCount,
    bool centerData = NormalizingEstimator.Defaults.CenterData,
    uint quantileMin = NormalizingEstimator.Defaults.QuantileMin,
    uint quantileMax = NormalizingEstimator.Defaults.QuantileMax) =>
    new NormalizingEstimator(CatalogUtils.GetEnvironment(catalog),
        columns.Select(column =>
            new NormalizingEstimator.RobustScalingColumnOptions(column.OutputColumnName, column.InputColumnName, maximumExampleCount, centerData, quantileMin, quantileMax)).ToArray());
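
As a rough reading of the doc comments above (not a statement of the exact internal implementation), the scaling they describe can be written as a formula. The symbols x, Q_min and Q_max are introduced here for illustration only, and the quantile interpolation rule in the worked example is an assumption; the source does not say which rule ML.NET uses.

```latex
% Robust scaling as described in the doc comments:
% center on the median, then divide by the quantile range.
\[
  x' = \frac{x - \operatorname{median}(x)}{Q_{\max}(x) - Q_{\min}(x)},
  \qquad Q_{\min} = 25\text{th percentile},\quad Q_{\max} = 75\text{th percentile (defaults)}
\]
% With centerData = false the median is not subtracted:
\[
  x' = \frac{x}{Q_{\max}(x) - Q_{\min}(x)}
\]
% Worked example, assuming linear interpolation between order statistics:
% values {1, 2, 3, 4, 100} give median = 3, Q_25 = 2, Q_75 = 4, so
\[
  100 \mapsto \frac{100 - 3}{4 - 2} = 48.5,
  \qquad
  1 \mapsto \frac{1 - 3}{4 - 2} = -1 .
\]
```

The outlier 100 has no effect on the median or on the quantile range in this example, which is what "robust to outliers" in the summary refers to.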

It is odd that you cannot see the NormalizeRobustScaling functions declared above. I have confirmed that I cannot see them either with the installed NuGet packages Microsoft.MLFeaturizers v0.4.1, Microsoft.ML.Featurizers v0.17.0, and Microsoft.ML v1.5.0. I will check with the team on this, but it is outside the scope of the issue you originally reported here, which I address below.

To address your point 3: I am not sure exactly what you mean here, but I believe I understand why you are seeing two "vwapGain" columns. The first "vwapGain" column you are seeing is hidden. A hidden column is only accessible by its specific index in the output schema, which is exactly how your code accesses it.

Hidden columns exist by design, and the logic behind them is explained in detail here. In short, the RobustScaler transformer you're using reads the first "vwapGain" column, computes the scaled values, and adds them as a second "vwapGain" column. Because the second "vwapGain" column is newer, the first one becomes hidden. Both columns exist; the hidden first column is deliberately kept, for savers and for diagnostic purposes.
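
If a same-named hidden column is not wanted, one option suggested by the NormalizeRobustScaling signature quoted above is to pass a different outputColumnName, so that the scaled values land in a separately named column and the original stays visible. The following is a minimal sketch under assumptions: it requires a Microsoft.ML version in which NormalizeRobustScaling is publicly visible (it was not with v1.5.0, as noted above), and the GainRow class, the sample values, and the "vwapGainScaled" name are hypothetical.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type; the property name becomes the column name.
public class GainRow
{
    public float vwapGain { get; set; }
}

public static class ScalingExample
{
    // Scales "vwapGain" into a new column and returns the names of all visible columns.
    public static string[] ScaleToNewColumn()
    {
        var mlContext = new MLContext(seed: 1);
        var rows = new[] { 0.1f, 0.4f, -0.2f, 3.5f, 0.0f }
            .Select(v => new GainRow { vwapGain = v });
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // Distinct output name: the input column is not shadowed by the output.
        var pipeline = mlContext.Transforms.NormalizeRobustScaling(
            outputColumnName: "vwapGainScaled", inputColumnName: "vwapGain");
        IDataView transformed = pipeline.Fit(data).Transform(data);

        return transformed.Schema
            .Where(column => !column.IsHidden)
            .Select(column => column.Name)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", ScaleToNewColumn()));
    }
}
```

This does not modify the column in place the way scikit-learn users may expect; it only avoids having two columns with the same name.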

For context, when two or more columns share the same name, the column with the highest index is visible, and the other column(s) are marked as "hidden". If you use an IDataView cursor to iterate through the rows (instead of using Microsoft.ML.Data.DataDebuggerPreview, as you do in line 16), you will not see the hidden "vwapGain" column. For more information on using and iterating through IDataViews, please follow this tutorial on using DataViewRowCursor.
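
The lookup rule described above can be sketched as follows. This is an illustrative example rather than the tutorial's code: the SampleRow class and the sample values are hypothetical, and NormalizeMinMax stands in for the scaler only because it also writes its output under the input column's name.

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Hypothetical row type; the property name becomes the column name.
public class SampleRow
{
    public float vwapGain { get; set; }
}

public static class HiddenColumnExample
{
    // Builds a view where "vwapGain" is overwritten, so the original column becomes hidden.
    public static IDataView BuildOverwrittenView(MLContext mlContext)
    {
        var rows = new[] { 1f, 2f, 3f, 4f }.Select(v => new SampleRow { vwapGain = v });
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);
        return mlContext.Transforms.NormalizeMinMax("vwapGain").Fit(data).Transform(data);
    }

    // Names of the columns that are not hidden.
    public static string[] VisibleColumnNames(IDataView data) =>
        data.Schema.Where(column => !column.IsHidden).Select(column => column.Name).ToArray();

    public static void Main()
    {
        var mlContext = new MLContext(seed: 1);
        IDataView transformed = BuildOverwrittenView(mlContext);

        Console.WriteLine(string.Join(", ", VisibleColumnNames(transformed)));

        // Looking a column up by name resolves to the visible (newest) column.
        var column = transformed.Schema["vwapGain"];
        using (var cursor = transformed.GetRowCursor(new[] { column }))
        {
            var getter = cursor.GetGetter<float>(column);
            float value = 0;
            while (cursor.MoveNext())
            {
                getter(ref value);
                Console.WriteLine(value);
            }
        }
    }
}
```

Requesting columns by name, or filtering the schema on IsHidden before printing, keeps the hidden column out of the output; iterating over every schema index does not.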

To illustrate my point above, I have added the following snippet of code to your MyHead(IDataView train, int numRows) function; it prints whether or not each of these columns is hidden:

nSpaces = new int[nColumns];
for (var k = 0; k < nColumns; k++)
{
    var isHidden = trainPreview.Schema[k].IsHidden;
    for (var j = 0; j < maxCharInHeaderName - isHidden.ToString().Length + 1; j++)
    {
        Console.Write(" ");
    }
    Console.Write("isHidden: {0}", isHidden);
    nSpaces[k] = maxCharInHeaderName - isHidden.ToString().Length + 1;
}
Console.Write("\n");

Here's the output with my added snippet:
out

As you can see, the first "vwapGain" column is hidden, while the second "vwapGain" column is not, as befits the logic explained above.

So, in summary, the extra "vwapGain" column you're referring to is not a problem, but an intentional design choice.

To address your point 4: I am not doing contract work for Microsoft, but I am confused as to exactly which errors you are referring to and what complaint you have. As I have done in this comment, I am happy to explain any other points in ML.NET that are not yet clear, and/or point you to the right resources.

However, I have explained why you are seeing two "vwapGain" columns (one hidden, one visible), how you are accessing the hidden column through its index (which is the only way to access it), and that this hidden column is intended and by design, so this issue will remain closed. The non-visibility of mlContext.Transforms.NormalizeRobustScaling is only indirectly related to this issue; if we determine it to be a real issue, it will be opened as a separate issue. Thanks.

CBrauer (Author) commented Jun 14, 2020

Excellent explanations. I appreciate it.

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022