Updates the Github documentation for v1.0 release. #712

Open
wants to merge 59 commits into
base: main
Commits
03b7939
Adding section for UDF serialization
Niharikadutta Apr 20, 2020
4ef693d
removing guides from master
Niharikadutta Apr 20, 2020
81145ca
Merge latest from master
Niharikadutta May 6, 2020
e4b81af
merging latest from master
Niharikadutta May 7, 2020
4c32173
Merge remote-tracking branch 'upstream/master'
Niharikadutta Jun 2, 2020
4987a09
Merge remote-tracking branch 'upstream/master'
Niharikadutta Jun 14, 2020
ca9612e
Merge remote-tracking branch 'upstream/master'
Niharikadutta Jun 16, 2020
f581c86
Merge remote-tracking branch 'upstream/master'
Niharikadutta Jun 20, 2020
086b325
Merge remote-tracking branch 'upstream/master'
Niharikadutta Jun 23, 2020
2f72907
Merge remote-tracking branch 'upstream/master'
Niharikadutta Jul 25, 2020
6bab996
CountVectorizer
Jul 27, 2020
e2a566b
moving private methods to bottom
Jul 27, 2020
5f682a6
changing wrap method
Jul 28, 2020
31371db
setting min version required
Jul 31, 2020
60eb82f
undoing csproj change
Jul 31, 2020
ed36375
member doesnt need to be internal
Jul 31, 2020
c7baf72
too many lines
Jul 31, 2020
d13303c
removing whitespace change
Jul 31, 2020
f5b477c
removing whitespace change
Jul 31, 2020
73db52b
ionide
Jul 31, 2020
98f5e4d
Merge remote-tracking branch 'upstream/master'
Niharikadutta Aug 7, 2020
4c5d502
Merge remote-tracking branch 'upstream/master'
Niharikadutta Aug 10, 2020
a766146
Merge branch 'master' into ml/countvectorizer
GoEddie Aug 12, 2020
ad6bced
Merge branch 'ml/countvectorizer' of https://github.com/GoEddie/spark
Niharikadutta Aug 13, 2020
8e1685c
Revert "Merge branch 'master' into ml/countvectorizer"
Niharikadutta Aug 13, 2020
255515e
Revert "Merge branch 'ml/countvectorizer' of https://github.com/GoEdd…
Niharikadutta Aug 13, 2020
a44c882
Merge remote-tracking branch 'upstream/master'
Niharikadutta Aug 14, 2020
3c2c936
fixing merge errors
Niharikadutta Aug 14, 2020
88e834d
removing ionid
Niharikadutta Aug 20, 2020
59e7299
Merge remote-tracking branch 'upstream/master'
Niharikadutta Aug 20, 2020
a13de2d
Merge branch 'master' of github.com:Niharikadutta/spark
Niharikadutta Aug 21, 2020
13d0e4a
Merge remote-tracking branch 'upstream/master'
Niharikadutta Aug 24, 2020
595b141
Merge remote-tracking branch 'upstream/master'
Niharikadutta Aug 29, 2020
decfa48
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 2, 2020
ce694ff
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 8, 2020
8128ba0
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 12, 2020
52f0a74
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 19, 2020
6a89f01
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 24, 2020
4b1de41
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 25, 2020
929d8e2
Merge remote-tracking branch 'upstream/master'
Niharikadutta Sep 26, 2020
ffa0a4d
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 2, 2020
2579faa
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 5, 2020
39b3950
first draft
Niharikadutta Oct 5, 2020
2297add
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 6, 2020
daade7a
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 8, 2020
cb6aa7a
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 12, 2020
cbe6e50
Merge branch 'master' of github.com:Niharikadutta/spark
Niharikadutta Oct 12, 2020
3a04b19
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 12, 2020
9377692
removing duplicate docs
Niharikadutta Oct 12, 2020
1295934
reverting table changes
Niharikadutta Oct 12, 2020
7497bd7
changes
Niharikadutta Oct 12, 2020
2c498dc
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 13, 2020
d19cfb6
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 16, 2020
d34188e
Merge branch 'master' of github.com:Niharikadutta/spark
Niharikadutta Oct 16, 2020
5457ffb
Merge remote-tracking branch 'upstream/master'
Niharikadutta Oct 26, 2020
2a44453
Merge remote-tracking branch 'upstream/master'
Niharikadutta Nov 4, 2020
3ec9756
Merge remote-tracking branch 'upstream/master'
Niharikadutta Nov 12, 2020
144233b
Merge remote-tracking branch 'upstream/master'
Niharikadutta Nov 18, 2020
ff38239
resolving merge conflicts
Niharikadutta Nov 18, 2020
@@ -0,0 +1,73 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.Spark.ML.Feature;
using Microsoft.Spark.Sql;
using Microsoft.Spark.UnitTest.TestUtils;
using Xunit;

namespace Microsoft.Spark.E2ETest.IpcTests.ML.Feature
{
[Collection("Spark E2E Tests")]
public class CountVectorizerModelTests
{
private readonly SparkSession _spark;

public CountVectorizerModelTests(SparkFixture fixture)
{
_spark = fixture.Spark;
}

[Fact]
public void Test_CountVectorizerModel()
{
DataFrame input = _spark.Sql("SELECT array('hello', 'I', 'AM', 'a', 'string', 'TO', " +
"'TOKENIZE') as input from range(100)");

const string inputColumn = "input";
const string outputColumn = "output";
const double minTf = 10.0;
const bool binary = false;

List<string> vocabulary = new List<string>()
{
"hello",
"I",
"AM",
"TO",
"TOKENIZE"
};

var countVectorizerModel = new CountVectorizerModel(vocabulary);

Assert.IsType<CountVectorizerModel>(new CountVectorizerModel("my-uid", vocabulary));

countVectorizerModel = countVectorizerModel
.SetInputCol(inputColumn)
.SetOutputCol(outputColumn)
.SetMinTF(minTf)
.SetBinary(binary);

Assert.Equal(inputColumn, countVectorizerModel.GetInputCol());
Assert.Equal(outputColumn, countVectorizerModel.GetOutputCol());
Assert.Equal(minTf, countVectorizerModel.GetMinTF());
Assert.Equal(binary, countVectorizerModel.GetBinary());
using (var tempDirectory = new TemporaryDirectory())
{
string savePath = Path.Join(tempDirectory.Path, "countVectorizerModel");
countVectorizerModel.Save(savePath);

CountVectorizerModel loadedModel = CountVectorizerModel.Load(savePath);
Assert.Equal(countVectorizerModel.Uid(), loadedModel.Uid());
}

Assert.IsType<int>(countVectorizerModel.GetVocabSize());
Assert.NotEmpty(countVectorizerModel.ExplainParams());
Assert.NotEmpty(countVectorizerModel.ToString());
}
}
}
@@ -0,0 +1,76 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System;
using System.IO;
using Microsoft.Spark.E2ETest.Utils;
using Microsoft.Spark.ML.Feature;
using Microsoft.Spark.Sql;
using Microsoft.Spark.UnitTest.TestUtils;
using Xunit;

namespace Microsoft.Spark.E2ETest.IpcTests.ML.Feature
{
[Collection("Spark E2E Tests")]
public class CountVectorizerTests
{
private readonly SparkSession _spark;

public CountVectorizerTests(SparkFixture fixture)
{
_spark = fixture.Spark;
}

[Fact]
public void Test_CountVectorizer()
{
DataFrame input = _spark.Sql("SELECT array('hello', 'I', 'AM', 'a', 'string', 'TO', " +
"'TOKENIZE') as input from range(100)");

const string inputColumn = "input";
const string outputColumn = "output";
const double minDf = 1;
const double minTf = 10;
const int vocabSize = 10000;
const bool binary = false;

var countVectorizer = new CountVectorizer();

countVectorizer
.SetInputCol(inputColumn)
.SetOutputCol(outputColumn)
.SetMinDF(minDf)
.SetMinTF(minTf)
.SetVocabSize(vocabSize);

Assert.IsType<CountVectorizerModel>(countVectorizer.Fit(input));
Assert.Equal(inputColumn, countVectorizer.GetInputCol());
Assert.Equal(outputColumn, countVectorizer.GetOutputCol());
Assert.Equal(minDf, countVectorizer.GetMinDF());
Assert.Equal(minTf, countVectorizer.GetMinTF());
Assert.Equal(vocabSize, countVectorizer.GetVocabSize());
Assert.Equal(binary, countVectorizer.GetBinary());

using (var tempDirectory = new TemporaryDirectory())
{
string savePath = Path.Join(tempDirectory.Path, "countVectorizer");
countVectorizer.Save(savePath);

CountVectorizer loadedVectorizer = CountVectorizer.Load(savePath);
Assert.Equal(countVectorizer.Uid(), loadedVectorizer.Uid());
}

Assert.NotEmpty(countVectorizer.ExplainParams());
Assert.NotEmpty(countVectorizer.ToString());
}

[SkipIfSparkVersionIsLessThan(Versions.V2_4_0)]
public void CountVectorizer_MaxDF()
{
const double maxDf = 100;
CountVectorizer countVectorizer = new CountVectorizer().SetMaxDF(maxDf);
Assert.Equal(maxDf, countVectorizer.GetMaxDF());
}
}
}
197 changes: 197 additions & 0 deletions src/csharp/Microsoft.Spark/ML/Feature/CountVectorizer.cs
@@ -0,0 +1,197 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using Microsoft.Spark.Interop;
using Microsoft.Spark.Interop.Ipc;
using Microsoft.Spark.Sql;

namespace Microsoft.Spark.ML.Feature
{
public class CountVectorizer : FeatureBase<CountVectorizer>, IJvmObjectReferenceProvider
{
private static readonly string s_countVectorizerClassName =
"org.apache.spark.ml.feature.CountVectorizer";

/// <summary>
/// Create a <see cref="CountVectorizer"/> without any parameters
/// </summary>
public CountVectorizer() : base(s_countVectorizerClassName)
{
}

/// <summary>
/// Create a <see cref="CountVectorizer"/> with a UID that is used to give the
/// <see cref="CountVectorizer"/> a unique ID
/// </summary>
/// <param name="uid">An immutable unique ID for the object and its derivatives.</param>
public CountVectorizer(string uid) : base(s_countVectorizerClassName, uid)
{
}

internal CountVectorizer(JvmObjectReference jvmObject) : base(jvmObject)
{
}

JvmObjectReference IJvmObjectReferenceProvider.Reference => _jvmObject;

/// <summary>Fits a model to the input data.</summary>
/// <param name="dataFrame">The <see cref="DataFrame"/> to fit the model to.</param>
/// <returns><see cref="CountVectorizerModel"/></returns>
public CountVectorizerModel Fit(DataFrame dataFrame) =>
new CountVectorizerModel((JvmObjectReference)_jvmObject.Invoke("fit", dataFrame));

/// <summary>
/// Loads the <see cref="CountVectorizer"/> that was previously saved using Save.
/// </summary>
/// <param name="path">
/// The path the previous <see cref="CountVectorizer"/> was saved to
/// </param>
/// <returns>New <see cref="CountVectorizer"/> object</returns>
public static CountVectorizer Load(string path) =>
WrapAsCountVectorizer((JvmObjectReference)
SparkEnvironment.JvmBridge.CallStaticJavaMethod(
s_countVectorizerClassName, "load", path));

/// <summary>
/// Gets the binary toggle to control the output vector values. If true, all nonzero counts
/// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic
/// models that model binary events rather than integer counts. Default: false.
/// </summary>
/// <returns>boolean</returns>
public bool GetBinary() => (bool)_jvmObject.Invoke("getBinary");

/// <summary>
/// Sets the binary toggle to control the output vector values. If true, all nonzero counts
/// (after minTF filter applied) are set to 1. This is useful for discrete probabilistic
/// models that model binary events rather than integer counts. Default: false.
/// </summary>
/// <param name="value">Turn the binary toggle on or off</param>
/// <returns><see cref="CountVectorizer"/> with the new binary toggle value set</returns>
public CountVectorizer SetBinary(bool value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setBinary", value));

/// <summary>
/// Gets the column that the <see cref="CountVectorizer"/> should read from and convert
/// into token count vectors. This would have been set by SetInputCol.
/// </summary>
/// <returns>string, the input column</returns>
public string GetInputCol() => _jvmObject.Invoke("getInputCol") as string;

/// <summary>
/// Sets the column that the <see cref="CountVectorizer"/> should read from.
/// </summary>
/// <param name="value">The name of the column to use as the source.</param>
/// <returns><see cref="CountVectorizer"/> with the input column set</returns>
public CountVectorizer SetInputCol(string value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setInputCol", value));

/// <summary>
/// Gets the name of the new column that the <see cref="CountVectorizer"/> will create in
/// the DataFrame.
/// </summary>
/// <returns>The name of the output column.</returns>
public string GetOutputCol() => _jvmObject.Invoke("getOutputCol") as string;

/// <summary>
/// Sets the name of the new column that the <see cref="CountVectorizer"/> will create in
/// the DataFrame.
/// </summary>
/// <param name="value">The name of the output column which will be created.</param>
/// <returns>New <see cref="CountVectorizer"/> with the output column set</returns>
public CountVectorizer SetOutputCol(string value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setOutputCol", value));

/// <summary>
/// Gets the maximum number of different documents a term could appear in to be included in
/// the vocabulary. A term that appears more than the threshold will be ignored. If this is
/// an integer greater than or equal to 1, this specifies the maximum number of documents
/// the term could appear in; if this is a double in [0,1), then this specifies the maximum
/// fraction of documents the term could appear in.
/// </summary>
/// <returns>The maximum document frequency</returns>
[Since(Versions.V2_4_0)]
public double GetMaxDF() => (double)_jvmObject.Invoke("getMaxDF");

/// <summary>
/// Sets the maximum number of different documents a term could appear in to be included in
/// the vocabulary. A term that appears more than the threshold will be ignored. If this is
/// an integer greater than or equal to 1, this specifies the maximum number of documents
/// the term could appear in; if this is a double in [0,1), then this specifies the maximum
/// fraction of documents the term could appear in.
/// </summary>
/// <param name="value">The maximum document frequency</param>
/// <returns>New <see cref="CountVectorizer"/> with the max df value set</returns>
[Since(Versions.V2_4_0)]
public CountVectorizer SetMaxDF(double value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMaxDF", value));

/// <summary>
/// Gets the minimum number of different documents a term must appear in to be included in
/// the vocabulary. If this is an integer greater than or equal to 1, this specifies the
/// number of documents the term must appear in; if this is a double in [0,1), then this
/// specifies the fraction of documents.
/// </summary>
/// <returns>The minimum document frequency</returns>
public double GetMinDF() => (double)_jvmObject.Invoke("getMinDF");

/// <summary>
/// Sets the minimum number of different documents a term must appear in to be included in
/// the vocabulary. If this is an integer greater than or equal to 1, this specifies the
/// number of documents the term must appear in; if this is a double in [0,1), then this
/// specifies the fraction of documents.
/// </summary>
/// <param name="value">The minimum document frequency</param>
/// <returns>New <see cref="CountVectorizer"/> with the min df value set</returns>
public CountVectorizer SetMinDF(double value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMinDF", value));

/// <summary>
/// Filter to ignore rare words in a document. For each document, terms with
/// frequency/count less than the given threshold are ignored. If this is an integer
/// greater than or equal to 1, then this specifies a count (of times the term must appear
/// in the document); if this is a double in [0,1), then this specifies a fraction (out of
/// the document's token count).
///
/// Note that the parameter is only used in transform of CountVectorizerModel and does not
/// affect fitting.
/// </summary>
/// <returns>Minimum term frequency</returns>
public double GetMinTF() => (double)_jvmObject.Invoke("getMinTF");

/// <summary>
/// Filter to ignore rare words in a document. For each document, terms with
/// frequency/count less than the given threshold are ignored. If this is an integer
/// greater than or equal to 1, then this specifies a count (of times the term must appear
/// in the document); if this is a double in [0,1), then this specifies a fraction (out of
/// the document's token count).
///
/// Note that the parameter is only used in transform of CountVectorizerModel and does not
/// affect fitting.
/// </summary>
/// <param name="value">Minimum term frequency</param>
/// <returns>New <see cref="CountVectorizer"/> with the min term frequency set</returns>
public CountVectorizer SetMinTF(double value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setMinTF", value));

/// <summary>
/// Gets the max size of the vocabulary. CountVectorizer will build a vocabulary that only
/// considers the top vocabSize terms ordered by term frequency across the corpus.
/// </summary>
/// <returns>The max size of the vocabulary</returns>
public int GetVocabSize() => (int)_jvmObject.Invoke("getVocabSize");

/// <summary>
/// Sets the max size of the vocabulary. <see cref="CountVectorizer"/> will build a
/// vocabulary that only considers the top vocabSize terms ordered by term frequency across
/// the corpus.
/// </summary>
/// <param name="value">The max vocabulary size</param>
/// <returns><see cref="CountVectorizer"/> with the max vocab value set</returns>
public CountVectorizer SetVocabSize(int value) =>
WrapAsCountVectorizer((JvmObjectReference)_jvmObject.Invoke("setVocabSize", value));

private static CountVectorizer WrapAsCountVectorizer(object obj) =>
new CountVectorizer((JvmObjectReference)obj);
}
}