/// - The K-means|| method. This method was introduced [here](https://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf) by Bahmani et al., and uses
/// a parallel method that drastically reduces the number of passes needed to obtain a good initialization.
///
/// K-means|| is the default initialization method. The other methods can be specified in the [Options](xref:Microsoft.ML.Trainers.KMeansTrainer.Options)
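///
/// A minimal sketch of how these options might be set from user code (the property names below are assumptions, not taken from this file; check the Options class referenced above for the exact API):
///
/// ```csharp
/// using Microsoft.ML;
/// using Microsoft.ML.Trainers;
///
/// var mlContext = new MLContext(seed: 0);
///
/// // Hypothetical option values for illustration only.
/// var options = new KMeansTrainer.Options
/// {
///     FeatureColumnName = "Features",
///     NumberOfClusters = 3
///     // The initialization method can also be chosen through this Options class;
///     // the exact property name is documented there.
/// };
///
/// var pipeline = mlContext.Clustering.Trainers.KMeans(options);
/// ```
///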
/// The output Score column contains the $L_2$-norm distance (i.e., [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)) of the given input vector $\textbf{x}\in \mathbb{R}^n$ to each cluster's centroid.
/// Assume that the centroid of the $c$-th cluster is $\textbf{m}_c \in \mathbb{R}^n$.
/// The $c$-th value in the Score column would be $d_c = || \textbf{x} - \textbf{m}\_c ||\_2^2$.
/// The predicted label is the index with the smallest value in a $K$-dimensional vector $[d\_{0}, \dots, d\_{K-1}]$, where $K$ is the number of clusters.
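///
/// For intuition, a small sketch (not library code) of how the predicted label follows from the Score column, i.e., the argmin over the per-cluster squared distances:
///
/// ```csharp
/// // score[c] holds d_c = ||x - m_c||_2^2 for cluster c.
/// static int PredictedCluster(float[] score)
/// {
///     var best = 0;
///     for (var c = 1; c < score.Length; c++)
///     {
///         if (score[c] < score[best])
///             best = c;   // keep the index of the smallest distance
///     }
///     return best;
/// }
/// ```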
///
/// For more information on K-means and K-means++, see:
/// or [FieldAwareFactorizationMachine(Options)](xref:Microsoft.ML.FactorizationMachineExtensions.FieldAwareFactorizationMachine(Microsoft.ML.BinaryClassificationCatalog.BinaryClassificationTrainers,Microsoft.ML.Trainers.FieldAwareFactorizationMachineTrainer.Options)).
///
/// In contrast to other binary classifiers, which can only support one feature column, field-aware factorization machine can consume multiple feature columns.
/// Each column is viewed as a container of some features and such a container is called a field.
/// Note that all feature columns must be float vectors but their dimensions can be different.
/// The motivation for splitting features into different fields is to model features from different distributions independently.
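///
/// As a rough sketch, several feature columns (one per field) could be passed to the trainer as shown below; the column names are placeholders, and the exact overload signature should be checked against the extension methods referenced above:
///
/// ```csharp
/// using Microsoft.ML;
///
/// var mlContext = new MLContext();
///
/// // Each float-vector column plays the role of one field.
/// var pipeline = mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(
///     featureColumnNames: new[] { "UserFeatures", "ItemFeatures", "ContextFeatures" },
///     labelColumnName: "Label");
/// ```
///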
/// ### Scoring Function
/// Field-aware factorization machine is a scoring function which maps feature vectors from different fields to a scalar score.
/// Assume that all $m$ feature columns are concatenated into a long feature vector $\textbf{x} \in {\mathbb R}^n$ and ${\mathcal F}(j)$ denotes the $j$-th feature's field identifier.
/// The corresponding score is $\hat{y}(\textbf{x}) = \langle \textbf{w}, \textbf{x} \rangle + \sum_{j = 1}^n \sum_{j' = j + 1}^n \langle \textbf{v}_{j, {\mathcal F}(j')}, \textbf{v}_{j', {\mathcal F}(j)} \rangle x_j x_{j'}$,
/// where $\langle \cdot, \cdot \rangle$ is the inner product operator, $\textbf{w} \in {\mathbb R}^n$ stores the linear coefficients, and $\textbf{v}_{j, f}\in {\mathbb R}^k$ is the $j$-th feature's representation in the $f$-th field's latent space.
/// Note that $k$ is the latent dimension specified by the user.
///
/// The predicted label is the sign of $\hat{y}$. If $\hat{y} > 0$, this model predicts true. Otherwise, it predicts false.
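///
/// The scoring function can be transcribed into plain C# as below (an illustrative sketch, not ML.NET's internal implementation); `w`, `v`, and `field` stand for $\textbf{w}$, $\textbf{v}_{j, f}$, and ${\mathcal F}(\cdot)$:
///
/// ```csharp
/// // x[j]     : j-th entry of the concatenated feature vector
/// // w[j]     : linear coefficient of feature j
/// // v[j][f]  : latent vector (length k) of feature j in field f
/// // field[j] : field identifier F(j) of feature j
/// static double Score(double[] x, double[] w, double[][][] v, int[] field)
/// {
///     var n = x.Length;
///     var score = 0.0;
///
///     // Linear term <w, x>.
///     for (var j = 0; j < n; j++)
///         score += w[j] * x[j];
///
///     // Pairwise interactions: <v_{j, F(j')}, v_{j', F(j)}> * x_j * x_{j'}.
///     for (var j = 0; j < n; j++)
///     {
///         for (var jp = j + 1; jp < n; jp++)
///         {
///             var vj = v[j][field[jp]];
///             var vjp = v[jp][field[j]];
///             var inner = 0.0;
///             for (var d = 0; d < vj.Length; d++)
///                 inner += vj[d] * vjp[d];
///             score += inner * x[j] * x[jp];
///         }
///     }
///     return score;
/// }
/// ```
///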
/// [Maximum entropy model](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) is a generalization of linear [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).
/// The major difference between maximum entropy model and logistic regression is the number of classes supported in the considered classification problem.
/// Logistic regression is only for binary classification while maximum entropy model handles multiple classes.
/// See Section 1 in [this paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf) for a detailed introduction.
///
/// Assume that the number of classes is $m$ and the number of features is $n$.
/// Maximum entropy model assigns the $c$-th class a coefficient vector $\textbf{w}\_c \in {\mathbb R}^n$ and a bias $b_c \in {\mathbb R}$, for $c=1,\dots,m$.
/// Given a feature vector $\textbf{x} \in {\mathbb R}^n$, the $c$-th class's score is $\hat{y}^c = \textbf{w}\_c^T \textbf{x} + b_c$.
/// The probability of $\textbf{x}$ belonging to class $c$ is defined by $\tilde{P}(c | \textbf{x}) = \frac{ e^{\hat{y}^c} }{ \sum\_{c' = 1}^m e^{\hat{y}^{c'}} }$.
/// Let $P(c, \textbf{x})$ denote the joint probability of seeing $c$ and $\textbf{x}$.
/// The loss function minimized by this trainer is $-\sum\_{c = 1}^m P(c, \textbf{x}) \log \tilde{P}(c | \textbf{x}) $, which is the negative [log-likelihood function](https://en.wikipedia.org/wiki/Likelihood_function#Log-likelihood).
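///
/// The scoring and probability steps above can be sketched in C# as follows (illustrative only, not the trainer's internal code):
///
/// ```csharp
/// using System;
/// using System.Linq;
///
/// // w[c] is the coefficient vector of class c, b[c] its bias.
/// static double[] ClassProbabilities(double[] x, double[][] w, double[] b)
/// {
///     var m = w.Length;
///     var scores = new double[m];
///     for (var c = 0; c < m; c++)
///     {
///         var s = b[c];                       // y^c = w_c^T x + b_c
///         for (var i = 0; i < x.Length; i++)
///             s += w[c][i] * x[i];
///         scores[c] = s;
///     }
///
///     // Softmax: P(c | x) = exp(y^c) / sum_{c'} exp(y^{c'}).
///     // Subtracting the maximum score keeps the exponentials numerically stable.
///     var max = scores.Max();
///     var exp = scores.Select(y => Math.Exp(y - max)).ToArray();
///     var sum = exp.Sum();
///     return exp.Select(e => e / sum).ToArray();
/// }
/// ```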
///
/// ### Training Algorithm Details
/// The optimization technique implemented is based on [the limited memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS)](https://en.wikipedia.org/wiki/Limited-memory_BFGS).
/// L-BFGS is a [quasi-Newtonian method](https://en.wikipedia.org/wiki/Quasi-Newton_method), which replaces the expensive computation of the Hessian matrix with an approximation but still enjoys a fast convergence rate like [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization) where the full Hessian matrix is computed.
/// Since the L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with a high-dimensional feature vector.
/// The number of historical states is a user-specified parameter; using a larger number may lead to a better approximation of the Hessian matrix but also a higher computation cost per step.
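///
/// A minimal sketch of tuning this from user code (the property names below, such as HistorySize, are assumptions based on the public Options type and should be verified against its documentation):
///
/// ```csharp
/// using Microsoft.ML;
/// using Microsoft.ML.Trainers;
///
/// var mlContext = new MLContext();
///
/// // Hypothetical option values for illustration only.
/// var options = new LbfgsMaximumEntropyMulticlassTrainer.Options
/// {
///     HistorySize = 50,        // number of historical states kept by L-BFGS
///     L1Regularization = 0.1f, // LASSO term weight
///     L2Regularization = 0.1f  // ridge term weight
/// };
///
/// var trainer = mlContext.MulticlassClassification.Trainers.LbfgsMaximumEntropy(options);
/// ```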
///
/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data. It prevents overfitting by penalizing the model's magnitude, usually measured by some norm functions.
/// This can improve the generalization of the model learned by selecting the optimal complexity in the [bias-variance trade-off](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
/// Regularization works by adding a penalty that is associated with coefficient values to the error of the hypothesis.
/// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
///
/// This learner supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w} ||_1$, and L2-norm (ridge), $|| \textbf{w} ||_2^2$ regularizations.
/// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
/// Using L1-norm can increase sparsity of the trained $\textbf{w}$.
/// When working with high-dimensional data, it shrinks the small weights of irrelevant features to 0 and therefore no resources will be spent on those bad features when making predictions.
/// If L1-norm regularization is used, the training algorithm is [OWL-QN](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.5260).
/// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
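///
/// Written out, the elastic-net regularized objective takes the standard form below, where $\lambda_1$ and $\lambda_2$ are the user-specified L1 and L2 weights (the symbol names are illustrative, not taken from this file):
///
/// $$\min_{\textbf{w}, b} \; \sum_i -\log \tilde{P}(c_i | \textbf{x}_i) + \lambda_1 || \textbf{w} ||_1 + \lambda_2 || \textbf{w} ||_2^2$$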
///
/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
/// Therefore, choosing the right regularization coefficients is important when applying the maximum entropy classifier.
///
/// Check the See Also section for links to usage examples.