
Commit 2db070d

Latex fixes (dotnet#3635)
* Fix latex for FFM
* Fix latex for KMeans++
* Fix latex for L-BFGS Multiclass
* Indentation and typos
1 parent f0bc566 commit 2db070d

File tree

3 files changed: +20 −20 lines changed


src/Microsoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs

Lines changed: 3 additions & 3 deletions
@@ -56,15 +56,15 @@ namespace Microsoft.ML.Trainers
 /// - The K-means|| method. This method was introduced [here](https://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf) by Bahmani et al., and uses
 /// a parallel method that drastically reduces the number of passes needed to obtain a good initialization.
 ///
-/// The latter is the default initialization method. The other methods can be specified in the [Options](xref:Microsoft.ML.Trainers.KMeansTrainer.Options)
+/// K-means|| is the default initialization method. The other methods can be specified in the [Options](xref:Microsoft.ML.Trainers.KMeansTrainer.Options)
 /// when creating the trainer using
 /// [KMeansTrainer(Options)](xref:Microsoft.ML.KMeansClusteringExtensions.KMeans(Microsoft.ML.ClusteringCatalog.ClusteringTrainers,Microsoft.ML.Trainers.KMeansTrainer.Options)).
 ///
 /// ### Scoring Function
 /// The output Score column contains the $L_2$-norm distance (i.e., [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)) of the given input vector $\textbf{x}\in \mathbb{R}^n$ to each cluster's centroid.
 /// Assume that the centriod of the $c$-th cluster is $\textbf{m}_c \in \mathbb{R}^n$.
-/// The $c$-th value at the Score column would be $d_c = || \textbf{x} - \textbf{m}_c ||_2^2$.
-/// The predicted label is the index with the smallest value in a $K$ dimension vector $[d_{0}, \dots, d_{K-1}]$, where $K$ is the number of clusters.
+/// The $c$-th value at the Score column would be $d_c = || \textbf{x} - \textbf{m}\_c ||\_2^2$.
+/// The predicted label is the index with the smallest value in a $K$ dimensional vector $[d\_{0}, \dots, d\_{K-1}]$, where $K$ is the number of clusters.
 ///
 /// For more information on K-means, and K-means++ see:
 /// [K-means](https://en.wikipedia.org/wiki/K-means_clustering)
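As a concrete illustration of the scoring function described in this hunk, here is a minimal sketch in plain C# (not the ML.NET implementation; the `x` and `centroids` arrays and their layout are hypothetical) that computes $d_c = || \textbf{x} - \textbf{m}_c ||_2^2$ for every centroid and picks the index of the smallest distance as the predicted cluster:

```csharp
// Sketch only: computes d_c = ||x - m_c||_2^2 for each centroid m_c
// and returns the index of the smallest distance as the predicted cluster.
static (float[] Scores, int PredictedLabel) ScoreKMeans(float[] x, float[][] centroids)
{
    var scores = new float[centroids.Length];
    for (int c = 0; c < centroids.Length; c++)
    {
        float d = 0;
        for (int i = 0; i < x.Length; i++)
        {
            float diff = x[i] - centroids[c][i];
            d += diff * diff;                 // squared Euclidean distance
        }
        scores[c] = d;
    }

    int label = 0;
    for (int c = 1; c < scores.Length; c++)
        if (scores[c] < scores[label])
            label = c;                        // index of the smallest d_c
    return (scores, label);
}
```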

src/Microsoft.ML.StandardTrainers/FactorizationMachine/FactorizationMachineTrainer.cs

Lines changed: 3 additions & 3 deletions
@@ -40,7 +40,7 @@ namespace Microsoft.ML.Trainers
 /// [FieldAwareFactorizationMachine](xref:Microsoft.ML.FactorizationMachineExtensions.FieldAwareFactorizationMachine(Microsoft.ML.BinaryClassificationCatalog.BinaryClassificationTrainers,System.String[],System.String,System.String)),
 /// or [FieldAwareFactorizationMachine(Options)](xref:Microsoft.ML.FactorizationMachineExtensions.FieldAwareFactorizationMachine(Microsoft.ML.BinaryClassificationCatalog.BinaryClassificationTrainers,Microsoft.ML.Trainers.FieldAwareFactorizationMachineTrainer.Options)).
 ///
-/// In contrast to other binary classifiers which can only support one feature column, field-aware factorization machine can consume multiple feature columns.
+/// In contrast to other binary classifiers, which can only support one feature column, field-aware factorization machine can consume multiple feature columns.
 /// Each column is viewed as a container of some features and such a container is called a field.
 /// Note that all feature columns must be float vectors but their dimensions can be different.
 /// The motivation of splitting features into different fields is to model features from different distributions independently.
@@ -68,8 +68,8 @@ namespace Microsoft.ML.Trainers
 /// ### Scoring Function
 /// Field-aware factorization machine is a scoring function which maps feature vectors from different fields to a scalar score.
 /// Assume that all $m$ feature columns are concatenated into a long feature vector $\textbf{x} \in {\mathbb R}^n$ and ${\mathcal F}(j)$ denotes the $j$-th feature's field indentifier.
-/// The corresponding score is $\hat{y}\left(\textbf{x}\right) = \left\langle \textbf{w}, \textbf{x} \right\rangle + \sum_{j = 1}^n \sum_{j' = j + 1}^n \left\langle \textbf{v}_{j, {\mathcal F}(j')} , \textbf{v}_{j', {\mathcal F}(j)} \right\rangle x_j x_{j'}$,
-/// where $\left\langle \cdot, \cdot \right\rangle$ is the inner product operator, $\textbf{w} \in {\mathbb R}^n$ stores the linear coefficients, and $\textbf{v}_{j, f}\in {\mathbb R}^k$ is the $j$-th feature's representation in the $f$-th field's latent space.
+/// The corresponding score is $\hat{y}(\textbf{x}) = \langle \textbf{w}, \textbf{x} \rangle + \sum_{j = 1}^n \sum_{j' = j + 1}^n \langle \textbf{v}\_{j, {\mathcal F}(j')}, \textbf{v}\_{j', {\mathcal F}(j)} \rangle x_j x_{j'}$,
+/// where $\langle \cdot, \cdot \rangle$ is the inner product operator, $\textbf{w} \in {\mathbb R}^n$ stores the linear coefficients, and $\textbf{v}_{j, f}\in {\mathbb R}^k$ is the $j$-th feature's representation in the $f$-th field's latent space.
 /// Note that $k$ is the latent dimension specified by the user.
 ///
 /// The predicted label is the sign of $\hat{y}$. If $\hat{y} > 0$, this model predicts true. Otherwise, it predicts false.
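To illustrate the scoring function in the hunk above, here is a minimal sketch in plain C# (not the ML.NET implementation; the `x`, `field`, `w`, and `v` arrays and their layout are hypothetical) that evaluates $\hat{y}(\textbf{x})$ as the linear term plus the pairwise field-aware interaction terms:

```csharp
// Sketch only: evaluates the field-aware factorization machine score for one example.
// w[j] is the linear weight of feature j, v[j][f] is feature j's k-dimensional latent
// vector in field f's space, and field[j] = F(j) is feature j's field identifier.
static float FfmScore(float[] x, int[] field, float[] w, float[][][] v)
{
    int n = x.Length;
    float score = 0;

    for (int j = 0; j < n; j++)
        score += w[j] * x[j];                     // linear term <w, x>

    for (int j = 0; j < n; j++)
        for (int jp = j + 1; jp < n; jp++)
        {
            float[] vj = v[j][field[jp]];         // v_{j, F(j')}
            float[] vjp = v[jp][field[j]];        // v_{j', F(j)}
            float inner = 0;
            for (int d = 0; d < vj.Length; d++)
                inner += vj[d] * vjp[d];          // <v_{j,F(j')}, v_{j',F(j)}>
            score += inner * x[j] * x[jp];        // interaction term
        }

    return score;                                 // the predicted label is the sign of this score
}
```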

src/Microsoft.ML.StandardTrainers/Standard/LogisticRegression/MulticlassLogisticRegression.cs

Lines changed: 14 additions & 14 deletions
@@ -59,36 +59,36 @@ namespace Microsoft.ML.Trainers
 ///
 /// ### Scoring Function
 /// [Maximum entropy model](https://en.wikipedia.org/wiki/Multinomial_logistic_regression) is a generalization of linear [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression).
-/// The major difference between maximum entropy model and logistic regression is that the number of classes supported in considered classification problem.
+/// The major difference between maximum entropy model and logistic regression is the number of classes supported in the considered classification problem.
 /// Logistic regression is only for binary classification while maximum entropy model handles multiple classes.
 /// See Section 1 in [this paper](https://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf) for a detailed introduction.
 ///
 /// Assume that the number of classes is $m$ and number of features is $n$.
-/// Maximum entropy model assigns the $c$-th class a coefficient vector $\textbf{w}_c \in {\mathbb R}^n$ and a bias $b_c \in {\mathbb R}$, for $c=1,\dots,m$.
-/// Given a feature vector $\textbf{x} \in {\mathbb R}^n$, the $c$-th class's score is $\hat{y}^c = \textbf{w}_c^T \textbf{x} + b_c$.
-/// The probability of $\textbf{x}$ belonging to class $c$ is defined by $\tilde{P}(c | \textbf{x}) = \frac{ e^{\hat{y}^c} }{ \sum_{c' = 1}^m e^{\hat{y}^{c'}} }$.
-/// Let $P(c, \textbf{x})$ denote the join probability of seeing $c$ and $\textbf{x}$.
-/// The loss function minimized by this trainer is $-\sum_{c = 1}^m P(c, \textbf{x}) \log \tilde{P}(c | \textbf{x}) $, which is the negative [log-likelihood function](https://en.wikipedia.org/wiki/Likelihood_function#Log-likelihood).
+/// Maximum entropy model assigns the $c$-th class a coefficient vector $\textbf{w}\_c \in {\mathbb R}^n$ and a bias $b_c \in {\mathbb R}$, for $c=1,\dots,m$.
+/// Given a feature vector $\textbf{x} \in {\mathbb R}^n$, the $c$-th class's score is $\hat{y}^c = \textbf{w}\_c^T \textbf{x} + b_c$.
+/// The probability of $\textbf{x}$ belonging to class $c$ is defined by $\tilde{P}(c | \textbf{x}) = \frac{ e^{\hat{y}^c} }{ \sum\_{c' = 1}^m e^{\hat{y}^{c'}} }$.
+/// Let $P(c, \textbf{ x})$ denote the joint probability of seeing $c$ and $\textbf{x}$.
+/// The loss function minimized by this trainer is $-\sum\_{c = 1}^m P(c, \textbf{ x}) \log \tilde{P}(c | \textbf{x}) $, which is the negative [log-likelihood function](https://en.wikipedia.org/wiki/Likelihood_function#Log-likelihood).
 ///
 /// ### Training Algorithm Details
 /// The optimization technique implemented is based on [the limited memory Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS)](https://en.wikipedia.org/wiki/Limited-memory_BFGS).
-/// L-BFGS is a [quasi-Newtonian method](https://en.wikipedia.org/wiki/Quasi-Newton_method) which replaces the expensive computation cost of Hessian matrix with an approximation but still enjoys a fast convergence rate like [Newton method](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization) where the full Hessian matrix is computed.
-/// Since L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with high-dimensional feature vector.
-/// The number of historical states is a user-specified parameter, using a larger number may lead to a better approximation to the Hessian matrix but also a higher computation cost per step.
+/// L-BFGS is a [quasi-Newtonian method](https://en.wikipedia.org/wiki/Quasi-Newton_method), which replaces the expensive computation of the Hessian matrix with an approximation but still enjoys a fast convergence rate like [Newton's method](https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization) where the full Hessian matrix is computed.
+/// Since L-BFGS approximation uses only a limited amount of historical states to compute the next step direction, it is especially suited for problems with a high-dimensional feature vector.
+/// The number of historical states is a user-specified parameter, using a larger number may lead to a better approximation of the Hessian matrix but also a higher computation cost per step.
 ///
-/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data and that prevents overfitting by penalizing model's magnitude usually measured by some norm functions.
-/// This can improve the generalization of the model learned by selecting the optimal complexity in the bias-variance trade-off.
-/// Regularization works by adding the penalty that is associated with coefficient values to the error of the hypothesis.
+/// Regularization is a method that can render an ill-posed problem more tractable by imposing constraints that provide information to supplement the data. It prevents overfitting by penalizing the model's magnitude, usually measured by some norm functions.
+/// This can improve the generalization of the model learned by selecting the optimal complexity in the [bias-variance trade-off](https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff).
+/// Regularization works by adding a penalty that is associated with coefficient values to the error of the hypothesis.
 /// An accurate model with extreme coefficient values would be penalized more, but a less accurate model with more conservative values would be penalized less.
 ///
 /// This learner supports [elastic net regularization](https://en.wikipedia.org/wiki/Elastic_net_regularization): a linear combination of L1-norm (LASSO), $|| \textbf{w} ||_1$, and L2-norm (ridge), $|| \textbf{w} ||_2^2$ regularizations.
 /// L1-norm and L2-norm regularizations have different effects and uses that are complementary in certain respects.
 /// Using L1-norm can increase sparsity of the trained $\textbf{w}$.
 /// When working with high-dimensional data, it shrinks small weights of irrelevant features to 0 and therefore no resource will be spent on those bad features when making prediction.
-/// If L1-norm regularization is used, the used training algorithm would be [QWL-QN](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.5260).
+/// If L1-norm regularization is used, the training algorithm used would be [OWL-QN](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.68.5260).
 /// L2-norm regularization is preferable for data that is not sparse and it largely penalizes the existence of large weights.
 ///
-/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables out of the model.
+/// An aggressive regularization (that is, assigning large coefficients to L1-norm or L2-norm regularization terms) can harm predictive capacity by excluding important variables from the model.
 /// Therefore, choosing the right regularization coefficients is important when applying maximum entropy classifier.
 ///
 /// Check the See Also section for links to usage examples.
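To make the scoring function and loss in the hunk above concrete, here is a minimal sketch in plain C# (not the ML.NET implementation; parameter names and shapes are hypothetical) that computes the per-class scores $\hat{y}^c$, the softmax probabilities $\tilde{P}(c | \textbf{x})$, and the negative log-likelihood of one observed example:

```csharp
// Sketch only: per-class scores, softmax probabilities, and the negative log-likelihood
// of the observed class for a single example. w[c] and b[c] are the c-th class's
// coefficient vector and bias.
static (double[] Probabilities, double Loss) MaxEntExample(
    double[] x, double[][] w, double[] b, int observedClass)
{
    int m = w.Length;
    var scores = new double[m];
    for (int c = 0; c < m; c++)
    {
        double s = b[c];
        for (int i = 0; i < x.Length; i++)
            s += w[c][i] * x[i];                  // y^c = w_c^T x + b_c
        scores[c] = s;
    }

    // Softmax: P(c | x) = exp(y^c) / sum_{c'} exp(y^{c'}); subtract the max score for numerical stability.
    double max = scores[0];
    for (int c = 1; c < m; c++)
        if (scores[c] > max) max = scores[c];

    var probs = new double[m];
    double sum = 0;
    for (int c = 0; c < m; c++)
    {
        probs[c] = System.Math.Exp(scores[c] - max);
        sum += probs[c];
    }
    for (int c = 0; c < m; c++)
        probs[c] /= sum;

    // Negative log-likelihood of the observed class: the per-example term of the loss the trainer
    // minimizes (an elastic-net penalty on w is added on top when regularization is configured).
    double loss = -System.Math.Log(probs[observedClass]);
    return (probs, loss);
}
```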
