Overview of the package
At the top level, the package exposes three main abstract classes: a Transformer, an Estimator, and a Pipeline. We will briefly explain each with some short examples, and provide more concrete examples of some of the models in the last section of this chapter.
Transformer
The Transformer class, as the name suggests, transforms your data, normally by appending a new column to your DataFrame.
At a high level, every new Transformer derived from the Transformer abstract class needs to implement a .transform(...) method. The first, and normally the only obligatory, parameter of this method is the DataFrame to be transformed. The remaining settings vary from Transformer to Transformer in the ML package: other popular parameters are inputCol and outputCol, which name the column to read and the column to append. Many parameters fall back to predefined values when omitted; for example, the featuresCol parameter used by many models defaults to 'features'.
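To make the contract described above concrete, here is a minimal sketch in plain Python. It deliberately does not use Spark: the names ToyTransformer and ToyTokenizer are hypothetical, and a list of dictionaries stands in for a DataFrame. It only illustrates the shape of the idea, that is, an abstract base class with a mandatory transform(...) method that takes the data as its first argument and returns a copy with one column appended.

```python
from abc import ABC, abstractmethod


class ToyTransformer(ABC):
    """Hypothetical stand-in for the Transformer abstract class (not the Spark API)."""

    @abstractmethod
    def transform(self, dataset):
        """Return a new dataset derived from `dataset`."""


class ToyTokenizer(ToyTransformer):
    """Illustrative transformer: splits a text column into lowercase tokens."""

    def __init__(self, inputCol, outputCol="tokens"):
        # inputCol has no default here and must be supplied;
        # outputCol falls back to an assumed default name.
        self.inputCol = inputCol
        self.outputCol = outputCol

    def transform(self, dataset):
        # Build new rows instead of mutating the input, so the original
        # data is left untouched and only a column is appended.
        return [
            {**row, self.outputCol: row[self.inputCol].lower().split()}
            for row in dataset
        ]


# A list of dictionaries plays the role of a DataFrame in this sketch.
data = [
    {"id": 0, "text": "Spark ML pipelines"},
    {"id": 1, "text": "Transformers append columns"},
]

tokenized = ToyTokenizer(inputCol="text").transform(data)
print(tokenized[0])
```

The same pattern, where you configure the column names on the object and then call .transform(...) with the data, is what the real Transformers in the package follow, although their exact parameters and defaults should be checked against the documentation.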
There are many Transformers offered in the spark.ml.feature module and we will briefly describe...