[SPARK-23975][ML] Allow Clustering to take Arrays of Double as input features #21081
Conversation
Thanks for the PR! I made a first review pass
assert(kmeans.getK === 2)
assert(kmeans.getFeaturesCol === featuresColName)
assert(kmeans.getPredictionCol === "prediction")
No need to check this or the other Params which are not relevant to this test
val arrayUDF = udf { (features: Vector) =>
  features.toArray
}
val newdataset = dataset.withColumn(featuresColName, arrayUDF(col("features")))
nit: You could drop the original column as well just to make extra sure that it's not being accidentally used.
assert(kmeans.getDistanceMeasure === DistanceMeasure.EUCLIDEAN)
val model = kmeans.setMaxIter(1).fit(newdataset)

MLTestingUtils.checkCopyAndUids(kmeans, model)
You don't need this test here
ditto for hasSummary and copying
val predictUDF = udf((vector: Vector) => predict(vector))
dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
// val predictUDF = udf((vector: Vector) => predict(vector))
if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
tip: This can be more succinct if written as:
val predictUDF = if (dataset.schema(...).dataType.equals(...)) { A } else { B }
dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol)))) // so this line is only written once
@@ -312,6 +329,8 @@ class KMeans @Since("1.5.0") (
val handlePersistence = dataset.storageLevel == StorageLevel.NONE
val instances: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
  case Row(point: Vector) => OldVectors.fromML(point)
  case Row(point: Seq[_]) =>
    OldVectors.fromML(Vectors.dense(point.asInstanceOf[Seq[Double]].toArray))
I'm not sure this will work with arrays of FloatType. Make sure to test it
dataset.withColumn($(predictionCol), predictUDF(col($(featuresCol))))
} else {
  val predictUDF = udf((vector: Seq[_]) =>
    predict(Vectors.dense(vector.asInstanceOf[Seq[Double]].toArray)))
This may not work with arrays of FloatType.
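The concern about FloatType arrays can be illustrated in plain Scala, with no Spark needed. Due to type erasure, casting a `Seq[Float]` with `asInstanceOf[Seq[Double]]` does not fail at the cast itself; a `ClassCastException` only surfaces later, when elements are unboxed. A minimal sketch (the helper name is hypothetical, not part of the PR):

```scala
// Hypothetical helper illustrating the reviewer's concern: convert
// element-by-element rather than casting the whole Seq, so both
// Float and Double array elements are handled safely.
object SafeConversion {
  def toDoubleArray(xs: Seq[Any]): Array[Double] = xs.map {
    case d: Double => d
    case f: Float  => f.toDouble
  }.toArray
}
```

In the UDFs above, such a conversion would be applied before building the dense vector.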
Test build #89410 has finished for PR 21081 at commit
Test build #89417 has finished for PR 21081 at commit
Thanks for the updates!
// val predictUDF = udf((vector: Vector) => predict(vector))
val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
  udf((vector: Vector) => predict(vector))
}
Scala style: } else {
udf((vector: Vector) => predict(vector))
}
else {
  udf((vector: Seq[_]) => {
Scala style: remove the unnecessary { at the end of the line (IntelliJ should warn you about this).
@@ -90,7 +90,12 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
 * @return output schema
 */
protected def validateAndTransformSchema(schema: StructType): StructType = {
  SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
  val typeCandidates = List(new VectorUDT,
    new ArrayType(DoubleType, true),
Thinking about this, let's actually disallow nullable columns. KMeans won't handle nulls properly.
Also, IntelliJ may warn you about passing boolean arguments as named arguments; that'd be nice to fix here.
}
else {
  udf((vector: Seq[_]) => {
    val featureArray = Array.fill[Double](vector.size)(0.0)
You shouldn't have to do the conversion in this convoluted (and less efficient) way. I'd recommend doing a match-case statement on dataset.schema; I think that will be the most succinct. Then you can handle Vector, Seq of Float, and Seq of Double separately, without conversions to strings.
Same for the similar cases below.
Here's what I meant:
val predictUDF = featuresDataType match {
case _: VectorUDT =>
udf((vector: Vector) => predict(vector))
case fdt: ArrayType => fdt.elementType match {
case _: FloatType =>
???
case _: DoubleType =>
???
}
}
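One hedged way to fill in the ??? branches of that sketch: the element-wise conversions are plain Scala and can stand alone below, while the Spark udf/predict wiring (which needs a live pipeline) is kept in comments. The object name is illustrative, not from the PR:

```scala
// Pure-Scala cores for the two ??? branches: Float vs Double elements.
object ArrayBranches {
  // FloatType branch: widen each element explicitly before building the array.
  def floatsToDoubles(v: Seq[Float]): Array[Double] = v.map(_.toDouble).toArray
  // DoubleType branch: a direct copy suffices.
  def doublesToArray(v: Seq[Double]): Array[Double] = v.toArray
}

// In the match sketch these would plug in roughly as:
//   case _: FloatType  => udf((v: Seq[Float])  => predict(Vectors.dense(ArrayBranches.floatsToDoubles(v))))
//   case _: DoubleType => udf((v: Seq[Double]) => predict(Vectors.dense(ArrayBranches.doublesToArray(v))))
```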
val predictUDF = udf((vector: Vector) => predict(vector))
// val predictUDF = udf((vector: Vector) => predict(vector))
val predictUDF = if (dataset.schema($(featuresCol)).dataType.equals(new VectorUDT)) {
  udf((vector: Vector) => predict(vector))
Side note: I realized that "predict" will cause the whole model to be serialized and sent to workers. But that's actually OK, since we do need to send most of the model data to make predictions, and there's not a clean way to send just the model weights. So I think my previous comment about copying "numClasses" to a local variable was not necessary. Don't bother reverting the change though.
@jkbradley Will this be applied to other algorithms besides clustering algorithms? And how would we support sparse float features?
I hope we can apply it to other algorithms too. @ludatabricks is doing some refactoring which should make that easier, but we're not going for a completely general approach right away. I don't think we need to worry about sparse FloatType features; users have no way to pass those in.
So why not design a generic vector class, and then implement Vector[Double] and Vector[Float] via specialization? Then it could support everything, both sparse and dense.
Test build #89585 has finished for PR 21081 at commit
@WeichenXu123 A generic vector class would be interesting, but that would be a big project, way out of scope of this PR. You could bring it up if that person on the dev list sends a SPIP about linear algebra.
A few more comments. (Also, remember to clean up the commented-out code.)
@@ -120,11 +123,32 @@ class KMeansModel private[ml] (
@Since("2.0.0")
def setPredictionCol(value: String): this.type = set(predictionCol, value)

@Since("2.4.0")
def featureToVector(dataset: Dataset[_], col: Column): Column = {
Make this private. In general, we try to keep APIs as private as possible since that allows us more flexibility to make changes in the future.
Also, add a Scala docstring saying what this does.
def featureToVector(dataset: Dataset[_], col: Column): Column = {
  val featuresDataType = dataset.schema(getFeaturesCol).dataType
  val transferUDF = featuresDataType match {
    case _: VectorUDT => udf((vector: Vector) => vector)
Just return col(getFeaturesCol), since that will be more efficient. (Calling a UDF requires data serialization overhead.)
@@ -305,15 +344,45 @@ class KMeans @Since("1.5.0") (
@Since("1.5.0")
def setSeed(value: Long): this.type = set(seed, value)

@Since("2.4.0")
def featureToVector(dataset: Dataset[_], col: Column): Column = {
Is this a copy of the same method? It should be shared, either in KMeansParams or in a static (object) method.
@@ -144,8 +168,23 @@ class KMeansModel private[ml] (
// TODO: Replace the temp fix when we have proper evaluators defined for clustering.
@Since("2.0.0")
def computeCost(dataset: Dataset[_]): Double = {
  SchemaUtils.checkColumnType(dataset.schema, $(featuresCol), new VectorUDT)
  val data: RDD[OldVector] = dataset.select(col($(featuresCol))).rdd.map {
  val typeCandidates = List(new VectorUDT,
You can reuse validateAndTransformSchema here.
Add validateSchema and use it in computeCost; addressed the comments from @jkbradley.
Test build #89660 has finished for PR 21081 at commit
Test build #89738 has finished for PR 21081 at commit
Test build #89743 has finished for PR 21081 at commit
Thanks for the updates! After these small fixes, this should be ready, and then we can continue with other algorithms.
new ArrayType(DoubleType, false),
new ArrayType(FloatType, false))
SchemaUtils.checkColumnTypes(schema, $(featuresCol), typeCandidates)
}
/**
Scala style: always put a newline between methods.
Ping: There needs to be a newline between the "}" of the previous method and the "/**" Scaladoc of the next method. Please start checking for this.
* @param colName column name for features
* @return Vector feature column
*/
@Since("2.4.0")
Don't add Since annotations to private APIs. They can get Since annotations when they are made public.
private[spark] object DatasetUtils {

/**
 * preprocessing the input feature column to Vector
This is a bit unclear. How about: "Cast a column in a Dataset to a Vector type."
Also, this isn't specific to features, so please clarify that below.
Finally, the key thing to document is the list of supported input types, so I'd add that.
* preprocessing the input feature column to Vector
* @param dataset DataFrame with columns for features
* @param colName column name for features
* @return Vector feature column
Add a note that this returned Column does not have Metadata
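Taken together, the docstring suggestions above might yield a doc comment along these lines (a hypothetical sketch of the requested wording, not the merged text):

```scala
/**
 * Cast a column in a Dataset to a Vector type.
 *
 * Supported input types: VectorUDT, ArrayType(DoubleType), ArrayType(FloatType).
 * Note that the returned Column does not carry the original column's Metadata.
 *
 * @param dataset input Dataset
 * @param colName name of the column to cast
 * @return the column cast to Vector type
 */
```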
assert(newdatasetD.schema(featuresColNameD).dataType.equals(new ArrayType(DoubleType, false)))
assert(newdatasetF.schema(featuresColNameF).dataType.equals(new ArrayType(FloatType, false)))

val kmeansD = new KMeans().setK(k).setFeaturesCol(featuresColNameD).setSeed(1)
Also do setMaxIter(1) to make this a little faster.
assert(predictDifference.count() == 0)

assert(modelD.computeCost(newdatasetD) == modelF.computeCost(newdatasetF))
nit: remove unnecessary newline
Test build #89736 has finished for PR 21081 at commit
Test build #89749 has finished for PR 21081 at commit
Test build #4157 has finished for PR 21081 at commit
Just the one style comment. Can you please fix it in the follow-up PR? I'll go ahead and merge this with master.
LGTM
Thanks @ludatabricks!
  })
case _: DoubleType => udf((vector: Seq[Double]) => {
  Vectors.dense(vector.toArray)
})
case other =>
Thanks! I forgot about this since this was generalized.
What changes were proposed in this pull request?
- Multiple possible input types are now accepted in validateAndTransformSchema() and computeCost() when checking the column type.
- An if statement is added in transform() to support array types as featuresCol.
- A case statement is added in fit() when selecting columns from the dataset.
These changes are applied to KMeans first, and will then be extended to other clustering methods.
How was this patch tested?
A unit test is added.