Commit 9a0272f

hhbyyh authored and srowen committed
[SPARK-6177][MLlib] Add note in LDA example to remind possible coalesce
JIRA: https://issues.apache.org/jira/browse/SPARK-6177

Add a comment to the LDA example that introduces coalesce, to avoid the potentially massive number of partitions produced by `sc.textFile`. `sc.textFile` creates an RDD with at least one partition per input file, and a very large number of partitions degrades LDA performance.

Author: Yuhao Yang <[email protected]>

Closes apache#4899 from hhbyyh/adjustPartition and squashes the following commits:

a499630 [Yuhao Yang] update comment
9a2d7b6 [Yuhao Yang] move to comment
f7fd5d4 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into adjustPartition
26a564a [Yuhao Yang] add coalesce to LDAExample

1 parent 8767565 commit 9a0272f

1 file changed, +3 -1 lines changed

examples/src/main/scala/org/apache/spark/examples/mllib/LDAExample.scala

Lines changed: 3 additions & 1 deletion
@@ -173,7 +173,9 @@ object LDAExample {
       stopwordFile: String): (RDD[(Long, Vector)], Array[String], Long) = {
 
     // Get dataset of document texts
-    // One document per line in each text file.
+    // One document per line in each text file. If the input consists of many small files,
+    // this can result in a large number of small partitions, which can degrade performance.
+    // In this case, consider using coalesce() to create fewer, larger partitions.
     val textRDD: RDD[String] = sc.textFile(paths.mkString(","))
 
     // Split text into words
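
The commit only adds a comment; it does not call coalesce() itself. As a rough illustration of what the comment suggests, the sketch below caps the number of partitions after reading the files. The object name `CoalesceSketch`, the helper `targetPartitions`, the function `loadDocuments`, and the `maxPartitions` parameter are all hypothetical and not part of LDAExample; a suitable cap depends on the cluster and the data.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object CoalesceSketch {
  // Hypothetical helper: cap the partition count at maxPartitions, never below 1.
  def targetPartitions(current: Int, maxPartitions: Int): Int =
    math.max(1, math.min(current, maxPartitions))

  // Hypothetical loader: reads the files as in LDAExample, then merges
  // partitions only when there are more than the assumed cap.
  def loadDocuments(
      sc: SparkContext,
      paths: Seq[String],
      maxPartitions: Int): RDD[String] = {
    // One document per line in each text file, as in the example.
    val textRDD: RDD[String] = sc.textFile(paths.mkString(","))
    val current = textRDD.partitions.length
    val target = targetPartitions(current, maxPartitions)
    // coalesce() without shuffle merges existing partitions into fewer, larger ones.
    if (target < current) textRDD.coalesce(target) else textRDD
  }
}
```

By default coalesce() avoids a shuffle, which keeps the merge cheap; it can only reduce the partition count in that mode, which is why the sketch leaves the RDD untouched when it is already at or below the cap.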
