
Commit 81d7747

alexander-daskalov authored and maropu committed
[MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?

In the docs concerning approx_count_distinct I have changed the description of the rsd parameter from **_maximum estimation error allowed_** to **_maximum relative standard deviation allowed_**.

### Why are the changes needed?

"Maximum estimation error allowed" can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on a given run the estimation error can still be above the rsd parameter.

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead.

### How was this patch tested?

No tests, as no code changes were made.

Closes apache#29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <[email protected]>
Signed-off-by: Takeshi Yamamuro <[email protected]>
(cherry picked from commit 10edeaf)
Signed-off-by: Takeshi Yamamuro <[email protected]>
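The distinction the PR draws can be illustrated with a small simulation. This is a sketch, not Spark code: it assumes the relative error of each `approx_count_distinct` run behaves roughly like a normal draw with standard deviation `rsd` (a simplification of the HyperLogLog++ error behavior). Under that model, a sizable fraction of runs land outside the `rsd` "target", which is exactly why "maximum estimation error" was a misleading label.

```python
import random

def fraction_exceeding_rsd(rsd: float, runs: int = 100_000, seed: int = 42) -> float:
    """Simulate runs whose relative error is ~ N(0, rsd) and count how often
    the absolute error exceeds rsd itself. Illustrative model only."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(runs) if abs(rng.gauss(0.0, rsd)) > rsd)
    return exceed / runs

# For a normal distribution, |error| exceeds one standard deviation about
# 32% of the time, so roughly a third of simulated runs miss the rsd target.
share = fraction_exceeding_rsd(rsd=0.05)
```

The takeaway matches the renamed docs: `rsd` is the standard deviation of the estimate's relative error, not an upper bound on it.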
1 parent 89765f5 commit 81d7747

File tree

6 files changed: +11 −10 lines changed


R/pkg/R/functions.R

Lines changed: 1 addition & 1 deletion

```diff
@@ -2199,7 +2199,7 @@ setMethod("pmod", signature(y = "Column"),
             column(jc)
           })
-#' @param rsd maximum estimation error allowed (default = 0.05).
+#' @param rsd maximum relative standard deviation allowed (default = 0.05).
 #'
 #' @rdname column_aggregate_functions
 #' @aliases approx_count_distinct,Column-method
```

python/pyspark/sql/functions.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -335,8 +335,8 @@ def approx_count_distinct(col, rsd=None):
     """Aggregate function: returns a new :class:`Column` for approximate distinct count of
     column `col`.
 
-    :param rsd: maximum estimation error allowed (default = 0.05). For rsd < 0.01, it is more
-        efficient to use :func:`countDistinct`
+    :param rsd: maximum relative standard deviation allowed (default = 0.05).
+        For rsd < 0.01, it is more efficient to use :func:`countDistinct`
 
     >>> df.agg(approx_count_distinct(df.age).alias('distinct_ages')).collect()
     [Row(distinct_ages=2)]
```

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala

Lines changed: 2 additions & 1 deletion

```diff
@@ -39,7 +39,8 @@ import org.apache.spark.unsafe.Platform
  * and its elements should be sorted into ascending order.
  * Duplicate endpoints are allowed, e.g. (1, 5, 5, 10), and ndv for
  * interval (5, 5] would be 1.
- * @param relativeSD The maximum estimation error allowed in the HyperLogLogPlusPlus algorithm.
+ * @param relativeSD The maximum relative standard deviation allowed
+ *                   in the HyperLogLogPlusPlus algorithm.
  */
 case class ApproxCountDistinctForIntervals(
     child: Expression,
```

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala

Lines changed: 2 additions & 2 deletions

```diff
@@ -41,13 +41,13 @@ import org.apache.spark.sql.types._
  * https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen#
  *
  * @param child to estimate the cardinality of.
- * @param relativeSD the maximum estimation error allowed.
+ * @param relativeSD the maximum relative standard deviation allowed.
  */
 // scalastyle:on
 @ExpressionDescription(
   usage = """
     _FUNC_(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
-      `relativeSD` defines the maximum estimation error allowed.""",
+      `relativeSD` defines the maximum relative standard deviation allowed.""",
   examples = """
     Examples:
       > SELECT _FUNC_(col1) FROM VALUES (1), (1), (2), (2), (3) tab(col1);
```
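For context on why `relativeSD` is a standard deviation: in the classic HyperLogLog analysis the standard error of the estimate is about `1.04 / sqrt(m)` for `m = 2^p` registers, so an implementation can derive a minimum precision `p` from the requested rsd. The sketch below is a back-of-the-envelope illustration of that relation, not Spark's exact sizing code; the minimum-precision floor of 4 is an assumption typical of HLL implementations.

```python
import math

def precision_for_rsd(rsd: float) -> int:
    """Smallest precision p such that the classic HLL standard error
    1.04 / sqrt(2**p) is at or below the requested rsd (sketch only)."""
    p = math.ceil(2.0 * math.log2(1.04 / rsd))
    return max(p, 4)  # assumed floor; real implementations enforce a small minimum p

# The default rsd = 0.05 already needs 2^9 = 512 registers, which is why
# very small rsd values make exact countDistinct the cheaper option.
```

Halving the rsd quadruples the register count, matching the `1/sqrt(m)` error decay.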

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

Lines changed: 2 additions & 2 deletions

```diff
@@ -1578,8 +1578,8 @@ object SQLConf {
   val NDV_MAX_ERROR =
     buildConf("spark.sql.statistics.ndv.maxError")
       .internal()
-      .doc("The maximum estimation error allowed in HyperLogLog++ algorithm when generating " +
-        "column level statistics.")
+      .doc("The maximum relative standard deviation allowed in HyperLogLog++ algorithm " +
+        "when generating column level statistics.")
       .version("2.1.1")
       .doubleConf
       .createWithDefault(0.05)
```

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Lines changed: 2 additions & 2 deletions

```diff
@@ -262,7 +262,7 @@ object functions {
   /**
    * Aggregate function: returns the approximate number of distinct items in a group.
    *
-   * @param rsd maximum estimation error allowed (default = 0.05)
+   * @param rsd maximum relative standard deviation allowed (default = 0.05)
    *
    * @group agg_funcs
    * @since 2.1.0
@@ -274,7 +274,7 @@ object functions {
   /**
    * Aggregate function: returns the approximate number of distinct items in a group.
    *
-   * @param rsd maximum estimation error allowed (default = 0.05)
+   * @param rsd maximum relative standard deviation allowed (default = 0.05)
    *
    * @group agg_funcs
    * @since 2.1.0
```
