[Fix](variance) Fix sample variance/stddev NaN res for single value#63605
Conversation
|
/review |
|
run buildall |
There was a problem hiding this comment.
Review result: no blocking issues found. Critical checkpoints: goal is clear and implemented: sample variance/stddev now return NULL for zero or one non-null input, including grouped results; tests cover existing nullable aggregate suites plus a single-row aggregate case. Scope is small and focused on the two sample aggregate functions plus expected output changes. Concurrency/lifecycle: no new shared state, threads, locks, or lifecycle-sensitive objects. Configuration/compatibility: no configs added; BE function compatibility restriction remains in place and aggregate state serialization is unchanged. Parallel paths: variance_samp, var_samp alias, and stddev_samp paths are covered; population variance/stddev behavior is intentionally unchanged. Error handling/memory: no ignored Status paths or new owned allocations. Data correctness: nullable inputs are skipped and result nullability now matches SQL sample aggregate semantics for count <= 1. Tests: regression expectations were updated; I did not run the regression suite in this review runner. User focus: no additional user-provided review focus.
TPC-H: Total hot run time: 31648 ms |
TPC-DS: Total hot run time: 172613 ms |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review opinion: no blocking issues found in the PR diff.
Critical checkpoint conclusions:
- Goal/test proof: The PR changes
VAR_SAMP/VARIANCE_SAMP/STDDEV_SAMPto return NULL for zero or one valid input and updates aggregate/window/foreach regression outputs covering the changed behavior. - Scope: The actual GitHub PR diff is focused on sample variance/stddev return nullability plus expected results.
- Concurrency/lifecycle: No new concurrency, locking, thread lifecycle, or static initialization concerns were introduced in the reviewed diff.
- Configuration/compatibility: No new configuration items. BE already registers
variance_samp/stddev_sampas restricted function compatibility paths; aggregate state serialization is unchanged. - Parallel paths: BE execution, Nereids FE nullability, aliases through existing registration, window outputs, and foreach outputs were considered. I did not find a missed parallel path in the PR diff.
- Conditional checks: The
count <= 1result-null condition matches the sample variance/stddev definition and null inputs continue to be ignored. - Tests/results: Regression expected outputs were updated for normal aggregate, nullable aggregate, window, MV, and foreach cases. I did not run the tests locally in this review runner.
- Observability: No new observability needed for this scalar aggregate behavior change.
- Transaction/persistence/data writes: Not applicable.
- FE-BE variable passing/protocol: No new transmitted variables or thrift/protocol changes.
- Performance: The added nullable-column check is local to aggregate add and avoids wrapper overhead for these functions; no obvious performance regression found.
User focus: No additional user-provided review focus was present.
TPC-H: Total hot run time: 31169 ms |
TPC-DS: Total hot run time: 171137 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
68e902d to
dd10b05
Compare
|
run buildall |
|
/review |
TPC-H: Total hot run time: 31832 ms |
There was a problem hiding this comment.
I found a blocking correctness issue. The PR goal is to return NULL for VAR_SAMP/VARIANCE_SAMP/STDDEV_SAMP when the valid input count is <= 1, but the BE implementation and expected files now return NaN for the one-valid-value case.
Critical checkpoint conclusions:
- Goal/test: The PR adds regression coverage for the one-row case, but the expected result verifies NaN rather than the stated NULL behavior, so it does not prove the intended fix.
- Scope: The code change is small, but it works around result nullability by emitting a floating NaN instead of implementing nullable results for this semantic case.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, config, storage format, or FE-BE protocol concerns found in the touched logic.
- Parallel paths: Aggregate and window expected outputs are updated consistently, but they consistently encode the wrong value for single valid rows.
- Tests/results: Existing result files were regenerated around this behavior, but the key new expectation is incorrect for SQL NULL semantics.
- Observability/transactions/data writes: Not applicable.
- Performance: No material performance concern found.
User focus: No additional user-provided review focus was supplied.
TPC-DS: Total hot run time: 172222 ms |
dd10b05 to
ff22897
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review completed for PR 63605.
Findings: I did not identify any new inline issue that is distinct from the existing review thread on be/src/exprs/aggregate/aggregate_function_stddev.h about returning floating-point NaN for VAR_SAMP/VARIANCE_SAMP/STDDEV_SAMP when the valid-input count is <= 1. I did not duplicate that thread.
Critical checkpoint conclusions:
- Goal and tests: The PR consistently changes BE sample variance/stddev output and updates regression outputs/adds a single-value aggregate case. Existing review context still covers the main semantic question of
NaNvs SQLNULL. - Scope: The source change is small and focused; many expected-file changes appear to be generated output updates.
- Concurrency/lifecycle/configuration: No new concurrency, lifecycle, static initialization, or configuration concerns found.
- Compatibility: No storage-format or FE/BE protocol compatibility changes found. Function registration and return type remain
Float64/Double. - Parallel paths: The aliases and window/regression output paths affected by the BE implementation were considered; no additional distinct missed implementation path found.
- Error handling/memory: No ignored
Status, exception-boundary, or memory-tracking issue found in this change. - Data correctness: The only substantive correctness concern observed is the already-known
NaNvs SQLNULLsemantics thread; no separate data visibility, transaction, or persistence concern applies. - Test coverage: Regression outputs cover aggregate/window/foreach-style cases, including the added single-row aggregate case. I did not run tests in this review.
- Observability/performance: No new observability need or meaningful performance regression found.
User focus: No additional user-provided review focus was present in .opencode-review.hAcyGz/review_focus.txt.
TPC-H: Total hot run time: 31135 ms |
TPC-DS: Total hot run time: 172074 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
| auto& col = assert_cast<ColumnFloat64&>(to); | ||
| if (this->count == 1 || this->count == 0) { | ||
| col.insert_default(); | ||
| col.get_data().push_back(std::nan("")); |
There was a problem hiding this comment.
我记得好像这个推荐用 std::numeric_limits::quiet_NaN() ?
|
PR approved by at least one committer and no changes requested. |
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review Summary
No additional distinct inline issues found in this review. I intentionally did not re-submit the existing thread about VAR_SAMP/VARIANCE_SAMP/STDDEV_SAMP returning NaN versus SQL NULL, because it is already present in the review context.
Critical checkpoint conclusions:
- Goal/test: The code change implements the PR-stated goal of returning
NaNfor sample variance/stddev when the valid input count is<= 1, and the changed regression outputs plus the added single-row aggregate query cover that behavior. - Scope: The code modification is small and focused in
aggregate_function_stddev.h; most other changes are expected-result updates. - Concurrency/lifecycle: No new concurrency, locking, static initialization, or non-obvious lifecycle management was introduced.
- Config/compatibility: No new config items or storage/protocol format changes were introduced.
- Parallel paths: The changed behavior is applied to
variance_samp/var_sampandstddev_samp; population variants remain unchanged. - Conditional checks: The new
count == 1 || count == 0behavior is direct and matches the PR description. - Test coverage: Existing aggregate, window, foreach, MV, nullable, and Nereids output cases were updated; a direct single-row aggregate query was added.
- Observability/transactions/data writes: Not applicable for this aggregate result change.
- FE/BE variable passing: Not applicable.
- Performance: No material performance concern found; this only changes final result insertion for sample aggregate states.
- User focus: The focus file says there are no additional user-provided focus points.
Residual risk: The already-known semantic concern about using non-null NaN instead of SQL NULL remains a product/compatibility decision and was not duplicated as a new comment.
TPC-H: Total hot run time: 31971 ms |
TPC-DS: Total hot run time: 172055 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
…pache#63605) Problem Summary: Fix `VAR_SAMP`, `VARIANCE_SAMP`, and `STDDEV_SAMP` to return `NaN` when the number of valid input values is less than or equal to 1. Sample variance/stddev are undefined for `n <= 1`, so returning `0.0` is misleading. before: ```sql CREATE TABLE t (id INT, v DOUBLE) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES('replication_num'='1'); INSERT INTO t VALUES (1, 5.0); -- 单行 SELECT VAR_SAMP(v), STDDEV_SAMP(v) FROM t; +-------------+----------------+ | VAR_SAMP(v) | STDDEV_SAMP(v) | +-------------+----------------+ | 0 | 0 | +-------------+----------------+ ``` now: ```sql SELECT VAR_SAMP(v), STDDEV_SAMP(v) FROM t; +-------------+----------------+ | VAR_SAMP(v) | STDDEV_SAMP(v) | +-------------+----------------+ | NaN | NaN | +-------------+----------------+ ``` doc: apache/doris-website#3765
…pache#63605) Problem Summary: Fix `VAR_SAMP`, `VARIANCE_SAMP`, and `STDDEV_SAMP` to return `NaN` when the number of valid input values is less than or equal to 1. Sample variance/stddev are undefined for `n <= 1`, so returning `0.0` is misleading. before: ```sql CREATE TABLE t (id INT, v DOUBLE) DUPLICATE KEY(id) DISTRIBUTED BY HASH(id) BUCKETS 1 PROPERTIES('replication_num'='1'); INSERT INTO t VALUES (1, 5.0); -- 单行 SELECT VAR_SAMP(v), STDDEV_SAMP(v) FROM t; +-------------+----------------+ | VAR_SAMP(v) | STDDEV_SAMP(v) | +-------------+----------------+ | 0 | 0 | +-------------+----------------+ ``` now: ```sql SELECT VAR_SAMP(v), STDDEV_SAMP(v) FROM t; +-------------+----------------+ | VAR_SAMP(v) | STDDEV_SAMP(v) | +-------------+----------------+ | NaN | NaN | +-------------+----------------+ ``` doc: apache/doris-website#3765
Problem Summary:
Fix
VAR_SAMP,VARIANCE_SAMP, andSTDDEV_SAMPto returnNaNwhen the number of valid input values is less than or equal to 1. Sample variance/stddev are undefined forn <= 1, so returning0.0is misleading.before:
now:
doc: apache/doris-website#3765