HIVE-29368: more conservative NDV combining by PessimisticStatCombiner #6244

konstantinb · 2025-12-18T18:50:03Z

What changes were proposed in this pull request?

HIVE-29368: more conservative NDV combining by PessimisticStatCombiner

Why are the changes needed?

These changes prevent severe underestimation of record statistics, which often leads to query failures on large data sets.

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Extensive regression testing in a private fork; new and updated query files in this PR

konstantinb · 2025-12-24T00:21:00Z

ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

      cs.setNumNulls(csd.getBinaryStats().getNumNulls());
    } else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
      cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
+      cs.setCountDistint(csd.getTimestampStats().getNumDVs());


I am unsure whether this was deliberately left out or is an unintended omission. It does seem to improve the stats calculations in multiple .q test files, especially after the more conservative NDV handling by PessimisticStatCombiner.

konstantinb · 2025-12-24T00:21:12Z

ql/src/java/org/apache/hadoop/hive/ql/stats/StatsUtils.java

      cs.setHistogram(csd.getDecimalStats().getHistogram());
    } else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) {
      cs.setAvgColLen(JavaDataModel.get().lengthOfDate());
+      cs.setCountDistint(csd.getDateStats().getNumDVs());


I am unsure whether this was deliberately left out or is an unintended omission. It does seem to improve the stats calculations in multiple .q test files, especially after the more conservative NDV handling by PessimisticStatCombiner.

sonarqubecloud · 2025-12-29T18:32:35Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

konstantinb · 2025-12-29T22:27:51Z

ql/src/java/org/apache/hadoop/hive/ql/stats/estimator/PessimisticStatCombiner.java

+      // to make the most conservative decisions possible, which is the exact goal of
+      // PessimisticStatCombiner. It does inflate statistics in multiple cases, but at the same time it
+      // also ensures that the query execution does not "blow up" due to too optimistic stats estimates
+      result.setCountDistint(0L);


This may appear counter-intuitive at first. However, when combining statistics from different logical branches of the same column with no reliable information about their interdependencies (i.e. in a "truly pessimistic" scenario), every other option appears to introduce undesired under-estimation, which often leads to catastrophic query failures.

For example, a simple column generated by a CASE..WHEN clause with three constants gets an NDV of 1 from the original code, while in most cases the "true" NDV is 3. If such a column later participates in a GROUP BY, its estimated number of records naturally becomes 1. Even this seemingly small under-estimation can lead to a bad decision on whether to convert to a mapjoin, especially over large data sets.

Alternatively, trying to "total up" the NDV values of the same column could over-estimate its true NDV, which, in turn, could lead to severe under-estimation of the records matching an "IN" filter, ultimately producing results just as bad as in the previous case.
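
To make the trade-off above concrete, here is a minimal, self-contained sketch of the three combining strategies. It is a toy model, not Hive code: the class `NdvCombineSketch`, all of its methods, and the two row estimators are hypothetical, and it assumes that an NDV of 0 is read downstream as "unknown", so estimators fall back to the input row count.

```java
import java.util.List;

/** Toy model of three ways to combine per-branch NDVs. Hypothetical; not Hive code. */
final class NdvCombineSketch {

  /** Assumption: an NDV of 0 means "unknown" to downstream estimators. */
  static final long UNKNOWN_NDV = 0L;

  /** Takes the largest branch NDV; undercounts when branches yield different values. */
  static long combineMax(List<Long> branchNdvs) {
    long max = 0L;
    for (long ndv : branchNdvs) {
      max = Math.max(max, ndv);
    }
    return max;
  }

  /** Adds up branch NDVs; overcounts when branches share values. */
  static long combineSum(List<Long> branchNdvs) {
    long sum = 0L;
    for (long ndv : branchNdvs) {
      sum += ndv;
    }
    return sum;
  }

  /** Makes no claim about the combined NDV at all. */
  static long combineUnknown(List<Long> branchNdvs) {
    return UNKNOWN_NDV;
  }

  /** Toy GROUP BY estimate: one row per distinct value, all rows if NDV is unknown. */
  static long estimateGroupByRows(long inputRows, long ndv) {
    return ndv == UNKNOWN_NDV ? inputRows : Math.min(inputRows, ndv);
  }

  /** Toy IN-filter estimate under a uniform-distribution assumption. */
  static long estimateInFilterRows(long inputRows, long ndv, int inListSize) {
    if (ndv == UNKNOWN_NDV) {
      return inputRows;
    }
    return Math.min(inputRows, inputRows * inListSize / ndv);
  }

  public static void main(String[] args) {
    long rows = 1_000_000L;

    // CASE..WHEN with three constant branches, each with NDV 1 (true NDV is 3).
    List<Long> caseBranches = List.of(1L, 1L, 1L);
    System.out.println("max NDV:          " + combineMax(caseBranches));
    System.out.println("GROUP BY rows:    "
        + estimateGroupByRows(rows, combineMax(caseBranches)));
    System.out.println("GROUP BY rows, unknown NDV: "
        + estimateGroupByRows(rows, combineUnknown(caseBranches)));

    // Two branches over the same 100 values: summing doubles the NDV,
    // which halves the estimated rows matched by a one-value IN filter.
    List<Long> sharedBranches = List.of(100L, 100L);
    System.out.println("IN rows, true NDV: " + estimateInFilterRows(rows, 100L, 1));
    System.out.println("IN rows, summed:   "
        + estimateInFilterRows(rows, combineSum(sharedBranches), 1));
  }
}
```

In this model the "unknown" strategy never shrinks a row estimate below the input size, which is the conservative behaviour the comment argues for, at the cost of inflated estimates when the true NDV is small.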

HIVE-29368: more conservative NDV combining by PessimisticStatCombiner

633951c

asf-ci-hive added tests pending tests unstable and removed tests pending labels Dec 18, 2025

konstantinb added 2 commits December 18, 2025 17:19

HIVE-29368: regenerated impacted test results + added an explanation …

199c441

…comment

HIVE-29368: one more test file, modified using explain output only fo…

f0022f7

…r now

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 19, 2025

HIVE-29368: only increment ndv by one inextractNDVGroupingColumns() i…

bd86e3c

…f it is "known"

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 19, 2025

HIVE-29368: further tuning NDV handling, including reading stats for …

75dbdf8

…timestamp/date columns

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 19, 2025

Merge origin/master into HIVE-29368

0ddef8c

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Dec 22, 2025

asf-ci-hive added the tests unstable label Dec 23, 2025

HIVE-29368: impacted test .out files

8b361de

konstantinb force-pushed the HIVE-29368 branch from 6a1ccb9 to 8b361de Compare December 23, 2025 22:16

asf-ci-hive added tests pending and removed tests unstable labels Dec 23, 2025

konstantinb commented Dec 24, 2025

asf-ci-hive added tests passed and removed tests pending labels Dec 24, 2025

HIVE-29368: Sonar Qube feedback + one more test

1285297

asf-ci-hive added tests pending tests passed tests unstable and removed tests passed tests pending labels Dec 28, 2025

HIVE-29368: a typo in the comment

e7ca1fd

konstantinb force-pushed the HIVE-29368 branch from 4e98026 to e7ca1fd Compare December 29, 2025 17:34

asf-ci-hive added tests pending and removed tests unstable labels Dec 29, 2025

asf-ci-hive added tests unstable and removed tests pending labels Dec 29, 2025

HIVE-29368: attempt a rebuild

5514ca2

asf-ci-hive added tests pending and removed tests unstable labels Dec 29, 2025

konstantinb commented Dec 29, 2025

konstantinb marked this pull request as ready for review December 29, 2025 22:30
