Skip to content

Conversation

@konstantinb
Copy link
Contributor

@konstantinb konstantinb commented Dec 18, 2025

What changes were proposed in this pull request?

HIVE-29368: more conservative NDV combining by PessimisticStatCombiner

Why are the changes needed?

These changes prevent severe underestimation of records' statistics, which often lead to query failures on large data sets

Does this PR introduce any user-facing change?

NO

How was this patch tested?

Extensive regression testing in a private fork; new and updated query files in this PR

cs.setNumNulls(csd.getBinaryStats().getNumNulls());
} else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) {
cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp());
cs.setCountDistint(csd.getTimestampStats().getNumDVs());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure if this was deliberately not added or an unintended omission. It does seem to improve stats' calculations of multiple .q test files, especially after more conservative NDV handling by PessimisticStatCombiner

cs.setHistogram(csd.getDecimalStats().getHistogram());
} else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) {
cs.setAvgColLen(JavaDataModel.get().lengthOfDate());
cs.setCountDistint(csd.getDateStats().getNumDVs());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am unsure if this was deliberately not added or an unintended omission. It does seem to improve stats' calculations of multiple .q test files, especially after more conservative NDV handling by PessimisticStatCombiner

@sonarqubecloud
Copy link

// to make the most conservative decisions possible, which is the exact goal of
// PessimisticStatCombiner. It does inflate statistics in multiple cases, but at the same time it
// also ensures than the query execution does not "blow up" due to too optimistic stats estimates
result.setCountDistint(0L);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could appear counter-intuitive at first, however, when combining statistics of different logical branches of the same column, and having no reliable information about their interdependencies (i.e. in a "truly pessimistic" scenario), every other option appears to introduce undesired under-estimations, which often lead to catastrophic query failures.

For example, a simple column generated by an CASE..WHEN clause with three constants produces an NDV of 1 by the original code, while, in most cases, the "true" NDV is 3. If such a column participates in a GROUP BY condition later on, its estimated number of records naturally becomes "1". Even this seemingly small under-estimation could lead to bad decision of converting to a mapjoin or not, especially over large data sets.

Alternatively, trying to "total up" NDV values of the same columns could cause over-estimation of the true NDV of such a column, which, it its turn, could lead to a severe underestimation of records matching an "IN" filter, ultimately producing equally bad results as the previous case

@konstantinb konstantinb marked this pull request as ready for review December 29, 2025 22:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants