-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29368: more conservative NDV combining by PessimisticStatCombiner #6244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…timestamp/date columns
6a1ccb9 to
8b361de
Compare
| cs.setNumNulls(csd.getBinaryStats().getNumNulls()); | ||
| } else if (colTypeLowerCase.equals(serdeConstants.TIMESTAMP_TYPE_NAME)) { | ||
| cs.setAvgColLen(JavaDataModel.get().lengthOfTimestamp()); | ||
| cs.setCountDistint(csd.getTimestampStats().getNumDVs()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure if this was deliberately not added or an unintended omission. It does seem to improve stats' calculations of multiple .q test files, especially after more conservative NDV handling by PessimisticStatCombiner
| cs.setHistogram(csd.getDecimalStats().getHistogram()); | ||
| } else if (colTypeLowerCase.equals(serdeConstants.DATE_TYPE_NAME)) { | ||
| cs.setAvgColLen(JavaDataModel.get().lengthOfDate()); | ||
| cs.setCountDistint(csd.getDateStats().getNumDVs()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am unsure if this was deliberately not added or an unintended omission. It does seem to improve stats' calculations of multiple .q test files, especially after more conservative NDV handling by PessimisticStatCombiner
4e98026 to
e7ca1fd
Compare
|
| // to make the most conservative decisions possible, which is the exact goal of | ||
| // PessimisticStatCombiner. It does inflate statistics in multiple cases, but at the same time it | ||
| // also ensures than the query execution does not "blow up" due to too optimistic stats estimates | ||
| result.setCountDistint(0L); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could appear counter-intuitive at first, however, when combining statistics of different logical branches of the same column, and having no reliable information about their interdependencies (i.e. in a "truly pessimistic" scenario), every other option appears to introduce undesired under-estimations, which often lead to catastrophic query failures.
For example, a simple column generated by an CASE..WHEN clause with three constants produces an NDV of 1 by the original code, while, in most cases, the "true" NDV is 3. If such a column participates in a GROUP BY condition later on, its estimated number of records naturally becomes "1". Even this seemingly small under-estimation could lead to bad decision of converting to a mapjoin or not, especially over large data sets.
Alternatively, trying to "total up" NDV values of the same columns could cause over-estimation of the true NDV of such a column, which, it its turn, could lead to a severe underestimation of records matching an "IN" filter, ultimately producing equally bad results as the previous case



What changes were proposed in this pull request?
HIVE-29368: more conservative NDV combining by PessimisticStatCombiner
Why are the changes needed?
These changes prevent severe underestimation of records' statistics, which often lead to query failures on large data sets
Does this PR introduce any user-facing change?
NO
How was this patch tested?
Extensive regression testing in a private fork; new and updated query files in this PR