PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393
PARQUET-2249: Add IEEE-754 total order and nan count for floating types#3393wgtmac wants to merge 1 commit intoapache:masterfrom
Conversation
c01b3f3 to
4b7e86b
Compare
| mergeStatisticsMinMax(stats); | ||
| markAsNotEmpty(); | ||
| } | ||
| if (isNanCountSet() && stats.isNanCountSet()) { |
There was a problem hiding this comment.
Minor: should we also validate that both stats have the same column order before merging? Merging IEEE 754 stats with TYPE_DEFINED_ORDER stats could produce incorrect min/max since NaN handling differs. Though type.equals() on line 408 may already cover this if column order is part of type equality.
| if (!Float.isNaN(min_value)) { | ||
| if (Float.isNaN(min) || comparator().compare(min, min_value) > 0) { | ||
| min = min_value; | ||
| } | ||
| } |
There was a problem hiding this comment.
| if (!Float.isNaN(min_value)) { | |
| if (Float.isNaN(min) || comparator().compare(min, min_value) > 0) { | |
| min = min_value; | |
| } | |
| } | |
| if (!Float.isNaN(min_value)) { | |
| if (Float.isNaN(min) || comparator().compare(min, min_value) > 0) { | |
| min = min_value; | |
| } | |
| } else if (Float.isNan(min) && comparator().compare(min, min_value) > 0) { | |
| min = min_value; | |
| } |
IIUC the wording of the most recent proposal is that the statistics must contain the min and max NaN values for an all-NaN page/chunk. I think you need to still update the stats for an incoming NaN if the current min/max is NaN.
There was a problem hiding this comment.
There was a comment from @rdblue to simply use a standard NaN: apache/parquet-format#514 (comment)
IMO, currently NaN values are just sentinels for min/max to indicate a all-NaN page/chunk. We should be conservative not to depend on the order of NaN values for filtering.
There was a problem hiding this comment.
Yes, there's the comment, but the wording hasn't yet changed. And given the other comments in that thread I'm not sure it will be changed. But at this stage in the process I guess it doesn't matter.
Rationale for this change
PoC implementation for the spec change: apache/parquet-format#514
What changes are included in this PR?
Are these changes tested?
Added various test cases for both new metadata and filtering.
Are there any user-facing changes?
Yes, users can now set the new column order but by default it is not used.
Closes #406