Skip to content

feat: support rangebitmap read and write#185

Open
fafacao86 wants to merge 6 commits intoalibaba:mainfrom
fafacao86:rangebitmap-inte
Open

feat: support rangebitmap read and write#185
fafacao86 wants to merge 6 commits intoalibaba:mainfrom
fafacao86:rangebitmap-inte

Conversation

@fafacao86
Copy link
Contributor

Purpose

Linked issue: close #146

Tests

UT in rangebtimap_file_index_test.cpp

  1. functional tests: single-chunk, multi-chunk, different types rangebitmap read and write
  2. different query patterns: EQ GT LT GTE LTE ISNULL, query existing data and non-existing data.
  3. different data patterns: normal numbers mixed with NULLs, floats with -0.0, +0.0, NaN etc.
  4. edge cases: empty rangebitmap, rangebitmap with ONLY NULLs.

IT in paimon::test::ReadInteWithIndexTest::CheckResultForRangeBitmap
data is generated using paimon-java v1.3.1.
Same data, same queries, with single-chunk and multi-chunk, result should be the same.

tests are mainly written by AI, reviewed by human.
test coverage:
range_bitmap_file_index.cpp is a little low(82.7%) is because No write integration test to cover CreateWriter Method.

[range_bitmap.cpp](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap.cpp.gcov.html)	
96.2%96.2%
96.2 %	203 / 211	100.0 %	19 / 19
[range_bitmap.h](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap.h.gcov.html)	
100.0%
100.0 %	5 / 5	100.0 %	2 / 2
[range_bitmap_file_index.cpp](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap_file_index.cpp.gcov.html)	
82.7%82.7%
82.7 %	110 / 133	96.6 %	28 / 29
[range_bitmap_file_index.h](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap_file_index.h.gcov.html)	
100.0%
100.0 %	1 / 1	100.0 %	2 / 2
[range_bitmap_file_index_factory.cpp](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap_file_index_factory.cpp.gcov.html)	
100.0%
100.0 %	4 / 4	100.0 %	4 / 4
[range_bitmap_file_index_factory.h](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap_file_index_factory.h.gcov.html)	
100.0%
100.0 %	2 / 2	100.0 %	1 / 1
[range_bitmap_file_index_test.cpp](http://172.16.25.167:8000/coverage/common/file_index/rangebitmap/range_bitmap_file_index_test.cpp.gcov.html)	
99.7%99.7%
99.7 %	317 / 318	100.0 %	61 / 61

API and Format

Documentation

Generative AI tooling

Generated-by: Kimi K2.5

const Literal& literal) {
return std::make_shared<BitmapIndexResult>(
[self = shared_from_this(), literal]() -> Result<RoaringBitmap32> {
if (!self->range_bitmap_) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit curious — are there any cases where self->range_bitmap_ could be null?
From the current logic, it seems like it should always be initialized as long as no exception occurs during setup.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, the constructor of paimon::RangeBitmapFileIndexReader::RangeBitmapFileIndexReader was public before, I changed it to private method later. So the defensive check is unneccessary.

PAIMON_ASSIGN_OR_RAISE(int32_t min_compare, key.CompareTo(min_));
PAIMON_ASSIGN_OR_RAISE(int32_t max_compare, key.CompareTo(max_));
PAIMON_ASSIGN_OR_RAISE(BitSliceIndexBitmap * bit_slice_ptr, this->GetBitSliceIndex());
if (min_compare == 0 && max_compare == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe BitSliceIndexBitmap* bit_slice_ptr?

return RoaringBitmap32();
}
PAIMON_ASSIGN_OR_RAISE(Dictionary * dictionary, this->GetDictionary());
PAIMON_ASSIGN_OR_RAISE(int32_t code, dictionary->Find(key));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style note: for consistency, could we place the pointer * next to the type instead of the variable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre-commit seems to be a little problem. I wrote Dictionary* dictionary and run pre-commit run -a, the pre-commit changes it to Dictionary * dictionary. I'll modify it and see if it pass the CI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The clang-format failed. I guess it might be related to the PAIMON_ASSIGN_OR_RAISE macro, clang-format cannot derive if Dictionary* dictionary is wether a varialble declaration or expressoin.
I checked other places in paimon-cpp, src/paimon/core/io/row_to_arrow_array_converter.h:136, auto seems to be the solution to this.

PAIMON_ASSIGN_OR_RAISE(auto* string_builder,
                       CastToTypedBuilder<arrow::StringBuilder>(array_builder));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* might be derived as a mathmatical multiplication, so clang-format put it in the middle?

Copy link
Collaborator

@lxy-9602 lxy-9602 Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right. The syntax like Dictionary* dict might have been misread. Code below seems to be ok.

PAIMON_ASSIGN_OR_RAISE(auto* dictionary, this->GetDictionary()); 

@zjw1111 zjw1111 requested a review from Copilot March 20, 2026 10:02
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds RangeBitmap file index support (read/write) and validates it with new unit/integration tests plus embedded test datasets to close #146.

Changes:

  • Implement RangeBitmap file index reader/writer and factory registration.
  • Add comprehensive UTs for RangeBitmap behavior across types and edge-cases, plus IT coverage using PaIOn-generated datasets.
  • Add ORC/Parquet test datasets (single-chunk and multi-chunk) with range-bitmap index metadata.

Reviewed changes

Copilot reviewed 31 out of 51 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/snapshot-1 Adds Parquet multi-chunk snapshot metadata for ITs.
test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/LATEST Adds Parquet multi-chunk latest snapshot pointer.
test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/EARLIEST Adds Parquet multi-chunk earliest snapshot pointer.
test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/schema/schema-0 Adds schema/options enabling range-bitmap with small chunk size for multi-chunk behavior.
test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/README Documents Parquet multi-chunk dataset rows and index config.
test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/snapshot-1 Adds Parquet single-chunk snapshot metadata for ITs.
test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/LATEST Adds Parquet single-chunk latest snapshot pointer.
test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/EARLIEST Adds Parquet single-chunk earliest snapshot pointer.
test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/schema/schema-0 Adds schema/options enabling range-bitmap for single-chunk case.
test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/README Documents Parquet single-chunk dataset rows and index config.
test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/snapshot-1 Adds ORC multi-chunk snapshot metadata for ITs.
test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/LATEST Adds ORC multi-chunk latest snapshot pointer.
test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/EARLIEST Adds ORC multi-chunk earliest snapshot pointer.
test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/schema/schema-0 Adds schema/options enabling range-bitmap + ORC format + small chunk size for multi-chunk behavior.
test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/README Documents ORC multi-chunk dataset rows and index config.
test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/snapshot-1 Adds ORC single-chunk snapshot metadata for ITs.
test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/LATEST Adds ORC single-chunk latest snapshot pointer.
test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/EARLIEST Adds ORC single-chunk earliest snapshot pointer.
test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/schema/schema-0 Adds schema/options enabling range-bitmap + ORC format for single-chunk case.
test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/README Documents ORC single-chunk dataset rows and index config.
test/inte/read_inte_with_index_test.cpp Adds IT assertions for RangeBitmap index across predicates and data patterns (single/multi-chunk).
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_test.cpp Adds UTs that roundtrip writer/reader and validate predicate behavior and edge cases.
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.h Declares factory for RangeBitmap file index.
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.cpp Implements factory creation and registration.
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.h Declares RangeBitmap file index, reader, and writer APIs.
src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.cpp Implements index reader/writer based on RangeBitmap serialization.
src/paimon/common/file_index/rangebitmap/range_bitmap.h Declares RangeBitmap query API and append/serialize builder.
src/paimon/common/file_index/rangebitmap/range_bitmap.cpp Implements RangeBitmap read path and serialization format.
src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h Renames default chunk size constant to match naming conventions.
src/paimon/common/file_index/CMakeLists.txt Adds RangeBitmap sources to build.
src/paimon/CMakeLists.txt Registers new RangeBitmap unit test in test build.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

reader->VisitGreaterOrEqual(Literal(static_cast<int32_t>(20))));
CheckResult(gte_20_result, {1, 3, 4, 5, 6});
ASSERT_OK_AND_ASSIGN(auto lte_40_result,
reader->VisitLessOrEqual(Literal(static_cast<int32_t>(40))))
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ASSERT_OK_AND_ASSIGN statement is missing a trailing semicolon, which will fail compilation. Add the missing ; after the macro invocation.

Suggested change
reader->VisitLessOrEqual(Literal(static_cast<int32_t>(40))))
reader->VisitLessOrEqual(Literal(static_cast<int32_t>(40))));

Copilot uses AI. Check for mistakes.
Comment on lines +257 to +259
PAIMON_ASSIGN_OR_RAISE(auto bsi,
BitSliceIndexBitmap::Appender::Create(
0, std::max(static_cast<int32_t>(bitmaps_.size() - 1), 0), pool_));
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bitmaps_.size() - 1 underflows when bitmaps_ is empty (e.g., all-null input), and the subsequent cast from a huge size_t to int32_t is implementation-defined. This can produce an incorrect max-code and potentially lead to oversized allocations. Fix by handling the empty case explicitly (e.g., compute max_code = bitmaps_.empty() ? 0 : static_cast<int32_t>(bitmaps_.size() - 1) before calling Create).

Suggested change
PAIMON_ASSIGN_OR_RAISE(auto bsi,
BitSliceIndexBitmap::Appender::Create(
0, std::max(static_cast<int32_t>(bitmaps_.size() - 1), 0), pool_));
const int32_t max_code =
bitmaps_.empty() ? 0 : static_cast<int32_t>(bitmaps_.size() - 1);
PAIMON_ASSIGN_OR_RAISE(auto bsi,
BitSliceIndexBitmap::Appender::Create(0, max_code, pool_));

Copilot uses AI. Check for mistakes.
Comment on lines +260 to +262
PAIMON_ASSIGN_OR_RAISE(auto dictionary,
ChunkedDictionary::Appender::Create(
factory_, static_cast<int32_t>(chunk_size_bytes_limit_), pool_));
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This narrows chunk_size_bytes_limit_ from int64_t to int32_t without validation. If a caller configures a chunk size > INT32_MAX, this can overflow and break dictionary chunking. Consider validating the value and returning Status::Invalid when it exceeds std::numeric_limits<int32_t>::max() (or update the downstream API to accept int64_t).

Copilot uses AI. Check for mistakes.
Comment on lines +122 to +131
Result<std::shared_ptr<RangeBitmapFileIndexReader>> RangeBitmapFileIndexReader::Create(
const std::shared_ptr<arrow::DataType>& arrow_type, const int32_t start, const int32_t length,
const std::shared_ptr<InputStream>& input_stream, const std::shared_ptr<MemoryPool>& pool) {
PAIMON_ASSIGN_OR_RAISE(FieldType field_type,
FieldTypeUtils::ConvertToFieldType(arrow_type->id()));
PAIMON_ASSIGN_OR_RAISE(std::unique_ptr<RangeBitmap> range_bitmap,
RangeBitmap::Create(input_stream, start, field_type, pool));
return std::shared_ptr<RangeBitmapFileIndexReader>(
new RangeBitmapFileIndexReader(std::move(range_bitmap)));
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length parameter is ignored, so the reader doesn't enforce bounds of the index payload. This can cause reads to run past the intended segment when multiple indexes share an input stream. Consider using length to bound the readable region (e.g., via a bounded/wrapping InputStream) or validate that the parsed sections (header/dictionary/bsi) fit within [start, start + length).

Copilot uses AI. Check for mistakes.
Comment on lines +401 to +404
std::cout << array_status.message() << std::endl;
ASSERT_TRUE(array_status.ok());
CheckResult(path, {split}, /*predicate=*/nullptr, expected_array);
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prints a message even when array_status.ok() is true, which adds noise/flakiness to test logs. Prefer removing the std::cout, or only printing the message on failure (e.g., in the assertion failure message).

Suggested change
std::cout << array_status.message() << std::endl;
ASSERT_TRUE(array_status.ok());
CheckResult(path, {split}, /*predicate=*/nullptr, expected_array);
}
ASSERT_TRUE(array_status.ok()) << array_status.message();
CheckResult(path, {split}, /*predicate=*/nullptr, expected_array);
}
}

Copilot uses AI. Check for mistakes.
}
// Test date type
{
// Test greater than predicate: f0 > 5 -> rows 0,3,4,7 (values 17,7,9,10)
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment does not match the code: it references f0 > 5, but the predicate is f4 <= 19725 (DATE). Please update the comment to reflect the actual predicate and expected rows, so the test intent is clear and future maintenance doesn't break expectations.

Suggested change
// Test greater than predicate: f0 > 5 -> rows 0,3,4,7 (values 17,7,9,10)
// Test less or equal DATE predicate: f4 <= 19725 -> row 1 (date value 19725)

Copilot uses AI. Check for mistakes.
Comment on lines +42 to +45
void SetUp() override {
pool_ = GetDefaultPool();
fs_ = std::make_shared<LocalFileSystem>();
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs_ and index_buffer_ are not used anywhere in this test file (they're only initialized/reset). Removing these members (and the corresponding setup/teardown code) will reduce clutter and make the fixture easier to understand.

Copilot uses AI. Check for mistakes.
Comment on lines +102 to +105
private:
std::shared_ptr<FileSystem> fs_;
std::shared_ptr<Bytes> index_buffer_;
};
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fs_ and index_buffer_ are not used anywhere in this test file (they're only initialized/reset). Removing these members (and the corresponding setup/teardown code) will reduce clutter and make the fixture easier to understand.

Copilot uses AI. Check for mistakes.
std::map<std::string, std::string> options;
// Configure a very small chunk size in bytes so that the dictionary must
// be split into multiple chunks.
options[RangeBitmapFileIndex::kChunkSize] = "86b";
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunk-size unit uses lowercase b here, while other test data/options use B (e.g., 16B). If MemorySize::ParseBytes is case-sensitive, this will fail. Consider using a consistent canonical unit like \"86B\" across tests/config.

Suggested change
options[RangeBitmapFileIndex::kChunkSize] = "86b";
options[RangeBitmapFileIndex::kChunkSize] = "86B";

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support RangeBitmap File Index

3 participants