feat: support rangebitmap read and write#185
Conversation
| const Literal& literal) { | ||
| return std::make_shared<BitmapIndexResult>( | ||
| [self = shared_from_this(), literal]() -> Result<RoaringBitmap32> { | ||
| if (!self->range_bitmap_) { |
There was a problem hiding this comment.
I'm a bit curious — are there any cases where self->range_bitmap_ could be null?
From the current logic, it seems like it should always be initialized as long as no exception occurs during setup.
There was a problem hiding this comment.
You're right, the constructor of paimon::RangeBitmapFileIndexReader::RangeBitmapFileIndexReader was public before, I changed it to private method later. So the defensive check is unneccessary.
| PAIMON_ASSIGN_OR_RAISE(int32_t min_compare, key.CompareTo(min_)); | ||
| PAIMON_ASSIGN_OR_RAISE(int32_t max_compare, key.CompareTo(max_)); | ||
| PAIMON_ASSIGN_OR_RAISE(BitSliceIndexBitmap * bit_slice_ptr, this->GetBitSliceIndex()); | ||
| if (min_compare == 0 && max_compare == 0) { |
There was a problem hiding this comment.
Maybe BitSliceIndexBitmap* bit_slice_ptr?
| return RoaringBitmap32(); | ||
| } | ||
| PAIMON_ASSIGN_OR_RAISE(Dictionary * dictionary, this->GetDictionary()); | ||
| PAIMON_ASSIGN_OR_RAISE(int32_t code, dictionary->Find(key)); |
There was a problem hiding this comment.
Minor style note: for consistency, could we place the pointer * next to the type instead of the variable?
There was a problem hiding this comment.
The pre-commit seems to be a little problem. I wrote Dictionary* dictionary and run pre-commit run -a, the pre-commit changes it to Dictionary * dictionary. I'll modify it and see if it pass the CI.
There was a problem hiding this comment.
The clang-format failed. I guess it might be related to the PAIMON_ASSIGN_OR_RAISE macro, clang-format cannot derive if Dictionary* dictionary is wether a varialble declaration or expressoin.
I checked other places in paimon-cpp, src/paimon/core/io/row_to_arrow_array_converter.h:136, auto seems to be the solution to this.
PAIMON_ASSIGN_OR_RAISE(auto* string_builder,
CastToTypedBuilder<arrow::StringBuilder>(array_builder));
There was a problem hiding this comment.
* might be derived as a mathmatical multiplication, so clang-format put it in the middle?
There was a problem hiding this comment.
Yes, you're right. The syntax like Dictionary* dict might have been misread. Code below seems to be ok.
PAIMON_ASSIGN_OR_RAISE(auto* dictionary, this->GetDictionary()); There was a problem hiding this comment.
Pull request overview
Adds RangeBitmap file index support (read/write) and validates it with new unit/integration tests plus embedded test datasets to close #146.
Changes:
- Implement RangeBitmap file index reader/writer and factory registration.
- Add comprehensive UTs for RangeBitmap behavior across types and edge-cases, plus IT coverage using PaIOn-generated datasets.
- Add ORC/Parquet test datasets (single-chunk and multi-chunk) with range-bitmap index metadata.
Reviewed changes
Copilot reviewed 31 out of 51 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/snapshot-1 | Adds Parquet multi-chunk snapshot metadata for ITs. |
| test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/LATEST | Adds Parquet multi-chunk latest snapshot pointer. |
| test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/EARLIEST | Adds Parquet multi-chunk earliest snapshot pointer. |
| test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/schema/schema-0 | Adds schema/options enabling range-bitmap with small chunk size for multi-chunk behavior. |
| test/test_data/parquet/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/README | Documents Parquet multi-chunk dataset rows and index config. |
| test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/snapshot-1 | Adds Parquet single-chunk snapshot metadata for ITs. |
| test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/LATEST | Adds Parquet single-chunk latest snapshot pointer. |
| test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/EARLIEST | Adds Parquet single-chunk earliest snapshot pointer. |
| test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/schema/schema-0 | Adds schema/options enabling range-bitmap for single-chunk case. |
| test/test_data/parquet/append_with_rangebitmap.db/append_with_rangebitmap/README | Documents Parquet single-chunk dataset rows and index config. |
| test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/snapshot-1 | Adds ORC multi-chunk snapshot metadata for ITs. |
| test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/LATEST | Adds ORC multi-chunk latest snapshot pointer. |
| test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/snapshot/EARLIEST | Adds ORC multi-chunk earliest snapshot pointer. |
| test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/schema/schema-0 | Adds schema/options enabling range-bitmap + ORC format + small chunk size for multi-chunk behavior. |
| test/test_data/orc/append_with_rangebitmap_multi_chunk.db/append_with_rangebitmap_multi_chunk/README | Documents ORC multi-chunk dataset rows and index config. |
| test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/snapshot-1 | Adds ORC single-chunk snapshot metadata for ITs. |
| test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/LATEST | Adds ORC single-chunk latest snapshot pointer. |
| test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/snapshot/EARLIEST | Adds ORC single-chunk earliest snapshot pointer. |
| test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/schema/schema-0 | Adds schema/options enabling range-bitmap + ORC format for single-chunk case. |
| test/test_data/orc/append_with_rangebitmap.db/append_with_rangebitmap/README | Documents ORC single-chunk dataset rows and index config. |
| test/inte/read_inte_with_index_test.cpp | Adds IT assertions for RangeBitmap index across predicates and data patterns (single/multi-chunk). |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_test.cpp | Adds UTs that roundtrip writer/reader and validate predicate behavior and edge cases. |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.h | Declares factory for RangeBitmap file index. |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index_factory.cpp | Implements factory creation and registration. |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.h | Declares RangeBitmap file index, reader, and writer APIs. |
| src/paimon/common/file_index/rangebitmap/range_bitmap_file_index.cpp | Implements index reader/writer based on RangeBitmap serialization. |
| src/paimon/common/file_index/rangebitmap/range_bitmap.h | Declares RangeBitmap query API and append/serialize builder. |
| src/paimon/common/file_index/rangebitmap/range_bitmap.cpp | Implements RangeBitmap read path and serialization format. |
| src/paimon/common/file_index/rangebitmap/dictionary/key_factory.h | Renames default chunk size constant to match naming conventions. |
| src/paimon/common/file_index/CMakeLists.txt | Adds RangeBitmap sources to build. |
| src/paimon/CMakeLists.txt | Registers new RangeBitmap unit test in test build. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| reader->VisitGreaterOrEqual(Literal(static_cast<int32_t>(20)))); | ||
| CheckResult(gte_20_result, {1, 3, 4, 5, 6}); | ||
| ASSERT_OK_AND_ASSIGN(auto lte_40_result, | ||
| reader->VisitLessOrEqual(Literal(static_cast<int32_t>(40)))) |
There was a problem hiding this comment.
This ASSERT_OK_AND_ASSIGN statement is missing a trailing semicolon, which will fail compilation. Add the missing ; after the macro invocation.
| reader->VisitLessOrEqual(Literal(static_cast<int32_t>(40)))) | |
| reader->VisitLessOrEqual(Literal(static_cast<int32_t>(40)))); |
| PAIMON_ASSIGN_OR_RAISE(auto bsi, | ||
| BitSliceIndexBitmap::Appender::Create( | ||
| 0, std::max(static_cast<int32_t>(bitmaps_.size() - 1), 0), pool_)); |
There was a problem hiding this comment.
bitmaps_.size() - 1 underflows when bitmaps_ is empty (e.g., all-null input), and the subsequent cast from a huge size_t to int32_t is implementation-defined. This can produce an incorrect max-code and potentially lead to oversized allocations. Fix by handling the empty case explicitly (e.g., compute max_code = bitmaps_.empty() ? 0 : static_cast<int32_t>(bitmaps_.size() - 1) before calling Create).
| PAIMON_ASSIGN_OR_RAISE(auto bsi, | |
| BitSliceIndexBitmap::Appender::Create( | |
| 0, std::max(static_cast<int32_t>(bitmaps_.size() - 1), 0), pool_)); | |
| const int32_t max_code = | |
| bitmaps_.empty() ? 0 : static_cast<int32_t>(bitmaps_.size() - 1); | |
| PAIMON_ASSIGN_OR_RAISE(auto bsi, | |
| BitSliceIndexBitmap::Appender::Create(0, max_code, pool_)); |
| PAIMON_ASSIGN_OR_RAISE(auto dictionary, | ||
| ChunkedDictionary::Appender::Create( | ||
| factory_, static_cast<int32_t>(chunk_size_bytes_limit_), pool_)); |
There was a problem hiding this comment.
This narrows chunk_size_bytes_limit_ from int64_t to int32_t without validation. If a caller configures a chunk size > INT32_MAX, this can overflow and break dictionary chunking. Consider validating the value and returning Status::Invalid when it exceeds std::numeric_limits<int32_t>::max() (or update the downstream API to accept int64_t).
| Result<std::shared_ptr<RangeBitmapFileIndexReader>> RangeBitmapFileIndexReader::Create( | ||
| const std::shared_ptr<arrow::DataType>& arrow_type, const int32_t start, const int32_t length, | ||
| const std::shared_ptr<InputStream>& input_stream, const std::shared_ptr<MemoryPool>& pool) { | ||
| PAIMON_ASSIGN_OR_RAISE(FieldType field_type, | ||
| FieldTypeUtils::ConvertToFieldType(arrow_type->id())); | ||
| PAIMON_ASSIGN_OR_RAISE(std::unique_ptr<RangeBitmap> range_bitmap, | ||
| RangeBitmap::Create(input_stream, start, field_type, pool)); | ||
| return std::shared_ptr<RangeBitmapFileIndexReader>( | ||
| new RangeBitmapFileIndexReader(std::move(range_bitmap))); | ||
| } |
There was a problem hiding this comment.
The length parameter is ignored, so the reader doesn't enforce bounds of the index payload. This can cause reads to run past the intended segment when multiple indexes share an input stream. Consider using length to bound the readable region (e.g., via a bounded/wrapping InputStream) or validate that the parsed sections (header/dictionary/bsi) fit within [start, start + length).
| std::cout << array_status.message() << std::endl; | ||
| ASSERT_TRUE(array_status.ok()); | ||
| CheckResult(path, {split}, /*predicate=*/nullptr, expected_array); | ||
| } |
There was a problem hiding this comment.
This prints a message even when array_status.ok() is true, which adds noise/flakiness to test logs. Prefer removing the std::cout, or only printing the message on failure (e.g., in the assertion failure message).
| std::cout << array_status.message() << std::endl; | |
| ASSERT_TRUE(array_status.ok()); | |
| CheckResult(path, {split}, /*predicate=*/nullptr, expected_array); | |
| } | |
| ASSERT_TRUE(array_status.ok()) << array_status.message(); | |
| CheckResult(path, {split}, /*predicate=*/nullptr, expected_array); | |
| } | |
| } |
| } | ||
| // Test date type | ||
| { | ||
| // Test greater than predicate: f0 > 5 -> rows 0,3,4,7 (values 17,7,9,10) |
There was a problem hiding this comment.
The comment does not match the code: it references f0 > 5, but the predicate is f4 <= 19725 (DATE). Please update the comment to reflect the actual predicate and expected rows, so the test intent is clear and future maintenance doesn't break expectations.
| // Test greater than predicate: f0 > 5 -> rows 0,3,4,7 (values 17,7,9,10) | |
| // Test less or equal DATE predicate: f4 <= 19725 -> row 1 (date value 19725) |
| void SetUp() override { | ||
| pool_ = GetDefaultPool(); | ||
| fs_ = std::make_shared<LocalFileSystem>(); | ||
| } |
There was a problem hiding this comment.
fs_ and index_buffer_ are not used anywhere in this test file (they're only initialized/reset). Removing these members (and the corresponding setup/teardown code) will reduce clutter and make the fixture easier to understand.
| private: | ||
| std::shared_ptr<FileSystem> fs_; | ||
| std::shared_ptr<Bytes> index_buffer_; | ||
| }; |
There was a problem hiding this comment.
fs_ and index_buffer_ are not used anywhere in this test file (they're only initialized/reset). Removing these members (and the corresponding setup/teardown code) will reduce clutter and make the fixture easier to understand.
| std::map<std::string, std::string> options; | ||
| // Configure a very small chunk size in bytes so that the dictionary must | ||
| // be split into multiple chunks. | ||
| options[RangeBitmapFileIndex::kChunkSize] = "86b"; |
There was a problem hiding this comment.
The chunk-size unit uses lowercase b here, while other test data/options use B (e.g., 16B). If MemorySize::ParseBytes is case-sensitive, this will fail. Consider using a consistent canonical unit like \"86B\" across tests/config.
| options[RangeBitmapFileIndex::kChunkSize] = "86b"; | |
| options[RangeBitmapFileIndex::kChunkSize] = "86B"; |
Purpose
Linked issue: close #146
Tests
UT in
rangebtimap_file_index_test.cppIT in
paimon::test::ReadInteWithIndexTest::CheckResultForRangeBitmapdata is generated using paimon-java v1.3.1.
Same data, same queries, with single-chunk and multi-chunk, result should be the same.
tests are mainly written by AI, reviewed by human.
test coverage:
range_bitmap_file_index.cpp is a little low(82.7%) is because No write integration test to cover CreateWriter Method.
API and Format
Documentation
Generative AI tooling
Generated-by: Kimi K2.5