Skip to content

feat(puffin): add basic data structures and constants#588

Open
zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
zhaoxuan1994:feat/puffin-basic-types
Open

feat(puffin): add basic data structures and constants#588
zhaoxuan1994 wants to merge 1 commit intoapache:mainfrom
zhaoxuan1994:feat/puffin-basic-types

Conversation

@zhaoxuan1994
Copy link

Add the foundational types for Puffin file format support:

  • Blob, BlobMetadata, FileMetadata structs
  • PuffinCompressionCodec enum with codec name conversion
  • StandardBlobTypes and StandardPuffinProperties constants
  • ToString functions for all types
  • 24 unit tests covering all public APIs

Copilot AI review requested due to automatic review settings March 11, 2026 08:23
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds foundational C++ types and helpers for Iceberg Puffin file support, plus unit tests, and wires them into the build.

Changes:

  • Introduces core Puffin data structures (Blob, BlobMetadata, FileMetadata) and standard constants (StandardBlobTypes, StandardPuffinProperties).
  • Adds PuffinCompressionCodec with name conversion/parsing helpers and ToString() functions.
  • Adds a dedicated puffin_test suite and integrates Puffin sources into the Iceberg library build.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/iceberg/puffin/blob.h Defines Blob structure for uncompressed blob payload + metadata.
src/iceberg/puffin/blob.cc Implements ToString(const Blob&).
src/iceberg/puffin/blob_metadata.h Defines BlobMetadata footer metadata structure.
src/iceberg/puffin/blob_metadata.cc Implements ToString(const BlobMetadata&).
src/iceberg/puffin/file_metadata.h Defines FileMetadata footer metadata structure.
src/iceberg/puffin/file_metadata.cc Implements ToString(const FileMetadata&).
src/iceberg/puffin/puffin_compression_codec.h Declares codec enum + name conversion APIs.
src/iceberg/puffin/puffin_compression_codec.cc Implements codec name conversion/parsing + ToString.
src/iceberg/puffin/types.h Adds standard Puffin blob type/property string constants.
src/iceberg/puffin/CMakeLists.txt Installs Puffin headers via CMake.
src/iceberg/CMakeLists.txt Adds Puffin sources to ICEBERG_SOURCES and adds puffin subdirectory.
src/iceberg/test/puffin_test.cc Adds unit tests covering the new Puffin public APIs.
src/iceberg/test/CMakeLists.txt Registers puffin_test in the CMake test suite.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-basic-types branch 2 times, most recently from 8a7ed69 to 7e5818d Compare March 11, 2026 09:29
Copy link
Contributor

@evindj evindj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good just curious about the choice of using string literals instead of enums in some place.

Comment on lines +29 to +38
/// \brief Standard blob types defined by the Iceberg specification.
struct StandardBlobTypes {
/// A serialized form of a "compact" Theta sketch produced by the
/// Apache DataSketches library.
static constexpr std::string_view kApacheDatasketchesThetaV1 =
"apache-datasketches-theta-v1";

/// A serialized deletion vector according to the Iceberg spec.
static constexpr std::string_view kDeletionVectorV1 = "deletion-vector-v1";
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we do enum + toString ?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per the Puffin spec, type is a "JSON string" field. The spec lists known blob types under "Blob types" but doesn't restrict the field to only those values — new types can be added as the spec evolves. Both the Java and Rust implementations use String for this field rather than an enum. StandardBlobTypes provides well-known constants to avoid typos while keeping the field open for future extensibility.

blob.input_fields);
std::format_to(std::back_inserter(repr), "snapshotId={},sequenceNumber={},",
blob.snapshot_id, blob.sequence_number);
std::format_to(std::back_inserter(repr), "dataSize={}", blob.data.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: may be add a checksum if available for fast comparison(I do understand checksum is not part of the puffin spec).

Copy link
Author

@zhaoxuan1994 zhaoxuan1994 Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's an interesting idea! Currently Blob doesn't carry a checksum field (the Puffin spec doesn't include one either), so there's nothing available to display here. For equality comparison, Blob already has operator==. If we want to add checksum support down the road, it would probably be a separate discussion about the struct design itself — happy to explore that if there's a use case!

/// including the blob's location within the file.
struct ICEBERG_EXPORT BlobMetadata {
/// Type of the blob. See StandardBlobTypes for known types.
std::string type;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is weird that the type is declared as string. it makes it harder to understand what type of blobs are accepted but harder to catch typos.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reasoning as StandardBlobTypes — the spec requires this to be an open string.

int64_t sequence_number;
/// Offset in the file where the blob data starts.
int64_t offset;
/// Length of the blob data in the file (after compression, if compressed).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to check if the blob is compressed, Is it if the compression_codec is set?

Copy link
Author

@zhaoxuan1994 zhaoxuan1994 Mar 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, exactly. If compression_codec has a value, the blob is compressed; if it's std::nullopt, it's uncompressed. I'll improve the comment to make this clearer.

/// Length of the blob data in the file (after compression, if compressed).
int64_t length;
/// Compression codec name, or std::nullopt if uncompressed.
std::optional<std::string> compression_codec;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto same pattern string or enum

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BlobMetadata is a direct mapping of the Puffin footer JSON, where compression-codec is a JSON string (or absent). The Java implementation also uses @nullable String here. The conversion from string to PuffinCompressionCodec enum happens at a higher level via PuffinCompressionCodecFromName(), which returns Result<> to properly handle unknown codecs. This layering keeps the metadata faithful to the on-disk format and provides forward compatibility — if a future spec version adds a new codec, we can still deserialize the footer successfully and fail gracefully at decompression time rather than at parsing time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove std::optional here since an empty string is sufficient to represent the unset state.

case PuffinCompressionCodec::kZstd:
return kZstdCodecName;
}
std::unreachable();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider returning Result<> so we can avoid throwing an exception here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::unreachable() here isn't actually throwing — it's a C++23 compile-time hint that this code path is impossible, since the switch already covers all enum values. This is a common pattern in the codebase (e.g., transform.cc, type.cc, expression.cc, etc.). Returning Result<> would require callers to handle an error that can't happen in practice. That said, PuffinCompressionCodecFromName does return Result<> since its string input can genuinely be invalid — so the two functions intentionally use different strategies based on their input types.

Add the foundational types for Puffin file format support:
- Blob, BlobMetadata, FileMetadata structs
- PuffinCompressionCodec enum with codec name conversion
- StandardBlobTypes and StandardPuffinProperties constants
- ToString functions for all types
- 24 unit tests covering all public APIs
@zhaoxuan1994 zhaoxuan1994 force-pushed the feat/puffin-basic-types branch from 7e5818d to 0721621 Compare March 16, 2026 04:09
Copy link
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay! I have left some general comments. Let me know what you think.

Comment on lines +36 to +39
/// \brief A blob to be written to a Puffin file.
///
/// This represents the uncompressed blob data along with its metadata.
/// The actual compression is handled during writing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// \brief A blob to be written to a Puffin file.
///
/// This represents the uncompressed blob data along with its metadata.
/// The actual compression is handled during writing.
/// \brief A blob in a Puffin file.

Let's be concise and accurate. We don't need to mention either write or compress here since this is just a blob entry.

/// This represents the uncompressed blob data along with its metadata.
/// The actual compression is handled during writing.
struct ICEBERG_EXPORT Blob {
/// Type of the blob. See StandardBlobTypes for known types.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Type of the blob. See StandardBlobTypes for known types.
/// \brief Type of the blob. See StandardBlobTypes for known types.

Let's consistently use the doxygen style comment

Comment on lines +50 to +51
/// Length of the blob data in the file. If compression_codec is set, this is
/// the compressed size; otherwise it is the raw size.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Length of the blob data in the file. If compression_codec is set, this is
/// the compressed size; otherwise it is the raw size.
/// \brief Length of the blob data in the file.

IMO this is obvious.

/// Length of the blob data in the file (after compression, if compressed).
int64_t length;
/// Compression codec name, or std::nullopt if uncompressed.
std::optional<std::string> compression_codec;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove std::optional here since an empty string is sufficient to represent the unset state.


namespace iceberg::puffin {

/// \brief Metadata about a Puffin file.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend putting definitions of Blob, BlobMetadata and other puffin related classes to this file. We don't need to use a separate file for each class.

std::optional<std::string_view> CodecName(PuffinCompressionCodec codec) {
switch (codec) {
case PuffinCompressionCodec::kNone:
return std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not returning an empty string? Then you don't need to wrap the return type with optional.

}

Result<PuffinCompressionCodec> PuffinCompressionCodecFromName(
std::optional<std::string_view> codec_name) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question as above.

// Blob Tests
// ============================================================================

TEST(BlobTest, Equality) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I think test cases like this are unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants