User:Milimetric/Learning iceberg

From Wikitech

((in the process of moving this to a set of slides since it's way too boring as a wiki page))

Apache Iceberg. Let's learn.

Data is stored in data files. Find these in <<your table>>/data/<<here>>. A manifest file points to multiple data files. A manifest list points to multiple manifest files. A snapshot is metadata that references a table schema and points to a manifest list. Find these in <<your table>>/metadata/<<here>> in various Avro, JSON, and text formats. There are two main reasons for this structure.

  • Access to exactly the data you want. Whether you're updating a record, querying, or joining, Iceberg's metadata is designed to let you reach the data you want with as little IO overhead as possible.
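To make the hierarchy concrete, here is a minimal Python sketch of how a reader could walk snapshot → manifest list → manifest → data file and skip everything that cannot contain the value it wants. It is a toy model, not Iceberg's API: the class names, the single integer `lower`/`upper` bounds, and `plan_scan` are all hypothetical, and real manifests track bounds per column and per partition field.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    path: str
    lower: int  # smallest value of the filtered column in this file (toy stat)
    upper: int  # largest value of the filtered column in this file (toy stat)


@dataclass
class Manifest:
    path: str
    data_files: List[DataFile]

    # Summary bounds, standing in for what a manifest list keeps per manifest.
    @property
    def lower(self) -> int:
        return min(f.lower for f in self.data_files)

    @property
    def upper(self) -> int:
        return max(f.upper for f in self.data_files)


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: List[Manifest]


def plan_scan(snapshot: Snapshot, value: int) -> List[str]:
    """Return paths of data files whose [lower, upper] range may contain value."""
    selected = []
    for manifest in snapshot.manifest_list:
        if not (manifest.lower <= value <= manifest.upper):
            continue  # prune the whole manifest without looking at its entries
        for data_file in manifest.data_files:
            if data_file.lower <= value <= data_file.upper:
                selected.append(data_file.path)
    return selected


snapshot = Snapshot(
    snapshot_id=1,
    manifest_list=[
        Manifest("metadata/m0.avro", [
            DataFile("data/a.parquet", 0, 99),
            DataFile("data/b.parquet", 100, 199),
        ]),
        Manifest("metadata/m1.avro", [
            DataFile("data/c.parquet", 200, 299),
        ]),
    ],
)

print(plan_scan(snapshot, 150))
```

Only one of the three data files is selected for the value 150, and the second manifest is never opened. That is the IO saving the metadata layers are meant to buy.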

Tuning

It's important[1] to understand your file formats when tuning Iceberg. The data file size, for example, should be coordinated with the row group size and the HDFS block size.
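As a rough illustration of what "coordinated" might mean, the sketch below checks three sizes against each other. The helper `check_sizes` and its rules of thumb (a row group should fit in one HDFS block, and the target file size should be a whole number of row groups) are assumptions made for this example, not an official Iceberg recommendation. The sizes used correspond to the defaults of the Iceberg table properties `write.target-file-size-bytes` (512 MiB) and `write.parquet.row-group-size-bytes` (128 MiB), and of HDFS `dfs.blocksize` (128 MiB).

```python
MiB = 1024 * 1024


def check_sizes(target_file_bytes: int, row_group_bytes: int, hdfs_block_bytes: int) -> dict:
    """Compare file, row group, and block sizes using simple rules of thumb."""
    notes = []
    if row_group_bytes > hdfs_block_bytes:
        # Reading one row group would touch more than one HDFS block.
        notes.append("row group is larger than an HDFS block")
    elif hdfs_block_bytes % row_group_bytes != 0:
        # Some row groups would straddle a block boundary.
        notes.append("row groups do not pack evenly into an HDFS block")
    if target_file_bytes % row_group_bytes != 0:
        # The last row group of each full-size file would be undersized.
        notes.append("target file size is not a whole number of row groups")
    return {
        # Ceiling division: a partial row group or block still counts as one.
        "row_groups_per_file": -(-target_file_bytes // row_group_bytes),
        "blocks_per_file": -(-target_file_bytes // hdfs_block_bytes),
        "notes": notes,
    }


defaults = check_sizes(512 * MiB, 128 * MiB, 128 * MiB)
mismatched = check_sizes(512 * MiB, 192 * MiB, 128 * MiB)
print(defaults)
print(mismatched)
```

With the default sizes the check reports four row groups and four blocks per file and no notes. The 192 MiB row group in the second call trips both the block rule and the whole-number rule.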

Experiments

Guidelines

You cannot upsert one million rows per second into an Iceberg table. So what limits do Iceberg's metadata mechanisms imply? Iceberg guarantees atomic operations: if a snapshot exists, it will be consistent no matter how or when it's accessed. To


Links & References

  1. https://www.dremio.com/blog/tuning-parquet/