User:Milimetric/Learning iceberg
((in the process of moving this to a set of slides since it's way too boring as a wiki page))
Apache Iceberg. Let's learn.
Data is stored in a data file. Find these in <<your table>>/data/<<here>>.
A manifest file points to multiple data files. A manifest list points to multiple manifest files. A snapshot is metadata that includes a table schema and a pointer to a manifest list. Find these in <<your table>>/metadata/<<here>> in various Avro, JSON, and text formats. There are two main reasons for this structure.
- Access to exactly the data you want. Whether you're updating a record, querying, or joining, Iceberg metadata wants you to be able to access the data you want with as little IO overhead as possible.
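The hierarchy above can be sketched as a small in-memory model. This is only an illustration of how metadata lets a reader skip data it does not need. It is not Iceberg's API, and every class name, field, and file path here is hypothetical. The min/max values stand in for the column statistics that real manifests carry.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataFile:
    path: str   # hypothetical path under <<your table>>/data/
    lower: int  # smallest value of the filtered column in this file
    upper: int  # largest value of the filtered column in this file


@dataclass
class Manifest:
    path: str
    data_files: List[DataFile]

    def bounds(self) -> Tuple[int, int]:
        # Summary of everything this manifest points to.
        return (min(f.lower for f in self.data_files),
                max(f.upper for f in self.data_files))


@dataclass
class ManifestList:
    path: str
    manifests: List[Manifest]


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList


def plan_scan(snapshot: Snapshot, value: int) -> List[str]:
    """Return the data files that may contain `value`, using metadata only."""
    hits = []
    for manifest in snapshot.manifest_list.manifests:
        lo, hi = manifest.bounds()
        if not (lo <= value <= hi):
            continue  # the whole manifest is skipped, none of its files are opened
        for f in manifest.data_files:
            if f.lower <= value <= f.upper:
                hits.append(f.path)
    return hits


# Hypothetical table with two manifests and four data files.
snap = Snapshot(
    snapshot_id=1,
    manifest_list=ManifestList(
        path="metadata/snap-1.avro",
        manifests=[
            Manifest("metadata/m1.avro", [DataFile("data/a.parquet", 0, 9),
                                          DataFile("data/b.parquet", 10, 19)]),
            Manifest("metadata/m2.avro", [DataFile("data/c.parquet", 20, 29),
                                          DataFile("data/d.parquet", 30, 39)]),
        ],
    ),
)

print(plan_scan(snap, 25))  # only data/c.parquet qualifies
```

Looking up the value 25 never touches the first manifest or its files, which is the "as little IO overhead as possible" idea in miniature.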
Tuning
It's important[1] to understand your file formats when tuning Iceberg. The data file size, for example, should be coordinated with the row group size and HDFS block size.
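As a rough sketch of what "coordinated" could mean, the helper below checks three sizes against each other. The two rules it applies are assumed rules of thumb, not requirements stated by Iceberg. The function name is hypothetical. The sizes would normally come from the Iceberg table properties `write.target-file-size-bytes` and `write.parquet.row-group-size-bytes` and from the HDFS setting `dfs.blocksize`, so check the values in your own environment.

```python
from typing import List

MB = 1024 * 1024


def check_sizes(target_file_size: int, row_group_size: int,
                hdfs_block_size: int) -> List[str]:
    """Return warnings for size combinations that look misaligned.

    Both rules are assumptions made for illustration only.
    """
    warnings = []
    # Assumed rule 1: a whole number of row groups should fit in one block,
    # so that no row group straddles a block boundary.
    if hdfs_block_size % row_group_size != 0:
        warnings.append("row groups do not pack evenly into an HDFS block")
    # Assumed rule 2: a data file should hold a whole number of row groups.
    if target_file_size % row_group_size != 0:
        warnings.append("target file size is not a multiple of the row group size")
    return warnings


# Example values only; verify the real settings for your table and cluster.
print(check_sizes(512 * MB, 128 * MB, 128 * MB))
print(check_sizes(512 * MB, 192 * MB, 128 * MB))
```

The first call returns no warnings because everything is a multiple of 128 MB. The second returns both warnings because a 192 MB row group fits neither the block nor the file evenly.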
Experiments
- User:Milimetric/Learning_iceberg/Spark_streaming_append. How can we update Iceberg tables from Spark streaming?
- User:Milimetric/Learning_iceberg/Copy_large_table. Move a lot of data into an Iceberg table.
Guidelines
You cannot upsert one million rows per second into an Iceberg table. So what limits do Iceberg's metadata mechanisms imply? Iceberg guarantees atomic operations: if a snapshot exists, it will be consistent no matter how or when it's accessed. To
Links & References
- All the configuration options
- Expedia adds Hive integration
- Compact only when you have to, but if you do, consider Z-ordering and clustering
- More Maintenance
- Iceberg intro with a more complete example
- Using Coalesce, Repartition, etc. from Spark SQL
- Fine-tuning vectorization, especially for complex types
- Query planning time and manifests
- Understand Spark memory management