Apache Iceberg Setup with AWS EMR and Basic CRUD Operations in Iceberg Using Spark SQL

In this post, we discuss how to set up Apache Iceberg on an EMR cluster, use AWS Glue as the Iceberg catalog, run basic CRUD operations on Iceberg tables using Spark SQL, and see the changes in AWS Athena.

1. EMR Release: emr-6.6.0
2. Select Software: Hadoop 3.2.1 and Spark 3.2.0
3. Under Software Settings, specify the following configuration:
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0,software.amazon.awssdk:bundle:2.15.40,software.amazon.awssdk:url-connection-client:2.15.40",
      "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      "spark.sql.defaultCatalog": "glue_catalog",
      "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
      "spark.sql.catalog.glue_catalog.warehouse": "s3://disqo-datalake-nonprod/tmp/ristest/iceberg-test",
      "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
      "spark.sql.sources.partitionOverwriteMode": "dynamic"
    }
  },
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]

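4. In the configuration above, glue_catalog is the name of the catalog for the Iceberg tables; it is the value to use wherever <catalog_name> appears in the commands below. You can choose any name (lowercase with underscores is preferred), as long as you change it in spark.sql.defaultCatalog and in every spark.sql.catalog.* key. Replace the value of the warehouse property with the S3 path where the Iceberg data and metadata files will be stored.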

5. Continue with the remaining settings as required.
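
As a rough illustration of step 4, the catalog-related keys can be generated from the catalog name and the warehouse path, so that the name stays consistent across every property. The sketch below is in Scala and can be pasted into spark-shell; IcebergCatalogConf, sparkDefaults, my_catalog, and the bucket path are hypothetical names introduced for this example, while the property keys and class names are copied from the configuration in step 3.

```scala
// Hypothetical helper: builds the catalog-related spark-defaults properties
// from step 3 for a given catalog name and warehouse path.
object IcebergCatalogConf {
  def sparkDefaults(catalogName: String, warehousePath: String): Map[String, String] = {
    // Every catalog-specific key shares this prefix, so the name is set in one place.
    val prefix = s"spark.sql.catalog.$catalogName"
    Map(
      "spark.sql.extensions" -> "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      "spark.sql.defaultCatalog" -> catalogName,
      prefix -> "org.apache.iceberg.spark.SparkCatalog",
      s"$prefix.catalog-impl" -> "org.apache.iceberg.aws.glue.GlueCatalog",
      s"$prefix.warehouse" -> warehousePath,
      s"$prefix.io-impl" -> "org.apache.iceberg.aws.s3.S3FileIO"
    )
  }
}

// Example values only; substitute your own catalog name and bucket.
val conf = IcebergCatalogConf.sparkDefaults("my_catalog", "s3://my-bucket/iceberg-warehouse")
conf.toSeq.sortBy(_._1).foreach { case (key, value) => println(s"$key=$value") }
```

Each printed entry can be copied into the "Properties" block of the spark-defaults classification, or passed to spark-shell as a --conf key=value argument.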

All of the CRUD operations below are reflected in AWS Athena.

1. Create Iceberg Database:

spark.sql(""" CREATE DATABASE IF NOT EXISTS <iceberg_database_name> """)

2. Create Iceberg Table:

spark.sql(""" CREATE TABLE IF NOT EXISTS <iceberg_database_name>.<iceberg_table_name>
(column_name_1 column_type_1, column_name_2 column_type_2 ...)
USING iceberg
PARTITIONED BY (column_name)""")

3. Insert values into Iceberg Table:

spark.sql(""" INSERT INTO <iceberg_db>.<iceberg_table> 
VALUES (value_1, value_2),(value_1, value_2)""")

4. Read data from Iceberg Table:

spark.sql(""" SELECT * FROM <iceberg_db>.<iceberg_table>""")

5. Read snapshot history:

spark.sql(""" SELECT * FROM <catalog_name>.<iceberg_db>.<iceberg_table>.snapshots""")

6. Read data as of <snapshot_id>:

spark.read.option("snapshot-id", <snapshot_id>L)
  .table("<catalog_name>.<iceberg_db>.<iceberg_table>")

7. Read data as of <timestamp> (in milliseconds since the Unix epoch):

spark.read.option("as-of-timestamp", <timestamp>L)
  .table("<catalog_name>.<iceberg_db>.<iceberg_table>")
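
A human-readable time has to be converted to epoch milliseconds before it can be used as <timestamp> in step 7. A minimal sketch using java.time follows; the helper name toEpochMillis and the sample instant are made up for illustration.

```scala
import java.time.Instant

// Hypothetical helper: converts an ISO-8601 UTC timestamp to the
// epoch-millisecond value passed to the "as-of-timestamp" read option.
def toEpochMillis(isoUtc: String): Long = Instant.parse(isoUtc).toEpochMilli

val asOf: Long = toEpochMillis("2022-05-01T00:00:00Z")
println(asOf) // 1651363200000
```

The resulting Long can then be supplied in place of <timestamp>L in the read above.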

8. Roll back to a previous snapshot:

spark.sql(""" CALL <catalog_name>.system.rollback_to_snapshot('<iceberg_db>.<iceberg_table>', <snapshot_id>)""")
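
One way to find the <snapshot_id> to roll back to is to look at the parent_id column of the snapshots table from step 5. The sketch below works on an in-memory, simplified copy of those rows rather than on a live table; SnapshotRow, previousSnapshotId, and the sample IDs are hypothetical, and it assumes a linear history in which the most recent commit is the current snapshot.

```scala
// Hypothetical, simplified view of a row from the <iceberg_table>.snapshots metadata table.
case class SnapshotRow(committedAtMillis: Long, snapshotId: Long, parentId: Option[Long])

// Returns the parent of the most recently committed snapshot, if there is one.
// Assumes a linear history in which the latest commit is the current snapshot.
def previousSnapshotId(rows: Seq[SnapshotRow]): Option[Long] =
  if (rows.isEmpty) None else rows.maxBy(_.committedAtMillis).parentId

// Made-up sample history with three commits.
val history = Seq(
  SnapshotRow(1651363200000L, 101L, None),
  SnapshotRow(1651366800000L, 102L, Some(101L)),
  SnapshotRow(1651370400000L, 103L, Some(102L))
)
println(previousSnapshotId(history)) // Some(102)
```

If the table has already been rolled back once, the latest commit may no longer be the current snapshot, so it is worth checking the snapshot history by hand before relying on this shortcut.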

9. Roll back to a previous timestamp:

spark.sql(""" CALL <catalog_name>.system.rollback_to_timestamp('<iceberg_db>.<iceberg_table>', TIMESTAMP '<timestamp>')""")
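
The rollback_to_timestamp procedure takes a Spark SQL timestamp literal such as TIMESTAMP '2021-06-30 00:00:00.000'. The sketch below formats a java.time.LocalDateTime into that form and assembles the CALL statement as a string; the helper rollbackToTimestampSql and the catalog, database, and table names are made up for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Hypothetical helper: builds the CALL statement from step 9 for a given timestamp.
def rollbackToTimestampSql(catalog: String, table: String, ts: LocalDateTime): String = {
  // Pattern matching the 'yyyy-MM-dd HH:mm:ss.SSS' form of a timestamp literal.
  val literal = ts.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS"))
  s"CALL $catalog.system.rollback_to_timestamp('$table', TIMESTAMP '$literal')"
}

// Example names only; replace with your catalog, database, and table.
val stmt = rollbackToTimestampSql("my_catalog", "my_db.my_table", LocalDateTime.of(2021, 6, 30, 0, 0, 0))
println(stmt)
// CALL my_catalog.system.rollback_to_timestamp('my_db.my_table', TIMESTAMP '2021-06-30 00:00:00.000')
```

The string can then be executed with spark.sql(stmt).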
