
Process Apache Hudi, Delta Lake, and Apache Iceberg datasets at scale, Part 2: Using the AWS Glue Studio Visual Editor


Transactional data lake technologies such as Apache Hudi, Delta Lake, Apache Iceberg, and AWS Lake Formation governed tables are evolving rapidly and gaining great popularity. These technologies simplify the data processing pipeline significantly, and they provide additional useful capabilities such as upserts, rollback, and time travel queries.

In the first post of this series, we went through how to process Apache Hudi, Delta Lake, and Apache Iceberg datasets using AWS Glue connectors. AWS Glue simplifies reading and writing your data in these data lake formats, and building data lakes on top of these technologies. Running the sample notebooks on AWS Glue Studio notebooks, you can interactively develop and run your code, then immediately see the results. The notebooks let you explore how these technologies work when you have coding experience.

This second post focuses on other use cases for customers who prefer visual job authoring without writing custom code. Even without coding experience, you can easily build your transactional data lakes on the AWS Glue Studio visual editor and take advantage of these transactional data lake technologies. In addition, you can also use Amazon Athena to query the data stored using Hudi and Iceberg. This tutorial demonstrates how to read and write each format on the AWS Glue Studio visual editor, and then how to query it from Athena.


Prerequisites

The following are the instructions to read and write tables using each data lake format on the AWS Glue Studio Visual Editor. You can use either the marketplace connector or the custom connector, based on your requirements.

To continue this tutorial, you must create the following AWS resources in advance:

  1. An AWS Identity and Access Management (IAM) role for your AWS Glue jobs.
  2. An S3 bucket to store your table locations and output data.
  3. The connectors and connections created in the first post of this series (hudi-0101-byoc-connection, delta-100-byoc-connection, and iceberg-0131-byoc-connection).

Reads and writes using the connector on the AWS Glue Studio Visual Editor

In this tutorial, you read and write each of the transactional data lake formats on the AWS Glue Studio Visual Editor. There are three main configurations to set per data lake format: the connection, the connection options, and the job parameters. Note that the visual jobs themselves require no code; the short PySpark sketches that follow each procedure are optional illustrations of the equivalent configuration. Let's see how it works.

Apache Hudi writes

Complete the following steps to write to an Apache Hudi table using the connector (a PySpark sketch of the same configuration follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Amazon S3.
  5. For Target, choose hudi-0101-byoc-connector.
  6. Choose Create.
  7. Under Visual, choose Data source – S3 bucket.
  8. Under Node properties, for S3 source type, choose S3 location.
  9. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  10. Choose Data target – Connector.
  11. Under Node properties, for Connection, choose hudi-0101-byoc-connection.
  12. For Connection options, enter the following pairs of Key and Value (choose Add new option to enter a new pair).
    1. Key: path, Value: <your S3 path for the Hudi table location>
    2. Key: hoodie.table.name, Value: test
    3. Key: hoodie.datasource.write.storage.type, Value: COPY_ON_WRITE
    4. Key: hoodie.datasource.write.operation, Value: upsert
    5. Key: hoodie.datasource.write.recordkey.field, Value: location
    6. Key: hoodie.datasource.write.precombine.field, Value: date
    7. Key: hoodie.datasource.write.partitionpath.field, Value: iso_code
    8. Key: hoodie.datasource.write.hive_style_partitioning, Value: true
    9. Key: hoodie.datasource.hive_sync.enable, Value: true
    10. Key: hoodie.datasource.hive_sync.database, Value: hudi
    11. Key: hoodie.datasource.hive_sync.table, Value: test
    12. Key: hoodie.datasource.hive_sync.partition_fields, Value: iso_code
    13. Key: hoodie.datasource.hive_sync.partition_extractor_class, Value: org.apache.hudi.hive.MultiPartKeysValueExtractor
    14. Key: hoodie.datasource.hive_sync.use_jdbc, Value: false
    15. Key: hoodie.datasource.hive_sync.mode, Value: hms
  13. Under Job details, for IAM Role, choose your IAM role.
  14. Under Advanced properties, for Job parameters, choose Add new parameter.
  15. For Key, enter --conf.
  16. For Value, enter spark.serializer=org.apache.spark.serializer.KryoSerializer.
  17. Choose Save.
  18. Choose Run.
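If you prefer to see what the preceding visual configuration roughly corresponds to in code, the following is a minimal PySpark sketch that writes the same public dataset with the Hudi DataFrame writer. The output path and Spark session setup are illustrative assumptions; the visual job itself does not require any code.

from pyspark.sql import SparkSession

# Hudi requires the Kryo serializer, matching the --conf job parameter in the visual job
spark = (SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# Read the public COVID-19 dataset used as the S3 source node
df = spark.read.json("s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/")

# The same key-value pairs entered as connection options in the visual editor
hudi_options = {
    "hoodie.table.name": "test",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "location",
    "hoodie.datasource.write.precombine.field": "date",
    "hoodie.datasource.write.partitionpath.field": "iso_code",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.database": "hudi",
    "hoodie.datasource.hive_sync.table": "test",
    "hoodie.datasource.hive_sync.partition_fields": "iso_code",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.mode": "hms",
}

# Write to your Hudi table location (placeholder path)
(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://your_s3_bucket/hudi/test/"))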

Apache Hudi reads

Complete the following steps to read from the Apache Hudi table that you created in the previous section using the connector (a PySpark sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose hudi-0101-byoc-connector.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose hudi-0101-byoc-connection.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter your S3 path for the Hudi table that you created in the previous section.
  11. Choose Transform – ApplyMapping, and choose Remove.
  12. Choose Data target – S3 bucket.
  13. Under Data target properties, for Format, choose JSON.
  14. For S3 Target type, choose S3 location.
  15. For S3 Target Location, enter your S3 path for the output location.
  16. Under Job details, for IAM Role, choose your IAM role.
  17. Choose Save.
  18. Choose Run.
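As with the write job, here is a rough PySpark equivalent of the read job, assuming placeholder paths for the Hudi table and the JSON output location.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate())

# Load the Hudi table from its S3 location (the path connection option)
df = spark.read.format("hudi").load("s3://your_s3_bucket/hudi/test/")

# Export the records as JSON, mirroring the S3 data target node
df.write.mode("overwrite").json("s3://your_s3_bucket/output/hudi-read/")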

Delta Lake writes

Complete the following steps to write to the Delta Lake table using the connector (a PySpark sketch of the same configuration follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Amazon S3.
  5. For Target, choose delta-100-byoc-connector.
  6. Choose Create.
  7. Under Visual, choose Data source – S3 bucket.
  8. Under Node properties, for S3 source type, choose S3 location.
  9. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  10. Choose Data target – Connector.
  11. Under Node properties, for Connection, choose your delta-100-byoc-connection.
  12. For Connection options, choose Add new option.
  13. For Key, enter path. For Value, enter your S3 path for the Delta table location. Choose Add new option.
  14. For Key, enter partitionKeys. For Value, enter iso_code.
  15. Under Job details, for IAM Role, choose your IAM role.
  16. Under Advanced properties, for Job parameters, choose Add new parameter.
  17. For Key, enter --conf.
  18. For Value, enter spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
  19. Choose Save.
  20. Choose Run.
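The following is a minimal PySpark sketch of the same Delta Lake write, with the Spark session settings taken from the --conf job parameter above; the output path is a placeholder.

from pyspark.sql import SparkSession

# These two settings mirror the --conf job parameter in the visual job
spark = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Read the public COVID-19 dataset used as the S3 source node
df = spark.read.json("s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/")

# partitionBy matches the partitionKeys connection option; the path is a placeholder
(df.write.format("delta")
    .partitionBy("iso_code")
    .mode("append")
    .save("s3://your_s3_bucket/delta/test/"))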

Delta Lake reads

Complete the following steps to read from the Delta Lake table that you created in the previous section using the connector (a PySpark sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose delta-100-byoc-connector.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose delta-100-byoc-connection.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter your S3 path for the Delta table that you created in the previous section. Choose Add new option.
  11. For Key, enter partitionKeys. For Value, enter iso_code.
  12. Choose Transform – ApplyMapping, and choose Remove.
  13. Choose Data target – S3 bucket.
  14. Under Data target properties, for Format, choose JSON.
  15. For S3 Target type, choose S3 location.
  16. For S3 Target Location, enter your S3 path for the output location.
  17. Under Job details, for IAM Role, choose your IAM role.
  18. Under Advanced properties, for Job parameters, choose Add new parameter.
  19. For Key, enter --conf.
  20. For Value, enter spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog.
  21. Choose Save.
  22. Choose Run.
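And a rough PySpark equivalent of the Delta Lake read job, again with placeholder paths for the table and the JSON output.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate())

# Load the Delta table written in the previous section (placeholder path)
df = spark.read.format("delta").load("s3://your_s3_bucket/delta/test/")

# Export the records as JSON, mirroring the S3 data target node
df.write.mode("overwrite").json("s3://your_s3_bucket/output/delta-read/")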

Apache Iceberg writes

Complete the following steps to write to an Apache Iceberg table using the connector (a PySpark sketch of the same configuration follows the steps):

  1. Open the AWS Glue console.
  2. Choose Databases.
  3. Choose Add database.
  4. For database name, enter iceberg, and choose Create.
  5. Open AWS Glue Studio.
  6. Choose Jobs.
  7. Choose Visual with a source and target.
  8. For Source, choose Amazon S3.
  9. For Target, choose iceberg-0131-byoc-connector.
  10. Choose Create.
  11. Under Visual, choose Data source – S3 bucket.
  12. Under Node properties, for S3 source type, choose S3 location.
  13. For S3 URL, enter s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/.
  14. Choose Data target – Connector.
  15. Under Node properties, for Connection, choose iceberg-0131-byoc-connection.
  16. For Connection options, choose Add new option.
  17. For Key, enter path. For Value, enter glue_catalog.iceberg.test.
  18. Choose SQL under Transform to create a new AWS Glue Studio node.
  19. Under Node properties, for Node parents, choose ApplyMapping.
  20. Under Transform, for SQL alias, verify that myDataSource is entered.
  21. For SQL query, enter CREATE TABLE glue_catalog.iceberg.test AS SELECT * FROM myDataSource WHERE 1=2. This creates a table definition with no records, because the Iceberg target requires a table definition before data ingestion.
  22. Under Job details, for IAM Role, choose your IAM role.
  23. Under Advanced properties, for Job parameters, choose Add new parameter.
  24. For Key, enter --conf.
  25. For Value, enter the following value (replace the placeholder your_s3_bucket with your S3 bucket name): spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your_s3_bucket/iceberg/warehouse --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=iceberg_lock --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  26. Choose Save.
  27. Choose Run.
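The following is a minimal PySpark sketch of the same Iceberg write, assuming the catalog settings from the --conf job parameter above; the warehouse bucket is a placeholder.

from pyspark.sql import SparkSession

# Catalog settings mirror the --conf job parameter; replace your_s3_bucket with your bucket name
spark = (SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your_s3_bucket/iceberg/warehouse")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager")
    .config("spark.sql.catalog.glue_catalog.lock.table", "iceberg_lock")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate())

# Read the public COVID-19 dataset used as the S3 source node
df = spark.read.json("s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/json/")

# Create an empty table definition first (the role of the SQL transform node), then append the data
df.createOrReplaceTempView("myDataSource")
spark.sql("CREATE TABLE glue_catalog.iceberg.test AS SELECT * FROM myDataSource WHERE 1=2")
df.writeTo("glue_catalog.iceberg.test").append()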

Apache Iceberg reads

Complete the following steps to read from the Apache Iceberg table that you created in the previous section using the connector (a PySpark sketch follows the steps):

  1. Open AWS Glue Studio.
  2. Choose Jobs.
  3. Choose Visual with a source and target.
  4. For Source, choose Apache Iceberg Connector for AWS Glue 3.0.
  5. For Target, choose Amazon S3.
  6. Choose Create.
  7. Under Visual, choose Data source – Connection.
  8. Under Node properties, for Connection, choose your Iceberg connection name.
  9. For Connection options, choose Add new option.
  10. For Key, enter path. For Value, enter glue_catalog.iceberg.test.
  11. Choose Transform – ApplyMapping, and choose Remove.
  12. Choose Data target – S3 bucket.
  13. Under Data target properties, for Format, choose JSON.
  14. For S3 Target type, choose S3 location.
  15. For S3 Target Location, enter your S3 path for the output location.
  16. Under Job details, for IAM Role, choose your IAM role.
  17. Under Advanced properties, for Job parameters, choose Add new parameter.
  18. For Key, enter --conf.
  19. For Value, enter the following value (replace the placeholder your_s3_bucket with your S3 bucket name): spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.warehouse=s3://your_s3_bucket/iceberg/warehouse --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=iceberg_lock --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
  20. Choose Save.
  21. Choose Run.
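And a rough PySpark equivalent of the Iceberg read job, reusing the same glue_catalog settings; the bucket and output path are placeholders.

from pyspark.sql import SparkSession

# Same glue_catalog settings as the write sketch above
spark = (SparkSession.builder
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://your_s3_bucket/iceberg/warehouse")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate())

# Read the Iceberg table through the Glue catalog and export it as JSON
df = spark.table("glue_catalog.iceberg.test")
df.write.mode("overwrite").json("s3://your_s3_bucket/output/iceberg-read/")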

Query from Athena

The Hudi table and the Iceberg table created with the preceding instructions are also queryable from Athena.

  1. Open the Athena console.
  2. Run the following SQL to query the Hudi table:
    SELECT * FROM "hudi"."test" LIMIT 10

  3. Run the following SQL to query the Iceberg table:
    SELECT * FROM "iceberg"."test" LIMIT 10

If you want to query the Delta table from Athena, follow Presto, Trino, and Athena to Delta Lake integration using manifests.

Conclusion

This post summarized how to utilize Apache Hudi, Delta Lake, and Apache Iceberg on the AWS Glue platform, and demonstrated how each format works with the AWS Glue Studio Visual Editor. You can start using these data lake formats easily with AWS Glue DynamicFrames, Spark DataFrames, and Spark SQL in AWS Glue jobs, AWS Glue Studio notebooks, and the AWS Glue Studio visual editor.


About the Author

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He enjoys collaborating with different teams to deliver outcomes like this post. In his spare time, he enjoys playing video games with his family.
