anupwarrier / iceberg_athena

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Iceberg_Athena - ACID Transactions

Purpose

This repo includes examples to demonstrate how you can create an iceberg supported Athena table and interact with data in S3 via Athena ACID Transactions.

Step -by step flow

  1. Create S3 buckets to host the incoming data. For the demo, we will be creating 2 buckets, ie "homedetailsin" and "homedetailsall" buckets. The homedetailsin bucket will have transactional data thats coming from a database for example. The homedetailsall bucket will hold all the changes for the incoming data is accumulated.

  2. Create upload the sample csv file to homedetailsin bucket.

  3. Create a database in Glue Data Catalog.

  4. Create tables in Athena.

--Create Athena Tables
CREATE EXTERNAL TABLE homedetailsin
(
    home_id bigint,
    home_name string,
    value bigint,
    tour_date date,
    status string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'S3://homedetailsin/'
--Create Athena Tables with Iceberg format
CREATE TABLE homedetailsall
(
    home_id bigint,
    home_name string,
    value bigint,
    tour_date date,
    status string
)
PARTITIONED BY (day(tour_date))
LOCATION 's3://homedetailsall/'
TBLPROPERTIES ( 'table_type' = 'ICEBERG' )
  1. Run a SELECT query to retrive the data from S3 via Athena to ensure you can retrive the data
--query the data in S3 via Athena for homedetailsin
SELECT * FROM homedetailsin
  1. Insert into homedetailsall table from homedetailsin to populate some data
--populate data to homedetailsall from homedetailsin
INSERT INTO homedetailsall
SELECT * FROM homedetailsin
  1. Run a SELECT query to retrive the data from S3 via Athena to ensure you can retrive the data
--query the data in S3 via Athena for homedetailsall
SELECT * FROM homedetailsall
  1. Explore the S3 bucket homedetailsall to see the partitions that were created and the file in parquet format

  2. Update the table homedetailsall

UPDATE homedetailsall
SET VALUE=100000 WHERE home_id=1

--update the homedatsilsall table

UPDATE homedetailsall
SET VALUE=200000 WHERE home_id=1
  1. Time travel with snapshots Helps us see how data has changed over a period
SELECT * FROM "homedetailsall$history"

The result of the query shows how data has changed over a period of time using Snapshots(snapshot_id for ref)

If we query using a snapshot_id, we can see the data as of that snapshot_id

SELECT * FROM homedetailsall for VERSION AS of <snapshot_id here>
  1. Update the csv file simulating changes from the upstream databases & upload the file to S3 homedetailsin bucket.

  2. Merge the data to reflect the new updates from homedetailsin under homedetailsall

--MERGE
MERGE INTO homedetailsall hda USING homedetailsin hdi
    ON (hda.home_id = hdi.home_id)
    WHEN MATCHED
        THEN UPDATE SET home_name = hdi.home_name, value = hdi.value, tour_date = hdi.tour_date, status = hdi.status
    WHEN NOT MATCHED
        THEN INSERT (home_id,home_name,value,tour_date,status)
                VALUES(hdi.home_id,hdi.home_name,hdi.value,hdi.tour_date,hdi.status)

About

License:Creative Commons Zero v1.0 Universal