Tracking Processed Data Using AWS Glue Job Bookmarks | Incremental ETL In-depth intuition
Knowledge Amplifier

 Published On Jun 6, 2022

Why only incremental ingestion? Why not a complete incremental pipeline, covering ingestion, curation, and publishing of data?

If the business logic implemented in the curation layer does not depend on previously processed data, then not only ingestion but the complete pipeline can be made incremental, and AWS Glue gives us the opportunity to do so using one of its most powerful features: Job Bookmarking 😊

In this video, I discuss the Job Bookmarking concept in Glue.

For details, you can refer to this documentation:
https://docs.aws.amazon.com/glue/late...

V.V.I. Note:
-----------------
To identify which files stored on S3 to process, job bookmarks check the last modified time of the objects, not the file names. If your input objects changed since the last time the job ran, then they are reprocessed when the job runs again.
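The note above can be illustrated with a small, simplified model in plain Python. This is only a sketch of the idea (selecting objects by last modified time rather than by name); it is not how Glue implements bookmarks internally, and the names S3Object and select_unprocessed are hypothetical helpers invented for this illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class S3Object:
    # Hypothetical stand-in for an S3 listing entry (not a boto3 type).
    key: str
    last_modified: datetime


def select_unprocessed(
    objects: List[S3Object], bookmark: Optional[datetime]
) -> Tuple[List[S3Object], Optional[datetime]]:
    """Pick objects modified after the bookmark and return the advanced bookmark.

    Simplified assumption: the bookmark is a single timestamp. File names are
    never compared, only last-modified times.
    """
    pending = [
        obj for obj in objects
        if bookmark is None or obj.last_modified > bookmark
    ]
    # If nothing is pending, the bookmark stays where it was.
    new_bookmark = max((obj.last_modified for obj in pending), default=bookmark)
    return pending, new_bookmark


# First run: no bookmark yet, so every object is processed.
day1 = datetime(2022, 6, 1, tzinfo=timezone.utc)
day2 = datetime(2022, 6, 2, tzinfo=timezone.utc)
day3 = datetime(2022, 6, 3, tzinfo=timezone.utc)

listing = [S3Object("a.csv", day1), S3Object("b.csv", day2)]
first_batch, bookmark = select_unprocessed(listing, None)

# a.csv is overwritten under the same name, so only its timestamp changes.
listing = [S3Object("a.csv", day3), S3Object("b.csv", day2)]
second_batch, bookmark = select_unprocessed(listing, bookmark)

print([obj.key for obj in first_batch])
print([obj.key for obj in second_batch])
```

In this model, the second run picks up a.csv again even though its name is unchanged, because its last modified time is newer than the stored bookmark, while b.csv is skipped.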

Prerequisite:
------------------------
An automated data pipeline using Lambda, S3 and Glue - Big Data with Cloud Computing

How to Use AWS Glue with Snowflake | PySpark-Snowflake Connectivity

Set up the necessary AWS services to query the data inside an Amazon S3 (Datalake) using AWS Athena

Transform Data Using AWS Glue and Amazon Athena

Schema Evolution in AWS Glue using Glue Crawler | AWS Athena

Simplify Amazon DynamoDB data extraction and analysis by using AWS Glue and Amazon Athena

AWS Glue as Hive catalog
   • Using the AWS Glue Data Catalog as th...  

A very frequent technical requirement in the big data domain:

You have to write a Spark DataFrame with KMS encryption. If you are using Glue, this is one approach you can try to improve the security of your pipeline by enabling server-side encryption.
   • Security Configurations on the AWS Gl...  

Incremental Glue crawling using Amazon S3 Event Notifications

Check this playlist for more Data Engineering related videos:
   • Demystifying Data Engineering with Cl...  
