notdodo / SparkTrail

Query AWS CloudTrail using Spark (Python) to perform analysis


Use this Python script to start a standalone Spark session that interacts with the CloudTrail bucket. Spark allows querying the logs using a SQL-like syntax.

The main.py startup script automatically loads SSO credentials and derives temporary AWS credentials from the SSO session to authenticate to the bucket.
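
The exact contents of main.py are the project's own; the sketch below only illustrates the general SSO-to-bucket wiring, assuming a boto3 session backed by an SSO profile and the s3a connector shipped with hadoop-aws (the profile name is a placeholder, not something the script actually defines):

import boto3
from pyspark.sql import SparkSession

# Resolve temporary credentials from an SSO-backed profile
# ("audit" is a placeholder profile name, not the script's real setting).
session = boto3.Session(profile_name="audit")
creds = session.get_credentials().get_frozen_credentials()

# Hand the temporary credentials to the s3a connector.
spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("fs.s3a.aws.credentials.provider",
         "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
conf.set("fs.s3a.access.key", creds.access_key)
conf.set("fs.s3a.secret.key", creds.secret_key)
conf.set("fs.s3a.session.token", creds.token)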

Usage

  1. Spawn a Poetry shell:

poetry shell

  2. Start the cluster:

PYSPARK_DRIVER_PYTHON=ipython PYTHONSTARTUP=main.py pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4,com.amazonaws:aws-java-sdk-bundle:1.12.262 --driver-memory 15G --executor-memory 5G --name SparkTrail

  3. The environment is now configured. Once the IPython shell is up, link the S3 bucket and start running queries:
trail = link_s3("audit-cloudtrail-logs/AWSLogs/")
trail.select("Records.eventName").distinct().show(10)
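
Since link_s3 evidently returns a DataFrame, you can also fall back to plain SQL by registering a temporary view. A minimal sketch, assuming the trail handle from above and the usual top-level Records array in each CloudTrail file (the view name and query are illustrative):

# Expose the DataFrame to Spark SQL under a temporary view name.
trail.createOrReplaceTempView("cloudtrail")

# Count events per name with ordinary SQL, exploding the Records array.
spark.sql("""
    SELECT record.eventName, COUNT(*) AS occurrences
    FROM cloudtrail
    LATERAL VIEW explode(Records) r AS record
    GROUP BY record.eventName
    ORDER BY occurrences DESC
    LIMIT 10
""").show(truncate=False)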

Notes

  • You can use this script with any bucket of partitioned JSON files (e.g., Databricks audit logs); see the sketch after this list
  • You may need to adjust the --driver-memory and --executor-memory values to fit your environment
  • You need to run aws sso login before starting the cluster
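
For example, pointing the same helper at another prefix of partitioned JSON logs should work unchanged (the bucket name below is purely illustrative):

# "databricks-audit-logs/" is a placeholder bucket/prefix.
logs = link_s3("databricks-audit-logs/")
logs.printSchema()  # inspect the inferred schema before querying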
