ian-whitestone / pyspark-vs-dask

[WIP] Comparing pyspark and dask for speed, memory/CPU usage, and ease of use


File cleanup

ian-whitestone opened this issue

When generating the fake data, the scripts started interfering with each other (they were writing to the same filenames) partway through, so I cancelled the jobs and restarted with new file prefixes.

Need to clean up the old files with the outdated prefixes.

import re

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('dask-avro-data')

# Keys with the outdated prefixes, e.g. application-data/123.avro
pattern = re.compile(
    r'application-data/\d*\.avro|fulfillment-data/\d*\.avro|scoring-data/\d*\.avro'
)

# Materialize the listing first so we aren't deleting while paginating
objects = list(my_bucket.objects.all())

for obj in objects:
    if pattern.match(obj.key):
        obj.delete()
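
If there are a lot of matching files, deleting them one API call at a time will be slow. A minimal sketch of a batched alternative, assuming the same 'dask-avro-data' bucket and regex, using boto3's delete_objects (which takes up to 1,000 keys per request):

import re

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('dask-avro-data')

pattern = re.compile(
    r'application-data/\d*\.avro|fulfillment-data/\d*\.avro|scoring-data/\d*\.avro'
)

# Collect the keys that still use the outdated prefixes
keys = [obj.key for obj in my_bucket.objects.all() if pattern.match(obj.key)]

# delete_objects accepts at most 1,000 keys per request, so chunk the list
for i in range(0, len(keys), 1000):
    chunk = keys[i:i + 1000]
    my_bucket.delete_objects(
        Delete={'Objects': [{'Key': key} for key in chunk]}
    )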