Why do I have junk front end files in S3?

It's complicated

Let's begin with the problem space. How do you manage far-future cache expiry for assets (CSS/JS/images) with a CDN, such that if you change a file in a release, users get the latest asset and not a stale one?

An age-old trick is to append the hash of every asset (CSS/JS/image) to its file name, so that a new, unique file gets cached whenever a source file's content changes. This is such an old trick that even front end build systems from a decade ago found ways to automate it.
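
As a rough illustration only (the file name here is hypothetical, and md5sum may be md5 on macOS), this is the kind of renaming a build tool does for you:

# Bake a short content hash into the asset's file name
hash=$(md5sum styles.css | cut -c 1-8)
cp styles.css "styles.${hash}.css"
# e.g. styles.css -> styles.3f2a9c1b.css; the HTML then references the hashed
# name, so a changed file gets a new name and a fresh cache entry automatically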

It has additional benefits too. When you roll out a new release, some of your users are probably using the website/app at the very moment you are deploying the new version, and even after the new release is deployed, some users may still have an older HTML file cached in their browser that references the older assets (this happens, and you have no control over it). So it is typical to not delete the previous release's assets for at least a couple of hours, to prevent current sessions from running into 404 errors.

So the simple deployment process I had only added new assets to my S3 bucket and never deleted the old ones. I didn't care enough to solve the issue. After a year of releasing stuff, I had an S3 bucket 4 GB in size, which really could have been a few MB. All I need is the current release's files plus the previous release's files for maybe up to 24 hours. Yeah, I know, S3 is cheap when the data isn't accessed and is just sitting there in storage. Yet it is sub-optimal.

S3 being cheap, combined with me not knowing how to solve this easily, is why I put the problem off for so long. My initial idea was to write a scheduled node.js job that goes through the S3 files, finds the old files and deletes them. But how do I find the "old files" without running the build again and finding the "current files" first?

Solution: Cron Job!?

EDIT Dec 27 2022: The original approach of this post turned out to be very slow. I have abandoned it and instead just run an aws s3 sync --delete job manually whenever needed.

What if we could tag the old files when a new release is deployed? The purpose is to mark the old files for deletion, so that a scheduled/cron job could do the actual deletion later. So... can we do this? It brings up new questions.

The first question: can S3 objects be tagged natively? Yes.

The second question: how do you tag the old files while uploading the new ones? You need the list of old files and the list of new files, and then you need to do a diff:

# List the new build's files (strip the leading "./" that find prints)
cd build
find . -type f | sed 's/^..//' | sort > ../build-files.txt
cd ..
# List all object keys currently in the bucket
aws --region eu-west-1 s3api list-objects --bucket bucket-name --query 'Contents[].{Key:Key}' --output text | sort > s3-files.txt
# Keep only the keys that are in S3 but not in the new build, i.e. the old files
comm -13 build-files.txt s3-files.txt > diff.txt

Let's go ahead and tag those files with the tag to-delete=true:

# Tag each old file (xargs appends the object key from diff.txt after --key)
cat diff.txt | xargs -n 1 -P 32 -d '\n' -t aws s3api put-object-tagging --bucket bucket-name --tagging 'TagSet=[{Key=to-delete,Value=true}]' --key

# cleanup intermediate files
rm build-files.txt s3-files.txt diff.txt

Next up, we need a scheduled job that goes through all the tagged files, checks whether 24 hours have passed since each file was tagged, and deletes it if so.

How do we know whether "24 hours have passed since the file was tagged"? One way is to put a timestamp in the value part of the tag instead of just true: to-delete=<seconds since epoch>. If I were writing a node.js job to do the deletion, I would prefer putting the timestamp in the tag value. However, S3 has a feature called "lifecycle rules": you can set up a rule to auto-delete files based on certain conditions, for example a rule that deletes all files tagged with to-delete=true one day after their creation date (a rough sketch of such a rule follows below). The tag has to be a constant, since the rule is configured only once, so you can't use a changing timestamp tag with this approach. It also means the creation date needs to be refreshed at the time you tag the files, else the files could get deleted almost immediately.
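
For reference, here is roughly what such a rule looks like when set up via the CLI (the bucket name and rule ID are placeholders; the same rule can also be configured from the S3 console):

# Lifecycle rule: expire objects tagged to-delete=true one day after their creation date
aws --region eu-west-1 s3api put-bucket-lifecycle-configuration --bucket bucket-name \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "delete-tagged-files",
      "Filter": {"Tag": {"Key": "to-delete", "Value": "true"}},
      "Status": "Enabled",
      "Expiration": {"Days": 1}
    }]
  }'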

So before we do the tagging, we need to force a change to the creation date by re-uploading those files from the bucket back into the bucket (copying each object onto itself). The AWS CLI won't allow re-uploading a file with zero changes, so we force it by setting some useless metadata:

# Copy each old object onto itself with a dummy metadata entry, which refreshes its creation date
cat diff.txt | xargs -n 1 -P 32 -d '\n' -t -I %1 aws --region eu-west-1 s3 cp s3://bucket-name/%1 s3://bucket-name/%1 --metadata to-delete=true --acl bucket-owner-full-control --acl public-read --no-progress

I went with this solution. Using a lifecycle rule means I don't have to write a node.js job for this, and I don't have to mess with EventBridge events etc. for managing the scheduling; S3 handles it for me. It seems like a simple solution, with some inefficiencies/trade-offs, that could scale for some time to come.

EDIT Dec 27 2022: As mentioned before, the original approach of this post turned out to be very slow. We have multiple deployments per day on the dev environment, and enough files accumulate in a single day (all of which get re-tagged/re-uploaded on every deployment) to slow the deployment down by 10+ minutes. I have abandoned this approach and instead just run an aws s3 sync --delete job manually whenever needed.
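
For completeness, the manual job is roughly this (assuming the build output lives in build/ and the same placeholder bucket name as above):

# Upload new/changed files and delete anything in the bucket that isn't in the local build
aws --region eu-west-1 s3 sync build/ s3://bucket-name/ --delete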

Thanks for reading.

