Why do I have junk front end files in S3?
Let’s begin with the problem space: how do you manage far-future cache expiry for assets (CSS/JS/images) behind a CDN, such that when you change a file in a release, users get the latest asset and not a stale one?
An age-old trick is to append a hash of each asset’s contents to its file name, so that a changed file is cached as a new, unique file. This trick is so old that even front-end build systems from a decade ago found ways to automate it.
It has additional benefits too. When you roll out a new release, some of your users are probably using the website/app at the very moment you are deploying, and even after the new release is live, some users may still have an older HTML file cached in their browser that references the older assets (this happens, and you have no control over it). So it is typical not to delete the previous release’s assets for at least a couple of hours, to prevent current sessions from hitting 404 errors.
So that’s the system I had. Deleting old assets causes issues, so every build appended new assets to my S3 bucket and never deleted the old ones. After a year of releasing stuff, I had an S3 bucket 4 GB in size! Which, of course, really could have been a few MB. Yeah, I know, S3 is cheap when the data is just sitting there in storage and isn’t accessed. Yet it is so sub-optimal. Why keep GBs of files in there, when all I need is the current release’s files, plus the previous release’s files for maybe up to 24 hours?
S3 being cheap, combined with me not knowing how to solve this easily, is why I put the problem off for so long. My initial thought was to write a scheduled Node.js job that goes through the S3 files, finds the old ones, and deletes them. But how do I find “old files” without running the build again and finding the “current files” first?
Solution: Cron Job!?
What if we could tag old files when a new release is deployed? The idea is to mark the old files for deletion, so that a scheduled/cron job can do the actual deleting later. So… can we do this? It brings up new questions.
First question: can S3 objects be tagged natively? Yes — though a terminology note is in order. S3 has two similar-looking features: user-defined “metadata” (the x-amz-meta-* headers) and object “tags”, and they are distinct. Lifecycle rules, which we will lean on later, can only filter on tags, so tags are what we want here.
The second question is how you tag the old files while uploading the new ones. One way: tag every file in the bucket for deletion, then upload the new build’s files, which overwrites the files that have not changed — but this time without the deletion tag. Hacky? I can’t think of an easier way. (Comment if you have another solution.)
There are two catches. First, once an S3 object is uploaded, you can’t edit its metadata or tags via the CLI without copying the object over itself. Second, the high-level aws s3 cp command can replace metadata during an in-place copy, but it has no option for object tags — and tags are what the lifecycle rule filters on — so tags have to be applied per object with the lower-level aws s3api. So the commands to do the above are:

# 1. Reset every object's creation date by copying it over itself
#    (an in-place copy is only allowed if the metadata is replaced):
aws s3 cp s3://bucket-name/ s3://bucket-name/ --recursive --metadata-directive REPLACE --acl public-read

# 2. Tag every object for deletion (assumes keys contain no whitespace):
aws s3api list-objects-v2 --bucket bucket-name --query 'Contents[].Key' --output text | tr '\t' '\n' | while read -r key; do aws s3api put-object-tagging --bucket bucket-name --key "$key" --tagging 'TagSet=[{Key=to-delete,Value=true}]'; done

# 3. Upload the new build:
aws s3 cp build s3://bucket-name/ --recursive --acl public-read

The first two steps add a to-delete=true tag to every file. Step 3 then overwrites the files that have not changed, and an overwrite discards the tag. So the files that still carry the tag afterwards are indeed old files that can be deleted later.
The side effect of this is that the creation date of every file, including the old ones, changes — which fortunately works in my favor (I’ll come to why).
Next up, we need a scheduled job that goes through all the tagged files, checks whether 24 hours have passed since each was tagged, and deletes it if so.
How do we know whether a file “has passed 24 hours since getting tagged”? Well, since the tagging pass also changed the creation date of every file, we could use the file’s creation date to calculate this. Another way is to put the timestamp into the value part of the tag itself: to-delete=&lt;seconds since epoch&gt;.
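The age check itself is tiny. A sketch under the timestamp-in-the-tag-value assumption (shouldDelete is a hypothetical name, not code from any real job):

```javascript
// Decide whether a tagged object is past its grace period.
// tagValue is the value of the to-delete tag, e.g. "1700000000".
function shouldDelete(tagValue, nowSeconds, graceSeconds = 24 * 60 * 60) {
  const taggedAt = Number(tagValue);
  if (!Number.isFinite(taggedAt)) return false; // unexpected tag format: keep the file
  return nowSeconds - taggedAt >= graceSeconds;
}
```

The job would list the bucket’s objects, read each one’s tags (e.g. via the SDK’s GetObjectTagging call), and delete the objects for which this returns true.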
If I were to write a Node.js job to do the deletion, I would prefer putting the timestamp in the tag value. However, S3 has a feature called “lifecycle rules”: you can set up a rule to auto-delete files based on certain conditions. You can configure a rule to delete all files tagged with to-delete=true one day after their creation date (see! the change in creation date helps us here). The tag has to be a constant, because the rule is configured only once — you can’t use a changing timestamp tag with this solution.
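For reference, such a rule can be expressed as a lifecycle configuration document (the rule ID is a placeholder; it can be applied with aws s3api put-bucket-lifecycle-configuration):

```json
{
  "Rules": [
    {
      "ID": "delete-old-assets",
      "Status": "Enabled",
      "Filter": { "Tag": { "Key": "to-delete", "Value": "true" } },
      "Expiration": { "Days": 1 }
    }
  ]
}
```

Expiration counts from the object’s creation (Last-Modified) date, which is exactly why resetting that date during the tagging pass gives old files their 24-hour grace period.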
The downside of this approach is that if you release several times a day, every day, the old files’ creation dates keep moving forward, so the deletion schedule gets pushed forward on every release. That is probably OK: at least during the weekend, when there are no releases, the lifecycle rule gets a chance to delete the files. You have weekends, right? Right? An international team with daily releases? Hmm, that won’t give the rule any breathing space to delete the files. Here is the assumption I am making for my team: we have a bi-weekly release schedule, so there is ample time for old files to get deleted.
This is the solution I went with. Using a lifecycle rule means I don’t have to write a Node.js job, and I don’t have to mess with EventBridge events etc. to manage the scheduling — S3 handles it for me. It seems like a simple solution, with some assumptions that probably won’t break for a long time to come.
Thanks for reading.