It is hard to find any information on how cloud services handle a sudden spike in the number of requests, which makes it hard to make an informed decision when picking a cloud service. This blog post is written like a quick note, so forgive my writing style.

Consider the following scenario: I have a service receiving 1 concurrent request at a given time, and it's been steady that way for hours. One of my clients suddenly sends a batch of 100 concurrent requests. How does the cloud infrastructure respond to this sudden uptick in requests? Let's also say that I want one instance to handle only a few concurrent requests at a time. For the purpose of studying scaling, let's just say 1 instance should handle only 1 concurrent request. Let's go through some of the cloud services I have tested this case on.
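To make the test concrete, here is a minimal sketch of what such a spike test can look like (stdlib only; the endpoint URL is a placeholder, and this is not the exact script I ran against each service):

```python
# Spike-test sketch: fire 100 concurrent GETs at a placeholder
# endpoint and tally status codes / errors.
import urllib.request
import urllib.error
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

URL = "https://example.com/"  # placeholder: your service endpoint
CONCURRENCY = 100

def hit(_):
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code            # e.g. a 429 response
    except Exception as e:
        return type(e).__name__  # e.g. ConnectionResetError

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    print(Counter(pool.map(hit, range(CONCURRENCY))))
```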

AWS ECS Fargate + ELB

Fargate + ELB takes at least a minute to start scaling up, because CloudWatch metrics only get updated at a minimum interval of 1 minute. Until then, the current instances will accept all of the requests. In some cases this behavior is not bad, but in others it could make all requests slow if the instances start hitting their max resource usage.
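For reference, here is a hedged sketch of what an ECS target-tracking scaling policy looks like via boto3 (cluster, service, and target group names are placeholders; I'm not claiming these were my exact settings). The policy reacts to a CloudWatch metric, which is why the reaction lags the spike:

```python
# Sketch: scale an ECS service on ALB request count per target.
# CloudWatch only emits datapoints at 1-minute granularity, so the
# policy cannot react faster than that.
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the service's desired count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",      # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=1,
    MaxCapacity=25,
)

autoscaling.put_scaling_policy(
    PolicyName="requests-per-target",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",      # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1.0,  # aim for 1 request per instance
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/my-alb/xxx/targetgroup/my-tg/yyy",  # placeholder
        },
    },
)
```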

AWS App Runner

I set max concurrency to 1 and max instances to 25 (that's the max AWS currently allows) and sent a spike of 100 requests. App Runner seems to throw an HTTP 429 “Max queue length has been reached” error on the first spike it hits. It then scales up, so it is ready for the next spike: if I send another batch of 100 concurrent requests after a few seconds, it seems able to handle it.

fly.io

(If you haven’t heard of fly.io: it is a newer player in town that works like App Runner, but makes it easy to deploy to multiple regions.)

I set the max concurrency hard limit to 1 and max instances to 10. fly.io throws connection reset errors on such a spike, and sending another batch of 100 requests after a few seconds does the same. Changing max instances doesn’t seem to change this behavior. I guess it takes a bit of time to scale.

GCP Cloud Run

I set max concurrency to 1 and max instances to 100. Cloud Run spawns new instances immediately, which of course incurs a cold start, but it responds to all the requests without throwing any errors. Here is something nice about Cloud Run: it doesn’t spawn 100 instances; rather, it seems to create a total of only 2 active instances.

My theory is that Cloud Run queues the requests and takes the cold start time and a historical request latency percentile into account to decide how many instances to create. Say my container takes ~250 ms to cold start an instance and the historical p95 response time is ~5 ms; then it does a sort of calculation like “alright, within the 250 ms a new instance takes to cold start, each existing instance can serve 250/5 = 50 requests from the queue. So for 100 requests I just need 100/50 = 2 active instances”.
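In code, that back-of-the-envelope calculation (my guess, not documented Cloud Run behavior) would look like:

```python
import math

# Guessed sizing logic, using the numbers observed for my container.
cold_start_ms = 250    # time to cold start one instance
p95_latency_ms = 5     # historical p95 response time
queued_requests = 100  # the spike

# Requests one warm instance can drain while a new instance cold starts.
served_per_instance = cold_start_ms / p95_latency_ms  # 50

instances_needed = math.ceil(queued_requests / served_per_instance)
print(instances_needed)  # 2
```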

This is how I think it works, but I cannot say for sure. Nonetheless, it does something smart: it serves the spike without failing and without spawning 100 instances, creating just 2 active instances (plus 4 more inactive instances; I wonder why).

AWS Lambda

Unlike the other services listed here, a Lambda instance can only have a concurrency of 1 request. So let’s say I was sending 1 request at a time and then suddenly send 100 concurrent requests. This causes 99 new instances to be created. The new instances incur a cold start, but Lambda handles the load with no request failing.

However, Lambda introduces new constraints on the code and tech stack you use. For example, this spike would open 100 connections to your database, so you need a DB (or a layer in front of the DB, like pgbouncer) that can handle that. Also, you will observe Lambdas having cold starts even when traffic is steady, I guess because AWS shuts down some instances and creates new ones from time to time.
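The usual mitigation is to open the connection once per instance, in module scope, so warm invocations reuse it. With 100 concurrent instances that is still 100 connections total, which is where pgbouncer comes in. A minimal sketch, assuming a Postgres database and placeholder credentials:

```python
# Common Lambda pattern: connect once per instance (module scope) so
# warm invocations reuse the connection instead of reconnecting.
import psycopg2  # assumption: Postgres; bundle the driver with the function

# Placeholder connection details.
conn = psycopg2.connect(host="db.example.com", dbname="app",
                        user="app", password="secret")

def handler(event, context):
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        (one,) = cur.fetchone()
    return {"statusCode": 200, "body": str(one)}
```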

The Winner?

Before this blog post’s conclusion is taken out of context, let me first say that if your traffic increases steadily, then all the services above will work for you. Even a VPS from DigitalOcean, Linode, Scaleway, Hetzner, etc. would work. So “the winner” is really all of them. I like a regular single-instance VPS for small applications, and fly.io in particular for making multi-region deployments easy.

However, specifically for handling sudden spikes in traffic, I like the way Cloud Run auto-scales. It scales without consuming too many resources and without failing or throwing 429s. The runner-up is AWS Lambda, as it too handles all the requests quickly without failing. ECS did handle the spike, but as I mentioned before, it is risky for spiky workloads as it could get overwhelmed trying to handle a large number of concurrent requests.

I haven’t included Deno in this comparison, as Deno is JavaScript/WebAssembly-specific tech; I am picking services that can run any/many tech stacks. If I did consider JS-specific runtimes, especially the ones that don’t run a full container/VM, then Cloudflare Workers would be one of the best in its ability to handle spiky workloads. Cloudflare Workers has a 0 ms base cold start and can scale immediately to a spike in requests. However, it comes with several constraints, to name a few: a deployment package can only be 5 MB, and it doesn’t have all the APIs of Node.js, which limits what you can pick from npm.

That’s all for this post. Let me know if there are other cloud services that you think are better or similar.