Serverless at Scale: Lessons from 200M Lambda Invocations (adadot.com)
63 points by thunderbong on Nov 12, 2023 | 73 comments


> Serverless architecture promises flexibility, infinite scalability, fast setups, cost efficiency, and abstracting infrastructure, allowing us to focus on the code.

The only things I know serverless architecture promises are big bills and a steady income for a cloud provider. I'd be happy to see a serverless setup that can't be blown away by a (way cheaper) small/medium-sized VM.


Personally I found serverless incredibly useful for very simple tasks, tasks where even a t2.micro would be overkill. I have a couple of websites that are mostly static but still occasionally have to do "server stuff" like sending an email or talking to a database. For those cases a Lambda is incredibly useful because it costs you nothing compared to an EC2 instance and it's less maintenance. But for bigger setups I agree it would be easier to just host on a small-to-medium VM (and I say that as someone who's got an entire API of 200+ endpoints deployed in Lambda).


I’ve seen many claim this, but I just haven’t experienced it in production and I’m not sure why. We use Lambda at my work and we serve several million users a month. The Lambda bill comes out to around $200 a month, not exaggerating. API Gateway ends up being more expensive every month than Lambda.

I’m not asking this to be snarky; I truly want to know what is making people hit high prices with Lambda. Is it functions that are super computationally intensive and require queued Lambda invocations?


In my business, we heavily use FaaS, and I agree that it seems economical. It's a little surprising, though: AWS Lambda costs 5 to 6 times more per second than an EC2 instance with equivalent memory and CPU. Our application simply doesn't need much CPU time. Other aspects (database, storage) are more expensive.
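For a rough sense of where that multiple comes from, here's a back-of-the-envelope (list prices from memory, late-2023 us-east-1, so treat both numbers as assumptions and check the current pricing pages):

    # Lambda vs EC2 price per GB-second, roughly
    lambda_per_gb_s = 0.0000166667             # Lambda (x86), $ per GB-second
    t3_medium_hourly = 0.0416                  # on-demand t3.medium: 2 vCPU, 4 GB
    ec2_per_gb_s = t3_medium_hourly / 3600 / 4
    print(lambda_per_gb_s / ec2_per_gb_s)      # ~5.8x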

The main advantage, though, is predictability of operations. The FaaS services "just work". If we accidentally change one endpoint so that it consumes too many resources, it doesn't affect anything else. It's great for allowing fast changes to new functionality without much risk of breaking mature features.


You have to manage a VM. For example, ensuring that the VM has an up to date OS. If you don't care about that, ok, but that's something that Lambda offers.

Ephemerality is a plus as well. Just from a security standpoint, having an ephemeral system means persistence is not possible.


These things can be easily automated, and many Linux LTS distros ship with automatic updates preconfigured.

> Just from a security standpoint, having an ephemeral system means persistence is not possible.

You still have to persist something somewhere, and there is a higher chance someone will find a SQL injection or an unfiltered POST request in your app than hack SSH access to the box. If someone wants to do real damage, they'd just continuously DDoS that serverless setup, and the cloud provider will kill the company with the bill.


> These things can be easily automated,

Is this something people are out there believing? That patching is something that's easy to automate? I find that kind of nuts, I thought everyone understood that this is, in fact, the opposite of easily automated...

> You still have to persist something somewhere, and there is a higher chance someone will find a SQL injection or an unfiltered POST request in your app than hack SSH access to the box.

"This entirely separate attack exists, therefore completely removing an entire attack primitive has no value" - how I read this comment.


> "This entirely separate attack exists, therefore completely removing an entire attack primitive has no value"

It has value, but it's also true that trusting a cloud provider's serverless infrastructure introduces its own set of vulnerabilities, for various reasons.

eg: https://sysdig.com/blog/exploit-mitigate-aws-lambdas-mitre/

Reading your comments, I get the impression that you are used to dealing with clients whose infrastructure management skills are lacking, and they are making a mess of things.

While serverless infrastructures certainly eliminate a range of vulnerability classes, their adoption alone is unlikely to be sufficient to secure platforms that are inadequate for the threats they face.

At the end of the day, someone has to put in the work to ensure that things are patched, safe, and secure, whether the computing model is serverless or not.


> Reading your comments, I get the impression that you are used to dealing with clients whose infrastructure management skills are lacking, and they are making a mess of things.

I mean, I worked at Datadog when this happened: https://www.datadoghq.com/blog/engineering/2023-03-08-deep-d...

Multi-day outage because of an apt update.

Not the only one I've seen, and it's by no means the only issue that occurs with patching (extremely common that companies don't even know if they're patched for a given vuln).


Patching is easy to automate if you’re familiar with managing Linux boxes at scale, and have a test environment.

Ansible on a cron, and the pipeline goes to prod if the test environment passes.

Or unattended-upgrades in test, which fires a job to prod if it passes.

Or a continuous build process with Packer to replace running instances once they pass.

If you have certain things that can’t tolerate sudden downtime (a DB, etc.) then you need to know how to mask/hold those.
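As a minimal sketch of the Ansible-on-a-cron flavor (the playbook below is illustrative only, not a drop-in; adjust hosts and checks to your inventory):

    # patch.yml -- run against the test inventory first; the cron'd pipeline
    # promotes to prod only if the test run and its checks pass.
    - hosts: all
      become: true
      tasks:
        - name: Apply pending package updates (Debian/Ubuntu)
          ansible.builtin.apt:
            update_cache: true
            upgrade: dist

        - name: Check whether a reboot is required
          ansible.builtin.stat:
            path: /var/run/reboot-required
          register: reboot_flag

        - name: Reboot if the update asked for it
          ansible.builtin.reboot:
          when: reboot_flag.stat.exists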

All this to say, it’s easy if you already know the footguns. But IMO, if you don’t know them, you don’t really have any business running Linux boxes in prod.


I guess this answers my "do people seriously believe this" question.


I disagree; it is much, much more likely that a former dev would leak an SSH key than that someone would bother to find a SQL injection.


That scares me away from using things like Firebase or serverless. I'm trusting the party who bills me to protect me from overbilling.


sudo yum update, once a week


This really doesn’t work in practice. We brought down our whole compute cluster once when someone ran update and it changed some stupid thing.

It will happen to you sooner or later also. Updates are always out of band for this reason.

That’s why everybody does builds and isolates updates to that process.


I feel like we’re close to that with automated testing and rollbacks. Still, it seems like a ton of complexity for what feels like a fundamental need


> We brought down our whole compute cluster once when someone ran update and it changed some stupid thing

I've seen this happen twice now, as well.


If it's one VM for a personal website, that'll be fine. Good luck explaining that to a SOC 2 auditor, or managing it across a fleet.


Nooo I can't run that, I need to pay for Bezos' 3rd yacht!


That is not nearly enough. You need to make sure the system still works after the update, so you need to carefully control all the versions that go in and test them in lower environments. Also, for any serious application, you will need to do this multiple times, across hundreds or thousands of machines, even at small companies.

I am on your side, actually, I think managing machines is better than serverless, but it's not that easy.


a recipe for getting exploited by a supply chain attack


A reasonably priced serverless (kinda) setup is possible on Fly.io.

Fly charges for a VM by the second, and when a VM is off, RAM and CPU are not charged (storage is still charged). They also allow you to easily configure shutting down machines when there are no active requests (right from their config file `fly.toml`), and support a more advanced method which involves your application essentially terminating itself when there's no work remaining, which kills the VM. When a new request arrives, it starts back up.
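A minimal sketch of what that looks like in `fly.toml` (app name and port are placeholders; the option names are the ones in the docs linked below, but double-check the current syntax):

    # fly.toml -- scale to zero when idle, start machines on incoming requests
    app = "my-app"

    [http_service]
      internal_port = 8080
      auto_stop_machines = true     # stop the machine when traffic goes quiet
      auto_start_machines = true    # start one again on the next request
      min_machines_running = 0      # allow scaling all the way down to zero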

Here are the docs [0]. And here's a blog post on how to terminate the VM from inside a Phoenix app for example [1].

So essentially, you can write an app which processes multiple requests on the same VM (so not really serverless), but also saves costs when it's not in use (essentially the promise of serverless).

[0] https://fly.io/docs/apps/autostart-stop/

[1] https://fly.io/phoenix-files/shut-down-idle-phoenix-app/


That’s an on-demand server, not serverless.


I once migrated a monitoring task from a smallish VM to GCP Cloud Functions triggered by the Cloud Scheduler. Nothing super fancy; imagine some code that runs on a cron once per hour, collects a bunch of metrics on various things then writes the results somewhere. Straightforward but takes a couple minutes to execute and needs a good chunk of memory while it's running. Costs went from about $100/month for the VM size we needed to $0.12/month for the Cloud Functions version. Plus there was no longer a VM that needed to be secured, updated, monitored, etc. That aspect was arguably the much bigger savings than the basic VM cost.

My current company is running entirely on Cloud Run. Not quite as "serverless" as pure Functions or Lambdas, but we have zero VMs or hardware that we manage, so I feel like it counts. We don't do huge amounts of traffic (and don't need to), but it's not trivial either and it's very spiky. The Cloud Run part of our costs is almost negligible (dominated by the database and storage/network costs by several orders of magnitude). With that we get easy deploys and rollbacks, auto-scaling, ephemeral preview environments for every PR, and a simple security story (when the security questionnaire spreadsheets come around, I get to just say "not applicable" and skip entire sections on host-based security, SSH keys, OS updates, etc). And it's basically just a standard Docker image for the app, so if we ever felt it would be more cost-effective to run it on a VM or K8s cluster, it wouldn't be that difficult.

I agree that not everything is better with serverless, but there are some things where it's just a vastly better fit.


A use case where you’re executing arbitrary code provided by users and you don’t want to have to maintain the environment for doing so (reliability, security boundaries, etc).


Your devs would need one such VM each for testing. And at the very least you’ll need 2 VMs: staging and production.


You can run efficient VM-like distro containers inside VMs (eg. LXD), and if necessary VMs inside VMs. So you only need one rented VM at the top, if your performance needs are low.

It probably doesn't make sense economically, due to the cost of managing the infra vs the cost of more top-level VMs from a provider, but you can certainly segregate prod, staging, and independent spaces for multiple developers (or yourself on different projects) on a single rented VM if you want to.


Of course this can be done in many different ways. The point is with serverless it is much easier to achieve.


I think this does a decent job of talking about the tech without hyping it up or shitting on it, just speaking matter-of-factly. They seem to have found a decent fit for the tech, are able to recognize where it works and where it doesn't, and are still able to 'diversify' to other things. Good post, thanks for sharing.


In other words, they had an average of 6.3 lambda invocations per second.

Why make it sound so sensational? I did much more than that on a single Xeon machine.


I thought lambdas are practical when they're called rarely, with an occasional surge of traffic? So that you don't pay for servers while they're unused, and can still withstand the occasional Black Friday traffic.

If a lambda is called 6 times per second, I suspect the underlying VMs/containers that power lambdas are rarely shut down (I don't know how AWS works, but that's how it works in another cloud provider I'm familiar with - they wait a little for new requests before shutting down the container). So they might as well have just used an always-on server.

I also wonder why their calculations show that 6 rps (that's what their "17 million monthly lambda invocations" really means) would require 25 servers. We have a single mediocre VM which serves around 6 rps on average as well, without issues... Although, of course, it all depends on what kind of load each request carries. We don't do number crunching and most of the time is spent in the database.


> why their calculations show that 6 rps (that's what their "17 million monthly lambda invocations" really means) would require 25 servers

I'm guessing that it has something to do with the average job taking 15 minutes. 6 rps represents 6 jobs being created per second, but each one takes 15 minutes to run to completion. Another way to look at it: each second, 90 minutes of lambda work is created.

If you consider 15 minutes the rolling window, which looks like a fair assumption based on the graph provided, there could be up to 5400 (15 min x 60 sec x 6 rps) functions running at once. Working backwards, 25 medium instances provide (25 instances x 4 GB mem) a 100 GB memory pool, or 100,000 MB. That leaves around 18 MB for each of those 5400 jobs, if you don't consider OS resource overhead.
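As a quick sanity check on those figures (all the numbers are the ones quoted above, nothing new):

    arrival_rate = 6                  # new jobs per second
    job_duration_s = 15 * 60          # each job runs for ~15 minutes
    in_flight = arrival_rate * job_duration_s        # 5400 concurrent jobs

    pool_mb = 25 * 4 * 1000           # 25 medium instances x 4 GB each
    print(in_flight, pool_mb / in_flight)            # 5400 jobs, ~18.5 MB per job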

Looking at averages in this situation very possibly can give a warped perception of reality, but 25 instances doesn't seem out of the realm of possibility. I'm sure they have much more relevant metrics to back up that number as well.

Whether the functions really need this much time to run is another issue entirely, and hard to answer with the information given.


If they’re called rarely, they have poor latency due to cold start times.

IMO lambdas are most practical when someone else in the org is responsible for spinning up VMs and your boss is in a pissing match with their boss, preventing you from getting any new infrastructure to get work done. The technical merits rarely have anything to do with it.


"it costs 2x as much but doesn't require me to navigate my messed up IT org" is an underappreciated motivation for a lot of architectures you see people pump.

Your IT org political problems may not match, and therefore their architecture may not either. Also there's the question of what problem you are trying to scale at what scale, with what resiliency/uptime requirements, etc...


Agreed - I worked at a company that couldn't get some expensive security tool approved so they just bought it off AWS Marketplace. The AWS bill was high enough that the increase wasn't questioned.


I've worked at companies doing $2B in profits that had multi-month arguments about $100k on-prem production server refreshes due to more customers doing more transactions.

And yet I've been at companies 80% smaller/poorer who didn't notice $60k development environment costs because they were blended into a gigantic AWS bill.

The incentives are very very bad.


Mh, we've been looking at e.g. OpenFaaS to implement rare, or slow and resource-intensive, processes as functions so they don't have to run all the time. Think of customer provisioning, larger data exports, model training and such things. Here, a slow startup might add a few seconds to a minutes-long process.

But our outcome is: outside of really heavyweight processes like model training, it's a lot of infrastructural effort to run something like this on our own systems, as opposed to just sticking that code into a rarely called REST endpoint in some application we're already running anyway. We'd need a much larger volume of somewhat rarely executed tasks to make it worth running.


Yes, but that wouldn’t require so many hoops to jump through to make work, and it would probably cost a lot less.

You have to use lambda so you can overcome artificial engineering constraints so that you can write blog posts.


Most of our apps, on a single t3a.nano (about $3/mo), can handle about 250 req/sec in stress tests. In sluggish Python, no less. People don't seem to understand modern compute speeds.


Does your app do more or less work?


They claim to need 25 medium or 6 xlarge EC2 instances to handle ~17M monthly invocations, which seems insane. I don't know everything they're doing under the hood, but I'd expect to be able to handle billions of requests with that much hardware considering the product offering.


I'm similarly mystified. I have first-hand experience with that volume of traffic...

Per day. Per host. In 2009. On Python.

How is everyone making everything so slow?


A cheap VPS could output much more, for 5 dollars a month.


Xeon has a long history, but even around 2005, 50 rps from a single box running a Perl web app was not considered high load.


They talk about lambdas that run into the max runtime of 15 minutes. For a single invocation. They go to great lengths talking about cron jobs, background jobs triggered by events, etc. I highly doubt the bulk of the lambdas they talk about are simple API calls. Otherwise yes, 6-7 rps is peanuts, if we’re talking about API calls. And since the very first point of the article is them highlighting what goes to Lambda and what does not (public API calls go to a dedicated box), I think it’s safe to say those 6 invocations/sec are definitely not API calls.

TL;DR: your Xeon box is their always-on API box.


The biggest lesson I learned when I was in an org that started using serverless heavily (because it's the future!) is that it's an unmaintainable mess. You end up with code spaghetti, but now it's split across 100 repos that might be hard to find, and designed purely on architecture diagrams.

From what I can see it's basically recreating mainframe batch processing, but in the cloud. X happens, which triggers job Y, which triggers job Z, and so on.


That sounds like microservices in general rather than Lambda the AWS service specifically. I've seen the same unsustainable mess with k8s-crazy teams.

The lesson I've learned instead is to start boring and traditional, then use serverless tech when you hit a problem the current setup cannot solve.


I agree that it's like microservices, but the problem is turned up to 11 with serverless. 1 microservice now becomes 10 lambdas. The issue is fundamentally one of discovery, and as teams churn out more and more functions that aren't all in the same repo, you're bound to lose track of what's happening.


Agreed. From what I have seen, lambda functions are typically even smaller than microservices, and you end up with even more of them.


Serverless has its faults, but spaghetti code spread across 100 repos is definitely a "user error"...


How do you solve the discoverability issue when there are 1000 serverless functions written by 10 different teams, then? Serverless worsens the problem of having knowledge of the entire system, I think, and I don't think it's even solvable.


Lambdas are rarely entirely standalone, they support a larger service or glue services together.

Creating one repo per Lambda is going to make things messy of course, just as breaking every little internal library out into its own repo would.

Regardless of the system or what it runs on, it's an easy trap to fall into but it's absolutely solvable with some technical leadership around standards.


> Lambdas are rarely entirely standalone, they support a larger service or glue services together.

I wish that was the case in real life. Unfortunately, the trend I've been noticing is to run anything that's an API call in Lambda, and then chain multiple Lambdas in order to process whatever is needed for that API call.


One serverless function is effectively an HTTP router that knows how to call the appropriate code path to reach the 1000 handlers.
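Roughly like this (a sketch with made-up paths and an API-Gateway-proxy-style event shape, not anyone's production code):

    # One Lambda as the front door: route on the request path and keep the
    # handlers in one codebase, instead of one deployed function per endpoint.
    def handle_users(event):
        return {"statusCode": 200, "body": "users"}

    def handle_orders(event):
        return {"statusCode": 200, "body": "orders"}

    ROUTES = {
        "/users": handle_users,
        "/orders": handle_orders,
    }

    def lambda_handler(event, context):
        path = event.get("rawPath") or event.get("path", "")
        handler = ROUTES.get(path)
        if handler is None:
            return {"statusCode": 404, "body": "not found"}
        return handler(event)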


  For our region that limit is set to 1000. That might sound like a lot when you start, but you quickly realise it’s easy to reach once you have enough Lambdas and you scale up.

You can file a support ticket to have that limit raised.

https://docs.aws.amazon.com/servicequotas/latest/userguide/r...
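The raise can also be requested self-service through Service Quotas, something like this (I don't remember the exact quota code for "Concurrent executions", so the one below is a placeholder - look it up first):

    # find the quota code for Lambda "Concurrent executions"
    aws service-quotas list-service-quotas --service-code lambda

    # then request the increase (quota code and target value are placeholders)
    aws service-quotas request-service-quota-increase \
        --service-code lambda \
        --quota-code L-XXXXXXXX \
        --desired-value 3000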


That's really annoying though. Why should I have to go out of my way to increase capacity, given that I'm paying for it anyway?


Generally these limits exist so customers don’t accidentally spend more than they intend to — e.g. implementing a sort of infinite loop where Lambdas call each other constantly. Sounds implausible but I’ve seen that more than once!


IMO that's not why they _really_ do it. They have limits on everything because even at their scale they can't instantly accommodate your need to suddenly scale, or they need to prevent "noisy neighbor" situations where your sudden excessive usage impacts others' workloads. They still have to do relatively short-term capacity planning to accommodate you. Like, I work for only a medium-large company, and AWS has quoted us lead times of _weeks_ to make the instances we need for a workload available. We only needed 200-300 EC2 instances, and they weren't even super unusual types. I think their "infinite scaling on a dime" claims are pure marketing jibber-jabber.


> Sounds implausible but I’ve seen that more than once!

The textbook example of this going wrong is a lambda that is invoked on uploading to S3 that writes the result to S3. There's even an AWS article on it - [0]

[0] https://aws.amazon.com/blogs/compute/avoiding-recursive-invo...
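In miniature the anti-pattern looks something like this (a sketch with schematic bucket/key handling; the usual fixes are writing to a different bucket or prefix, or filtering the event notification):

    import boto3

    s3 = boto3.client("s3")

    def transform(body):
        return body  # placeholder for whatever the function actually does

    def lambda_handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            obj = s3.get_object(Bucket=bucket, Key=key)
            out = transform(obj["Body"].read())
            # Writing back into the same bucket/prefix that triggered this
            # function fires the trigger again: an infinite, billable loop.
            s3.put_object(Bucket=bucket, Key=key + ".out", Body=out)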


We actually got an email from AWS recently at work that said “hey! Your lambda writes to a queue that invokes the same lambda, that seems wack”. We need it that way, but it’s enough of a problem that they built a way to detect it automatically.


I might believe this if AWS allowed customers to specify their own self-imposed billing limit. Do they have that feature yet?


I understand the sentiment behind your frustration - but it's worth noting that these support tickets are usually answered really quickly.

Specifically as it relates to Lambdas there's solid rationale behind these limits, but I agree that in many other cases the limits seem arbitrary and annoying.


The quotas are there for one good reason: to stop the system from running wild and consuming way too much.

- For limited resources like IPs, it avoids one customer eating all the stock. Yes, he’s paying for them, but other customers wouldn't be able to get any anymore, generating frustrated users and revenue loss.
- For most other "infinite stock" resources, it avoids the bill exploding. It’s good for the customer, but also for the provider, as they’re sure to be paid rather than take a billing decline or suck up all of a startup’s money.


One of the official reasons for the quota is to protect consumers from shooting themselves in the foot when they configure something incorrectly and start using the maximum available autoscaling resources, which quickly makes the bill explode.


Given that AWS still has no billing caps (despite it being one of the most requested features), you're exposing yourself to uncapped downside.

In addition to lambdas being a poor architectural choice in most cases, that is.


AFAIK you can pretty easily cap the number of concurrent lambda executions. Of all of AWS's services, Lambda is probably the easiest one to configure limits on.


For lambdas in particular, you can set reserved concurrency, which is the maximum number of instances of a particular lambda that can run concurrently at any point in time: https://docs.aws.amazon.com/lambda/latest/dg/configuration-c....
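For example (function name is a placeholder):

    aws lambda put-function-concurrency \
        --function-name my-function \
        --reserved-concurrent-executions 50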


The name serverless is misleading. Of course the functions, written in programming languages, run on real physical servers in data centers.

This should be called Function as a Service. Acronym: FaaS.


This _is_ called FaaS generally.

Serverless is kinda confusing though. Whenever I say I work with serverless functions, I always have to explain that there are still servers involved, we just don't manage them.


> 10GB is hardly enough. Once you import Pandas you’re on the limit. You can forget Pandas and scipy at the same Lambda.

This sounds way off to me. 10 GB to install a Python library?


I also thought this seemed extremely odd.

Here is an unoptimized example, built on an M1 Mac:

    $ cat <<EOF > Dockerfile
    FROM python:3.12-slim-bookworm
    RUN apt-get update && apt-get install -y python3-pip
    RUN pip install pandas
    LABEL "name"="python-pandas"
    ENTRYPOINT ["python3"]
    EOF

    $ docker image ls -f 'label'='name'='python-pandas'
    REPOSITORY               TAG       IMAGE ID       CREATED         SIZE
    sgarland/python-pandas   latest    89e31f6eb83d   9 minutes ago   764MB
A more optimized version:

    $ cat <<EOF > Dockerfile
    FROM python:3.12-slim-bookworm
    RUN apt-get update && \
        apt-get install -y --no-install-recommends python3-pip && \
        pip install pandas && \
        apt-get purge -y --autoremove python3-pip && \
        rm -rf /var/lib/apt/lists/*
    LABEL "name"="python-pandas"
    ENTRYPOINT ["python3"]
    EOF

    $ docker image ls -f 'label'='name'='python-pandas'
    REPOSITORY               TAG       IMAGE ID       CREATED          SIZE
    sgarland/python-pandas   smaller   102308842b88   4 seconds ago    342MB
    sgarland/python-pandas   latest    89e31f6eb83d   27 minutes ago   764MB
Even adding in scipy didn't crack 500 MB:

    $ docker image ls -f 'label'='name'='python-pandas'
    REPOSITORY               TAG       IMAGE ID       CREATED          SIZE
    sgarland/python-pandas   scipy     808535284f03   3 minutes ago    497MB
    sgarland/python-pandas   smaller   102308842b88   9 minutes ago    342MB
    sgarland/python-pandas   latest    89e31f6eb83d   36 minutes ago   764MB
I'm not sure how they managed 10 GB. Here's the non-slim version, with no optimizations (this is much larger because `python3-pip` has the system default Python interpreter as a dependency, so this installs Python 3.11 into the image):

    $ cat <<EOF > Dockerfile
    FROM python:3.12-bookworm
    RUN apt-get update
    RUN apt-get install -y python3-pip
    RUN pip install pandas scipy
    LABEL "name"="python-pandas"
    ENTRYPOINT ["python3"]
    EOF

    $ docker image ls -f 'label'='name'='python-pandas'
    REPOSITORY               TAG       IMAGE ID       CREATED          SIZE
    sgarland/python-pandas   bigger    f8f98e9a241c   8 seconds ago    1.44GB
    sgarland/python-pandas   scipy     808535284f03   3 minutes ago    497MB
    sgarland/python-pandas   smaller   102308842b88   9 minutes ago    342MB
    sgarland/python-pandas   latest    89e31f6eb83d   36 minutes ago   764MB


> serverless.. as long as we have enough money to throw on it

This one speaks truth.


The conclusions are what is already widely understood about Lambdas. They could have just researched the topic up front and chosen a better architecture.



