In Defense of Optimization Work

It is common knowledge that hardware is cheap and programmers are expensive, and that most performance issues can be solved easily by throwing more and bigger hardware at them. But is it really cheaper in the long run? Is there still some room for optimization work?

"The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming."

Donald Knuth, [or Tony Hoare according to Donald Knuth, or Edsger W. Dijkstra according to Tony Hoare](https://shreevatsa.wordpress.com/2008/05/16/premature-optimization-is-the-root-of-all-evil/)

Yes, we engineers like to optimize and spend time shaving bytes and microseconds off our systems. And this is often at odds with the requirement to build features and squash bugs instead. Before we get into the cost comparison of optimization work versus hardware, allow me to reframe the issue: performance is not a goal isolated from other engineering and business issues. Like security, it is a transversal problem:

  • performance is a reliability issue, because poor performance triggers bigger and more frequent production incidents
  • performance is a user experience issue, because it directly affects how your users work with your systems
  • performance is a cost issue

And like security, it is not something we can ignore on a whim; it is part of the job.

Moreover, while we hear hero stories of optimizations involving kernel work or hand-written assembly, most performance work is actually quite simple, and boils down to tasks like:

  • adding a cache
  • adding database indexes
  • doing some work inside a database query instead of loading all the data and analyzing manually

It is actually rare (in web services) to have some optimization work that requires someone to write assembly manually or other fun tasks like this.
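As a rough illustration of how small these tasks can be, here is a minimal sketch using Python's standard `sqlite3` module and `functools.lru_cache`. The `messages` table, its columns, and the function names are hypothetical and chosen only for this example; a real service would also need to invalidate the cache when the data changes.

```python
import sqlite3
from functools import lru_cache

# Hypothetical in-memory table, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (user_id INTEGER, size INTEGER)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [(i % 10, i) for i in range(1000)],
)
# Adding an index on the filtered column avoids a full table scan.
conn.execute("CREATE INDEX idx_messages_user ON messages (user_id)")


def total_size_in_app(user_id: int) -> int:
    # Slow pattern: load every row, then filter and sum in application code.
    rows = conn.execute("SELECT user_id, size FROM messages").fetchall()
    return sum(size for uid, size in rows if uid == user_id)


@lru_cache(maxsize=1024)
def total_size_in_db(user_id: int) -> int:
    # Let the database filter and aggregate; repeated lookups hit the cache.
    # Caveat: cached results go stale if the table is modified afterwards.
    row = conn.execute(
        "SELECT COALESCE(SUM(size), 0) FROM messages WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]
```

Both functions return the same total; the second one moves the work into the query and only transfers a single value.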

The cost of not optimizing code

If we take $100,000 as average programmer salary (in the US), it comes down to around $400 per day, so $50 per hour (20-21 days per month, 8 hours per day). So spending an afternoon optimizing code would amount to $200.

Using bigger or more hardware looks like a small cost comparatively. Let’s assume adding another machine would cost $20 to $50 per month. Let’s choose $20/month. The naive calculation would show that we recoup the costs of optimization work after 10 months. So at this point, thinking 10 months in advance might not be too interesting, and the cost of the machine is not too high. But after those 10 months, it starts costing more than having an engineer look at it.
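The arithmetic above can be written down as a small sketch. The constant and function names are made up for illustration, and the 20.5 working days per month is simply the midpoint of the "20-21 days" figure.

```python
ANNUAL_SALARY = 100_000          # USD, the average salary assumed in the text
WORKING_DAYS_PER_MONTH = 20.5    # assumption: midpoint of 20-21 days
HOURS_PER_DAY = 8

daily_cost = ANNUAL_SALARY / 12 / WORKING_DAYS_PER_MONTH   # a bit over $400
hourly_cost = daily_cost / HOURS_PER_DAY                   # a bit over $50


def naive_break_even_months(work_cost: float, machine_monthly_cost: float) -> float:
    """Months after which a flat monthly hardware cost exceeds a one-off work cost."""
    return work_cost / machine_monthly_cost


# An afternoon (4 hours at the rounded $50/hour) against one $20/month machine.
months = naive_break_even_months(4 * 50, 20)
```

With the rounded figures from the text, `months` is 10, which is the naive break-even point mentioned above.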

Because here is the first difference: an optimization task is a fixed cost, paid once, while the hardware cost is recurring and accumulates month after month (if the hardware is bought instead of rented, it will be amortized until the date of replacement). We easily trick ourselves into comparing the local, monthly cost, without looking at it in the long term.

But there is another aspect of the problem: performance issues are linked to business growth. Often not directly, but through some usage metric, like a number of messages sent, or a number of searches per second. Those grow with the number of customers. But they also grow with customer usage: if everything goes well, customers will use the product more and more. Do not expect performance issues to follow business growth linearly.

Let's take as an example one of the metrics used to evaluate a startup's growth: the growth rate should be around 5-7% per week.

Let's use the number of users as growth metric, and ignore usage growth.

Assuming that one node of the application is at full capacity with the current number of customers N, we add another node.

At 5% per week, we’re around 21% growth per month (21.550625% exactly, and that's calculated as a geometric progression). We will double the number of customers in just 15 weeks!
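These two figures can be checked with a few lines of Python. This is only a sketch: the function names are made up for the example, and a month is assumed to be exactly four weeks.

```python
import math


def monthly_growth(weekly_rate: float, weeks_per_month: int = 4) -> float:
    """Compound a weekly growth rate over a month (assumed to be 4 weeks)."""
    return (1 + weekly_rate) ** weeks_per_month - 1


def weeks_to_multiply(factor: float, weekly_rate: float) -> int:
    """First whole week at which the customer count has grown by `factor`."""
    return math.ceil(math.log(factor) / math.log(1 + weekly_rate))


print(f"{monthly_growth(0.05):.6%}")   # monthly growth at 5% per week
print(weeks_to_multiply(2, 0.05))      # weeks needed to double
```

At 5% per week this prints a monthly growth of about 21.55% and a doubling time of 15 weeks, matching the numbers above.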

We can see the costs in the following graph. Red points represent the monthly cost, while blue points indicate how much we spent so far.



[Figure: monthly cost (red points) and total spent so far (blue points)]

So we would run 1 extra machine for the first 4 months, then we need to add another one. We will have passed 3N by week 24 (6 months), at which point we have already spent $200 on extra machines. So this engineering time would pay for itself in 6 months, not 10. At week 40, we have already spent $560, and we need yet another machine because we are reaching 7N.

After a year, we have reached 12N and paid $1140. We have likely run into other issues linked to the number of machines running: on-call incidents, time spent updating them, etc. (But for all of those issues, we have Clever Cloud!) So we probably have more than 12 machines doing the work, and we have spent considerable engineering time keeping them running.

Here's the equation: for a growth rate G, a number of months M, and a monthly cost C for one machine:
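One plausible way to write this cumulative cost is sketched below. It rests on assumptions made for illustration: G is taken as the weekly growth rate, a month is four weeks, one machine handles N customers, and the machine count is rounded up at the end of each month. Only the extra machines are counted.

```latex
\mathrm{spent}(M) \;=\; C \sum_{m=1}^{M} \left( \left\lceil (1+G)^{4m} \right\rceil - 1 \right)
```

With C = 20, G = 0.05 and M = 6, the terms in the sum are 1, 1, 1, 2, 2, 3, which gives $200. Other rounding or month-length conventions will give different totals for larger M.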


You can also test it in Wolfram Alpha (here for $20/month and growth at 5%)

To sum up, with the following settings:

  • $200 of work (4h at $50/h)
  • $20/month for a machine
  • 5% growth per week

We would spend $200 in new machines over the next 6 months, which is pretty short term. But there’s another way to look at it. We still need to fix the performance issue, but adding more hardware would buy us time. For the next 4 months, we would pay only $20 per month to delay the issue, and let our engineers work properly on it instead of putting out fires. Even better, now we have the means to plan hardware costs following business growth.
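The settings above can be plugged into a short simulation. This is a sketch under stated assumptions rather than the exact model behind the article's figures: a month is four weeks, one machine serves N customers, the machine count is rounded up at the end of each month, and only extra machines are billed. The function name and parameters are hypothetical.

```python
import math


def cumulative_extra_cost(
    months: int,
    weekly_growth: float = 0.05,
    machine_cost: float = 20.0,
    weeks_per_month: int = 4,
) -> float:
    """Total spent on extra machines after `months`, starting from one full node."""
    total = 0.0
    for month in range(1, months + 1):
        # Customer load relative to the starting point N, compounded weekly.
        load = (1 + weekly_growth) ** (weeks_per_month * month)
        # Assumption: one machine per N customers, rounded up.
        machines_needed = math.ceil(load)
        # Only the machines beyond the original one count as extra cost.
        total += machine_cost * (machines_needed - 1)
    return total


# Find the first month where hardware spend reaches the $200 of engineering work.
break_even = next(m for m in range(1, 61) if cumulative_extra_cost(m) >= 200)
```

Under these assumptions `break_even` is 6, in line with the 6 months mentioned above. Figures for later months depend on how months and rounding are counted, so they need not match the article's numbers exactly.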

"Hardware is cheap, programmers are expensive;"

…but performance debt comes with interest (/¯–‿・)/¯

