Two of the sites below are driven off flat files, the third is driven off of a database. Can you spot the difference?

One of these things is not like the other

Now this is not to pick on Mark, or WordPress, or MySQL.
The reason I picked Mark, Sam, and myself is that we are all
hosted at cornerhost, and yet only one of us is having a problem,
and that problem stems from `N = 1` thinking.
In this case Mark's blog is running on
a MySQL database which is served off another machine.
You might think that running the database off another machine
would make his site more robust, but instead it actually does
the opposite.

Since the web server is on one machine and MySQL on another, the chance of Mark's site going down is:

`1-(1-P(F_1))*(1-P(F_2))`

That is, one minus the product of the probability of the Apache server machine not failing (over some set period of time) and the probability of the MySQL server not failing (over that same period). If we assume, for simplicity, that both machines have the same chance of failure, then we get:

`1-(1-P(F))^2`

Sam and I both serve our sites off local flat files, so our probability of failure is limited to that one machine, which gives us:

`P(F)`
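As a rough sketch of the two formulas above, the following Python computes the failure chance of a chain of machines that must all stay up. The function name `series_failure` and the value of `p` are illustrative assumptions, and the calculation assumes machine failures are independent.

```python
from math import prod


def series_failure(failure_probs):
    """Chance that at least one required machine fails.

    Assumes failures are independent and that the site needs
    every machine in the list to be up.
    """
    return 1 - prod(1 - p for p in failure_probs)


p = 0.01  # hypothetical per-machine failure probability over the period

single = series_failure([p])      # flat files on one machine: P(F)
chained = series_failure([p, p])  # web server plus remote database: 1-(1-P(F))^2

print(f"one machine:    {single:.4%}")
print(f"two in series:  {chained:.4%}")
```

Every machine added to the chain can only push the result up, which is why moving the database to a second box makes the site less reliable rather than more.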

When I was talking about `N = 1` systems this is exactly
the type of thing I was talking about. What if the database
was an `N > 1` system, say composed of three machines, again all
with the same failure rate, and let's assume that the web serving
was done the same way, with three machines all running a web server? Then the
site fails only if all three web server machines fail or
all three database machines fail. That gives us a chance of failure of:

`1-(1-P(F)^3)*(1-P(F)^3)`

That is, one minus the chance that neither the MySQL server cluster nor the web server cluster fails. The chance that a cluster doesn't fail is one minus the chance that all three machines in the cluster fail.

Let's plug in some numbers to see how dramatic a difference this makes. If we happen to have an outrageously high failure rate like 1% per machine, then Mark has a roughly 2% (1.99% actually) chance of failure, while Sam and I both have a 1% chance of failure. Compare that to the replicated scenario where we have three web servers and three database servers, which has a chance of failure of 0.0002%.
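The replicated case can be checked the same way. This is a minimal sketch, assuming independent failures and identical machines; `cluster_failure` and `two_tier_failure` are hypothetical helper names, not anything from a real library.

```python
def cluster_failure(p, replicas):
    """A replicated tier is down only if every replica fails."""
    return p ** replicas


def two_tier_failure(p, replicas):
    """The site is down if the web tier or the database tier is down."""
    tier = cluster_failure(p, replicas)
    return 1 - (1 - tier) * (1 - tier)


p = 0.01  # the 1% per-machine failure chance used in the text

unreplicated = two_tier_failure(p, replicas=1)  # one web server, one database
replicated = two_tier_failure(p, replicas=3)    # three of each

print(f"1 web + 1 db: {unreplicated:.4%}")
print(f"3 web + 3 db: {replicated:.6%}")
```

With `replicas=1` the function reduces to the `1-(1-P(F))^2` formula from earlier, so the same code covers both layouts.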

That's pretty dramatic. But what if we just lived in a world where
`N > 1` was the norm, and databases and web servers were just built
that way? Let's also presume that there are just two machines, like
we have today in Mark's WordPress install. What does the chance of
failure look like there? In that case, you would only fail if both
servers went down at the same time:

`P(F)*P(F)`

Using our 1% failure chance from above we get a failure chance
of 0.01%. That's quite a bit different from the same two machines
today with an `N = 1` legacy providing a 2% chance of failure.
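To put a number on that gap, here is a small sketch comparing the same two machines wired in series (either failure takes the site down) against the same two machines as redundant peers (both must fail). The variable names are illustrative, and independence of failures is assumed.

```python
p = 0.01  # the 1% per-machine failure chance from the text

# N = 1 design: web server depends on a separate database machine.
series = 1 - (1 - p) ** 2

# N > 1 design: either machine can serve the whole site.
redundant = p * p

improvement = series / redundant

print(f"series:      {series:.4%}")
print(f"redundant:   {redundant:.4%}")
print(f"improvement: {improvement:.0f}x")
```

The ratio depends on `p`: the smaller the per-machine failure chance, the larger the advantage of the redundant layout.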

Think about it: that dramatic increase in reliability, about two orders of magnitude, comes not from hardware or Moore's Law, but from designing software differently.

**N.B.**: Updated to do the math correctly for `N = 1` cases.
See the comments for the discussion. The numbers don't actually change, but that's just
an artifact of me choosing a probability of failure of 1% and not of my math actually
being right the first time.

Posted by Dave on 2007-07-31

Posted by Jess Austin on 2007-07-31

Posted by Jerry on 2007-07-31

Yes, I failed to include the case of overlapping failures, and even your analysis isn't quite correct since this is a probability of failure over a range of time, and I didn't say how long the server would be down once it failed, etc. The rough numbers are, however, good enough to get the point across.

Posted by joe on 2007-07-31

*N > 1* thinking?

Posted by Aristotle Pagaltzis on 2007-07-31

You still say "That is, the probability of the Apache server machine failing times the probability of the MySQL server failing", even though you've corrected the math.

Also I think the simplistic example is **very generous** for the Apache-httpd + MySQL case, as you are taking a single "chance of failure" percentage for each entire machine.

In reality both machines probably have separate HW, probably have maintenance periods at different times (for both software and hardware), different backup times, you have two pieces of software that need errata at different times, with that config. you might well be relying on a second network card in the Webserver (plus wires, plus switches) so it can speak to the internal network which the DB is on, unless you are insane you'll be relying on internal DNS working, relying on someone not screwing up the Apache-httpd config. (or firewall/SELinux/etc.) so it can't speak to the DB, etc. etc. etc.

And that's just off the top of my head, lumping all of those into P(F2) and that'll be much higher. So a real comparison could easily be:

P(F1) vs. P(F1) + P(F2) == 0.1% vs. 0.1% + 0.1% + 0.2% == 1 failure every 3 years vs. more than one a year

Posted by James Antill on 2007-08-01

Joe: why not? There’s no central database server machine involved. Heck, there’s not even a central database server *process* involved. So what does *N = 1* actually mean? And do your uptime probability calculations have anything whatsoever to do with it?

Hint: I think your argument consists of a series of individually valid points adding up to a non-sequitur.

Posted by Aristotle Pagaltzis on 2007-08-01

Posted by Robert Synnott on 2007-08-01

Using SQLite would be the same as Sam and I using a flat file store: we're using `N = 1` software and sticking to a scenario where the number of machines equals one. Adding more machines to a system **should** increase a system's reliability, but as of today most software is written with an `N = 1` mindset, and so moving to more machines is either not possible or makes things more fragile.

No, I don't want to take any of that into account. What I wanted to do was to show with a very broad brush that:

`P(F_1) + P(F_2) + ... + P(F_n) ≫ P(F_1)*P(F_2)*...*P(F_n)`

Which is the difference between `N = 1` and `N > 1` thinking.
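For anyone who wants to see the broad-brush inequality with numbers, here is a hedged sketch. The per-machine probabilities in `probs` are made up for illustration, failures are assumed independent, and the function names are hypothetical. Note that the plain sum is only an upper bound on the exact "any machine fails" figure; the two are close when the probabilities are small.

```python
from math import prod


def sum_bound(probs):
    """Sum of failure chances: an upper bound on 'at least one fails'."""
    return sum(probs)


def any_fails(probs):
    """Exact chance that at least one machine fails (N = 1 style chain)."""
    return 1 - prod(1 - p for p in probs)


def all_fail(probs):
    """Chance that every machine fails (N > 1 style redundancy)."""
    return prod(probs)


probs = [0.01, 0.02, 0.005]  # made-up per-machine failure chances

print(f"sum bound: {sum_bound(probs):.6f}")
print(f"any fails: {any_fails(probs):.6f}")
print(f"all fail:  {all_fail(probs):.9f}")
```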

I've now changed 'times' to 'plus'. Thanks.

Posted by joe on 2007-08-01

… `1-((1-P(F_1))(1-P(F_2)))`. This is the inverse of the probability of both being online, or the probability that at least one is offline. For the 75% chance example, that shows a much more sane 93.75% chance of either one failing. Given the 1% chance of failure, the chance of the two server system failing is actually 1.99%, not 2%. Close, but it makes a difference.

Posted by Logan on 2007-08-01

Posted by paul on 2007-08-01

Agreed, I had the math wrong, I have now updated the post accordingly. Thanks.

Posted by joe on 2007-08-02

Posted by Craig on 2007-07-31