Category: The Slightly Technical

Simple little hows and whys.

  • Hard Drive Failure Rates

    In the last 6 months we have twice had nearly simultaneous drive failures leading to service outages.  It was hard at first to grasp how something so seemingly unlikely could have happened.  When it happened a second time, it was time for some serious scrutiny.  What seemed like common sense might not be correct.

    We run servers in pairs.  A hard disk on one server has a corresponding hard disk on another server.  If one disk fails, the service can simply be powered up on the other server.  Both servers have to be down to cause an outage.

    The solution became obvious once the problem was understood.

    Google released a study of their experience with a very large population of hard disks and failures.  If you have a taste for a dry technical paper, you can find it here: Google media research.  What they found was revealing.  The data set is based on consumer-grade drives.  We use enterprise-grade drives, which have a much longer life, but the general observations will be about the same.  This summarizes failure times:

    [Figure: afr_age (annualized failure rate by drive age)]

    As expected, hard disks show a high infant mortality followed by a period of (in our case) several years of reliable service.  Then suddenly failure rates increase.  That there is a decline in failure rates at 4 years is unexpected, but a gradually increasing rate after that is just what you might expect.  There is no data to support a rise and fall like that with enterprise drives.  It may or may not happen.

    When we think about reliability, what we want to know is the likelihood of a failure event in a given time interval.  Then we can make statements (these are made-up numbers) such as: the odds of a drive failing in a server over a month’s time are 1 in 300.  If the mirrored drive in the second server has the same odds and fails independently, both drives going down in the same month is far less likely than that.  And since replacing a failed drive and re-mirroring takes only 2 days, the window in which a second failure causes downtime is shorter still; call it something on the order of 1 in 9,000.  That seems reasonable enough, but it turns out not to be correct.  The problem is the failure rate distribution.

    Many people are familiar with “the bell curve”, what in statistics is called the normal distribution.  The graph looks like this:

    [Figure: Empirical_Rule (the normal distribution, or bell curve)]

    If you tossed a coin 100 times, counted the heads, repeated that experiment a few thousand times and graphed how often each count came up, that’s what it would look like.  The peak sits at 50 heads, with the low head counts on the left and the high head counts on the right.

    Hard disk manufacturers supply a statistic meant to show product life called the mean time between failures (MTBF).  If the number were 5 years, the expectation is that most drives would last about that amount of time.  What they report generally doesn’t relate to reality very well, as the Google paper shows.  Still, it’s a useful statistic.  If the MTBF is 5 years and we charted the failures of a large population of disks, you would expect the chart to be a normal distribution with 5 years at the top of the curve.  Lacking data, my guess at the standard deviation of a set of 5 year MTBF drives would be something like 3 to 6 months.  Failure of a specific drive is random within a time frame, so it’s reasonable to expect a failure curve to look something like a normal distribution.  We are (were) working with 2 sets of hard drives all manufactured at the same time, all in exactly the same kind of server and in service for exactly the same amount of time.  What that means is that the curve is going to be much narrower at the top, with much steeper sides.  In statistical terms, the standard deviation will be a much smaller number.

    So, the obvious solution?  Add randomness.  Add new drives, but not all new drives.  The older drives have life in them yet.  Replacing all of them would be a waste of money, and it would lead right back to the same situation.  What we have done is replace half of them.  Each replication pair consists of an older drive and a newer one.  When an older drive fails it will be replaced by another older drive until we run out of them.  New drives will therefore be introduced at relatively random intervals.  This will move the top of that curve all over the place in terms of single drives.  We may not see the odds against double failures as high as 1 in 9,000, but clearly it will be a huge improvement.  It would be nice to have actual data for predictions.  We don’t, so I will have to make a guess.  Based on a lot of consideration, 1 in 1,000 seems reasonable.  It’s also a number we can live with.
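
    The effect of drive age on double failures can be sketched with a little arithmetic.  This is only an illustration under loud assumptions, not a model fitted to any data: each drive's lifetime is treated as normally distributed with the same standard deviation, the two drives in a pair fail independently, and a "double failure" means both drives fail within the 2 day recovery window of each other.  The function name and the example figures (a 135 day spread as the midpoint of the 3 to 6 month guess, a 30 day spread for a same-batch pair, a 2 year age gap) are made up for the example.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_fail_within_window(sigma_days: float,
                            window_days: float,
                            age_gap_days: float = 0.0) -> float:
    """Probability that two drives fail within `window_days` of each other.

    Assumptions (illustrative only): both lifetimes are normal with standard
    deviation `sigma_days`, the drives fail independently, and the second
    drive went into service `age_gap_days` after the first.  The difference
    between the two failure dates is then normal with mean `age_gap_days`
    and standard deviation sigma_days * sqrt(2).
    """
    spread = sigma_days * math.sqrt(2.0)
    upper = (window_days - age_gap_days) / spread
    lower = (-window_days - age_gap_days) / spread
    return normal_cdf(upper) - normal_cdf(lower)


def as_odds(p: float) -> str:
    """Format a probability as '1 in N'."""
    return f"1 in {round(1.0 / p):,}"


# Hypothetical scenarios; none of these numbers come from measurements.
same_batch_narrow = prob_fail_within_window(sigma_days=30, window_days=2)
same_age_wide = prob_fail_within_window(sigma_days=135, window_days=2)
mixed_ages = prob_fail_within_window(sigma_days=135, window_days=2,
                                     age_gap_days=730)

print("same batch, narrow curve:", as_odds(same_batch_narrow))
print("same age, wider curve:   ", as_odds(same_age_wide))
print("2 year age gap:          ", as_odds(mixed_ages))
```

    Under these assumptions, the narrower the curve, the more likely the two failures land inside the same window, and putting an age gap between the drives in a pair pushes that likelihood down sharply.  Real drives also fail for reasons that have nothing to do with age, so treat the output as a way to see the shape of the problem rather than as a prediction.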

  • Your Domain and Google Search

    We are often asked about the results when a domain name is typed directly into a Google search box. It’s helpful to understand a little bit about how Google searches work. It’s a gigantic topic and we’re only dealing with one small corner of it here.

    What Google tries to do first is find things other people have searched for. When you click on a particular result, they record the click. The idea is that since you clicked on it, the description probably was a match for what you wanted. Next time a similar search is done, your click tends to move that result up to a higher position.

    Next, people often make typing mistakes and Google attempts to correct them. In theory, this saves the customer from wasting time and needing to re-type and saves Google from wasting resources on bad searches.

    It gets a lot more complicated than that, but that’s the beginning of how it works.

    When you type a domain name into a search box, Google is likely to recognize it as a domain name. But the same logic will be applied, with results you may or may not like. I was just asked about the domain name miniatureangels.com. Google returned, “Showing results for miniature-angel.com” – NOT what was wanted. Apparently that’s a popular site.

    Unfortunately there is nothing to be done about this. We sometimes hear from customers who are upset that something like this is happening to them. There isn’t anything magical or mystical going on and it wouldn’t matter where or how your site is being hosted. Website content may have some effect. It has nothing to do with hosting at all.

    I tend to regard typing a domain name into a search box as a dumb thing to do. After all, if you are looking for the web site for a domain there is no need for a search. Just go there. The trouble is, huge numbers of people were introduced to the Internet by simply sitting down at a computer with a Google search box in front of them. They typed in what they wanted and found it and that is the end of that. The principles of least thought and least resistance have coincided and that is what they will do evermore.

    Our customer who owns miniatureangels.com wants to replace that with miniatureangelsfarm.com. Probably a good idea.

  • Domain Registrations and Name Servers

    It’s simple. A domain registration has the sole function of specifying DNS servers.  In order to find services provided for your domain name, a number called an IP address is required.  DNS servers map names to these numbers.  The DNS servers have entries like:

    BaileysKarate.com -> 216.185.152.158
    www.BaileysKarate.com -> 216.185.152.158

    They give out this information when asked.
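
    As a rough sketch of that name-to-number mapping, here is a toy lookup table in Python holding the two entries shown above.  The names ZONE, lookup and lookup_live are invented for the example.  A real DNS server does far more than this, but the core idea is the same: ask for a name, get back an address.

```python
import socket
from typing import Optional

# Toy stand-in for the records a DNS server holds.
# The two entries are the ones from the example above.
ZONE = {
    "baileyskarate.com": "216.185.152.158",
    "www.baileyskarate.com": "216.185.152.158",
}


def lookup(name: str) -> Optional[str]:
    """Return the IP address recorded for `name`, or None if there is no entry."""
    # Domain names are case-insensitive, and a fully qualified
    # name may be written with a trailing dot.
    key = name.rstrip(".").lower()
    return ZONE.get(key)


def lookup_live(name: str) -> str:
    """Ask the system's real resolver for an IPv4 address (needs network access)."""
    return socket.gethostbyname(name)
```

    Calling lookup("BaileysKarate.com") reads the toy table.  lookup_live does a real query through your computer's resolver, so what it returns depends on the DNS records at the time you run it.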

    DNS servers are controlled by hosting companies so that they can provide services on IP addresses they choose. Customers register domains and set the DNS servers as instructed by their hosting company. When they switch hosting companies, they can modify their registration to use the DNS servers from the new company. And that is really all there is to that.

    To change hosting providers, all you have to do is log in to your account with us, or wherever you bought your domain name, and change the DNS servers it is set to use. Nothing needs to get transferred. It’s just a settings change which is very easy to do.
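
    Why changing one setting is enough can be shown with a small model in Python.  Everything in it is hypothetical: the domain, the name server names and the addresses are placeholders, and real registrars and DNS servers are not Python dictionaries.  The point is only that the registration names the DNS servers, and the DNS servers hold the address.

```python
from typing import Dict, List, Optional

# The registration: domain -> the DNS servers it is set to use (hypothetical names).
registrations: Dict[str, List[str]] = {
    "example.com": ["ns1.oldhost.example", "ns2.oldhost.example"],
}

# What each hosting company's DNS servers answer (placeholder addresses).
nameserver_records: Dict[str, Dict[str, str]] = {
    "ns1.oldhost.example": {"example.com": "192.0.2.10"},
    "ns2.oldhost.example": {"example.com": "192.0.2.10"},
    "ns1.newhost.example": {"example.com": "203.0.113.20"},
    "ns2.newhost.example": {"example.com": "203.0.113.20"},
}


def resolve(domain: str) -> Optional[str]:
    """Follow the registration to its DNS servers and return the first answer."""
    for server in registrations.get(domain, []):
        answer = nameserver_records.get(server, {}).get(domain)
        if answer is not None:
            return answer
    return None


def switch_host(domain: str, new_servers: List[str]) -> None:
    """Change hosting providers: only the registration's server list changes."""
    registrations[domain] = list(new_servers)
```

    Before the change, resolve("example.com") returns the old host's address.  After switch_host("example.com", ["ns1.newhost.example", "ns2.newhost.example"]) it returns the new one, and nothing about the registration itself has moved anywhere.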

    Of course this assumes that you have access to and control your domain registration. Sometimes hosting companies take control of the registration to make it much harder for their clients to leave them. That’s a subject for a different post.  At Deerfield Hosting, we always register domains in our customers’ names.  If you paid for your domain registration, you should control it!