Author: dennis

  • Hard Drive Failure !

    100% uptime is impossible. All we can do is get close.

    Last week 2 hard disks failed (simultaneously !) in node 2 of cluster 2. Besides being backed up in real time on a second cluster node, each node is running RAID-5 with hot swap drives. A single drive can fail and be replaced with no down time. But if a second drive fails, it’s fatal.
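
    For the curious, here is a minimal sketch of why one failed drive is survivable and two are not. It assumes the textbook XOR parity scheme behind RAID-5; the block contents and the names xor_blocks, data and parity are made up for illustration, and a real controller also rotates parity across the drives, which is left out here.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks standing in for three drives (made-up contents).
data = [b"node", b"two!", b"disk"]

# The parity block is the XOR of all the data blocks.
parity = xor_blocks(data)

if __name__ == "__main__":
    # One drive lost: XOR of the survivors plus parity gives the block back.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

    With two blocks missing, the same XOR only yields the two lost blocks mixed together, and nothing is left to separate them. That is the fatal case.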

    It wasn’t a clean failure and the performance of the fail-over system and failure reporting was less than perfect. Initial symptoms seemed to point at network card failure. The cluster software did fail over properly, but we had to clean up some databases. Some sites were not in good shape for 2 or 3 hours.

    The next day, the remaining node was bombarding us with emails about the failed node. I had to shut everything down and power it back up outside the cluster. Total down time with this was probably 10 minutes. This was necessary because otherwise we could easily miss emails about failures of which we are not yet aware. The signal-to-noise ratio was way too low.

    Monday we replaced all the hard drives in the failed node, re-installed the operating system and all the cluster software and began the process of manually syncing the drives from the node which was still in operation. Synchronization completed overnight last night, Tuesday night.

    This morning at 5 AM I began the task of moving services back into the cluster. I will spare you the details, but it’s a nasty and error-prone process. All the safeguards, checks and balances in the cluster software really get in the way while doing this. Sites were up and down several times. My guess at total down time today is something like 30 minutes.

    Everything is completely back to normal now.

    This was the first major real world test of the clustered live fail-over system we put in place 18 months ago. I’m not totally happy with it. Previous tests were done by pulling plugs – total failures. In that situation, performance was flawless. Down time was so short no one noticed. Real world failures are usually messy like this one was. The fail-over system worked, but it needed a little help. It was still a big win compared to re-installing a server and restoring backups. That could take a day or more.

    There is a recurring pattern with problems like these. There is a period of a few days or a week during which problems come up and quickly or gradually get ironed out. In retrospect these periods feel much longer than they really were, because the worry and frustration when a server is down are intense. An hour is remembered as half a day. Related problems recurring a few times over several days are remembered as lasting a week or more. It’s human nature. Problem periods are followed by long periods, many months or a year, during which everything runs smoothly.

    If you look at our uptime over longer periods it’s actually very good. It’s something over 99.99%. My perfectionist nature often makes me lose sight of that. But nobody does any better so it’s worth a reminder.
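
    To put a percentage like that into minutes, here is a small arithmetic sketch. It assumes a 365-day year as the measuring period; the function names are made up for illustration and this is not how we actually track uptime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed at the given availability percentage."""
    return period_minutes * (1 - percent / 100)

def availability_percent(downtime_minutes, period_minutes=MINUTES_PER_YEAR):
    """Availability percentage implied by a given amount of downtime."""
    return 100 * (1 - downtime_minutes / period_minutes)

if __name__ == "__main__":
    for percent in (99.9, 99.99, 99.999):
        budget = downtime_budget_minutes(percent)
        print(f"{percent}% over a year allows about {budget:.1f} minutes down")
```

    At 99.99% the yearly budget works out to roughly 52.6 minutes, which is why every bad hour matters so much.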

  • 10 YEARS !!

    The domain names deerfieldhosting.com and deerfieldhosting.net were registered on November 16th, 2002.

    A lot has changed since then.

    We had a reseller account on a server which was overloaded and provided terrible service.  BUT, the pay per click advertising was working.  The initial $300 was getting recycled over and over.  The revenue from new signups was slightly more than the advertising cost.  Since it was clear that the business model was working we took the plunge and rented our first dedicated server.

    We were running a control panel called Ensim, having taken a cue from the former account.  It was (is?) so bad that the number of accounts you could add to a server was constrained not by how busy the sites were, but by the overhead imposed by the control panel itself.  And I mean by a factor of 10.  The control panel hogged up server resources at a rate easily 10 times the sites.

    We limped along with Ensim for several years, adding server after server to get good performance.  What a relief it was to ditch it and move to cPanel and FreeBSD.  It felt like I’d gone to heaven.

    Of the first 5 customers to sign up, we still have 3.  The other 2 are no longer on the Internet.  I like to think that means we are doing something right.

    Anyway, Happy Birthday Deerfield Hosting !!

  • Domain Registrations and Name Servers

    It’s simple. A domain registration has the sole function of specifying DNS servers.  In order to find services provided for your domain name, a number called an IP address is required.  DNS servers map names to these numbers.  The DNS servers have entries like:

    BaileysKarate.com -> 216.185.152.158
    www.BaileysKarate.com -> 216.185.152.158

    They give out this information when asked.
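
    As a rough sketch, you can picture a DNS server as a lookup table. The toy model below uses the two entries above; the names RECORDS and resolve are made up for illustration, and a real DNS server holds several record types and speaks a network protocol, none of which is shown here.

```python
# Toy model of a DNS server's name-to-address table (illustration only).
RECORDS = {
    "baileyskarate.com": "216.185.152.158",
    "www.baileyskarate.com": "216.185.152.158",
}

def resolve(name, records=RECORDS):
    """Return the IP address stored for a name, or None if there is no entry.

    Domain names are not case sensitive, so the name is lowercased first.
    A trailing dot (the fully qualified form) is also accepted.
    """
    return records.get(name.rstrip(".").lower())

if __name__ == "__main__":
    print(resolve("BaileysKarate.com"))
    print(resolve("www.BaileysKarate.com"))
    print(resolve("no-such-name.example"))
```

    For a real lookup, Python's standard socket.gethostbyname(name) asks the system resolver the same question.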

    DNS servers are controlled by hosting companies so that they can provide services on IP addresses they choose. Customers register domains and set the DNS servers as instructed by their hosting company. When they switch hosting companies, they can modify their registration to use the DNS servers from the new company. And that is really all there is to that.

    To change hosting providers, all you have to do is log in to your account with us or wherever you bought your domain name and change the DNS servers it is set to use. Nothing needs to get transferred. It’s just a settings change which is very easy to do.
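
    Here is a toy model of that settings change, under loud assumptions: the name server names, the domain example.com and the addresses are all placeholders, and real registrations list at least two name servers. The point is only that the registration stores which server to ask, and each hosting company's server gives its own answer.

```python
# Hypothetical name servers, each run by a different hosting company and
# each holding its own answer for the same domain (placeholder addresses).
NAME_SERVERS = {
    "ns1.oldhost.example": {"example.com": "192.0.2.10"},
    "ns1.newhost.example": {"example.com": "198.51.100.20"},
}

# The registration stores only which name server to ask.
registrations = {"example.com": "ns1.oldhost.example"}

def lookup(domain):
    """Ask whichever name server the registration currently points at."""
    server = registrations[domain]
    return NAME_SERVERS[server].get(domain)

def change_name_server(domain, new_server):
    """The whole 'move': update one setting in the registration."""
    registrations[domain] = new_server

if __name__ == "__main__":
    print(lookup("example.com"))   # answer from the old host
    change_name_server("example.com", "ns1.newhost.example")
    print(lookup("example.com"))   # answer from the new host
```

    Nothing in the registration itself moved. Only the pointer changed.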

    Of course this assumes that you have access to and control your domain registration. Sometimes hosting companies take control of the registration to make it much harder for their clients to leave them. That’s a subject for a different post.  At Deerfield Hosting, we always register domains in our customers’ names.  If you paid for your domain registration, you should control it!