First off, let me apologize for everything that has happened. As Ken can probably attest, “protect the data” is a motto that I’ve been living by for some years. My second job had that written on a whiteboard in permanent marker.
So here is the situation as it sits. About 3 weeks ago I took one of our two iSCSI servers offline. It was the backup. The purpose of this was to upgrade the drive capacity of this unit, then make it the primary, and then do the same to the old one. When I say old, I don’t mean 10-year-old hardware. The upgrade was actually scheduled to go in place that next morning.
At exactly 2:00 am PST Sunday morning, the virtual server I had running as the iSCSI target initiated a check, which immediately caused the box to freeze. At around 5:00 am PST I received a call from a client about email being down. After checking into it, I had one of the techs from the colo reboot the box. It came back just fine. Mind you, this wasn’t a hardware failure (as this is a very redundant machine) but rather a software hiccup.
I spent the next several hours attempting to fix the issue. The iSCSI target is built on Sun Solaris 11 running ZFS (zpool and zfs). The underlying disk drives are virtual (4 of them, each 512 GB, a VMware limitation), creating a 2TB iSCSI storage system. It sounds small, but the underlying drives are high-speed, highly redundant drives chosen to prevent failures. The problem is that the 4 “virtual” drives that make up the 2TB share were marked as bad. Again, the underlying disks were 100% healthy.
I worked with a Solaris tech who guided me through attempting to fix this problem. In the end, the commands we worked through managed to get the drive working again, just with no data. We stopped at that point to ensure that we didn’t cause any additional damage to the unit, and we have contacted a data recovery specialist.
We are in the process of arranging the terms of the contract and shipping the hardware to the consultant. At the end of this, we’re looking at about $6,000 in total fees with no guarantee of data recovery. It will also take 10 days or so to find out whether the data is viable.
Besides the web site, I myself had amassed 10+ years’ worth of data, emails, etc. My data was on a second virtual disk array (so we’re actually talking about 2 x 2TB iSCSI shares worth of data).
This is a series of unfortunate events in that we’ve been very diligent about leasing redundant hardware (high-end Sun boxes, commercial RAID solutions, etc.) and software, and about using as many underlying protections as we can, and yet we still failed you, our community.
The hosting side of this business has generated little revenue for us over the years. We have kept it in place mostly because it breaks even, and we have a lot of friends that utilize our services (as well as a few valued customers). In 2013 alone we’ve upgraded the bulk of our hardware to plan for the future of the business and to better support the existing clients that we have. Unfortunately that’s put us in a tough situation. We will continue to attempt to recover the data until it is deemed that we cannot recover it. There is no ETA for that at this time. In 15 years we’ve never lost any significant data and have been able to recover from any outage in less than a day (worst case scenario) with 100% of the data intact. This is Murphy’s Law; the small period of time when you’re not protected is when you will fail. That was a 3 week span out of 15 years.
I have been working painstakingly to get everything up to at least a usable state (I’m literally 100 hours into a 6-day week right now). For the AA site, there will need to be some performance tweaking, and I’m going to have one of my guys on it once we get the remainder of our clients back online.
Ken has done an outstanding job with AA over the years and I know he’s put his heart and soul into it, and as a personal friend since college (a lifetime ago) it hits me hard to know that I’ve impacted him (and his users) in this way. I was watching the site when it received its 1 millionth hit after 2 years, and I was there when it hit 1 million hits per day (2005-ish?).
With all of this you have my heartfelt apology.
From the Hold Stead perspective, we will be winding down the commercial hosting. My friend’s sister company will continue to run this site (as well as a few others), as they are the ones that leased us the hardware. So this will go on…
As we will rebuild, bigger and better, we must…
AKA Gary Smith