RAID is not a backup solution, and that has just been proven again! I had been planning to write about my 48-hour experience from July 22, 7:17 to July 24, 7:23 GMT -5, but couldn't really find the time until now. All users who were on the Hemonto server should be aware of the recent issue we faced with our RAID array. This post is simply to elaborate on how we handled the situation.
We started using CDP 3 Enterprise Edition for our backups in June. Nearly 75% of our servers are now on CDP 3 and 25% are still on CDP 2. Since deploying CDP 3, we had been seeing some interesting file system issues. It looks like CDP 3 mounts the file system each time it starts hot-copying the snapshot. After a couple of backups we were seeing more and more orphan inodes being deleted from the Hemonto server every time CDP 3 tried to back it up, which raised a small alarm. I suspected some sort of file system corruption, but before booting into the rescue kernel and fscking the system, I decided to check the RAID status first. All of the drives were reporting optimal, and there was no ticket from SoftLayer regarding RAID maintenance (RAID alerts are usually sent to SL technicians automatically). So I went deeper and checked the controller logs. They reported nearly 37K errors for our third drive, 20K errors for our first drive, and 1K errors for our second drive. These numbers are normally very low on a system where the drives are in perfect condition and the RAID controller is operating correctly. I have seen situations where the Adaptec RAID controller automatically heals such errors and they go unattended many times. I kept monitoring the error counts for nearly an hour and could tell that two of the drives had definitely gone bad.
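For reference, this is roughly how we pulled the drive status and error counters; a minimal sketch assuming an Adaptec controller managed through the arcconf utility (controller ID 1 is an assumption, and the exact output varies between arcconf versions):

    # Overall controller, logical and physical device status
    arcconf GETCONFIG 1 AL

    # Per-drive details and error counters
    arcconf GETCONFIG 1 PD

    # Device event log, where the growing error counts showed up
    arcconf GETLOGS 1 DEVICE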
I opened a ticket with SoftLayer support, asking first for a chassis swap to see whether the issue was related to the RAID controller, and for a replacement of the drive that was generating most of the errors. Just for the record, this server uses RAID 5 and holds nearly 1.2TB of user data. Due to the nature of RAID 5, if two drives fail it is impossible to recover the data from the array; RAID 5 can only survive a single drive failure. Before going for the chassis swap, we ran a separate offsite backup with CDP 2 instead of CDP 3, just to make sure we were protected against any future data loss. Securing the server's data with CDP 2 on an offsite server took nearly 18 hours over a dedicated 100Mbps link. As soon as the fresh backup was complete, we booted into the rescue kernel and ran a file system check. Most of the partitions came back clean, although one partition was returning errors. We forcefully ran fsck and fixed all the logical errors in the file system before going for the chassis swap. There wasn't much data loss from the error; there were a couple of orphan inodes, which fsck.ext3 fixed.
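For anyone who needs to do the same, the forced check from the rescue kernel looked roughly like this; a sketch only, with /dev/sda3 standing in for the affected partition (the real device names on Hemonto differ), and the partition must be unmounted first:

    # Make sure the partition is not mounted before checking it
    umount /dev/sda3

    # Force a full check (-f), fix errors automatically (-y),
    # and print a progress indicator (-C 0)
    fsck.ext3 -f -y -C 0 /dev/sda3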
After completing the fsck, I sent the server in for the chassis swap, which took about an hour and a half. Once it was done, the server came back online. I immediately took a look at the RAID log and saw that it was reporting nearly 100 errors for each of the drives right after the OS booted on the new chassis. At that point I was quite sure we would have to restore from the backup server, as all of the drives might fail outright instead of rebuilding. The server build engineer, Robert from SoftLayer (he was the shift leader at the time; it was Sunday), took the server down again to replace a drive but couldn't bring it back up (as expected). Finally, after nearly 3 hours, he discovered that one of the drives had completely failed. Once his investigation was complete, he got the server back into a usable state with the other two drives and replaced the RAID controller with a higher model. SoftLayer employees ran an fsck on the system before starting the rebuild, just to make sure the file system was free of any further errors.
Once the fsck was done, they went on to replace the drives one by one. We ran all of the rebuilds from the BIOS instead of from the running OS; I was a little skeptical about the condition of those drives and felt safer doing it from the BIOS. We watched the status of each rebuild over IPMI. Each rebuild took nearly 3 hours, with roughly 40 minutes in between to swap in a drive after each rebuild. After we had replaced all 3 drives and rebuilt the whole array 3 times, SoftLayer employees brought the server back online.
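Watching a BIOS-level rebuild remotely is straightforward once Serial-over-LAN is configured; a rough sketch using ipmitool, with the host, user, and password as placeholder values:

    # Attach to the serial-over-LAN console to watch the RAID BIOS
    # rebuild screen (exit with the ~. escape sequence)
    ipmitool -I lanplus -H 10.0.0.50 -U ADMIN -P 'secret' sol activate

    # Confirm the box is still powered up between drive swaps
    ipmitool -I lanplus -H 10.0.0.50 -U ADMIN -P 'secret' chassis power status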
Right now, the Hemonto server is running on all new hardware. Fortunately, there was no data loss. Even if there had been, we probably would have had to wait a while, but since we had two versions of the data backed up, we would certainly have been able to recover it from one of the CDP versions. We did have to run a couple of cPanel fix scripts to sort out some permission errors. The /tmp partition came back with an uncorrectable error after the system was online. We immediately booted into the rescue kernel, recovered the superblock information with mke2fs, and ran an fsck. That brought the /tmp partition back into action.
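The superblock recovery itself is a standard ext3 procedure; here is a sketch assuming /tmp lives on /dev/sda2 (a placeholder device name, and the block size must match the original file system):

    # Dry run (-n): mke2fs only *prints* where the backup superblocks
    # would live for this partition, it does not touch any data
    mke2fs -n /dev/sda2

    # Point fsck at one of the listed backup superblocks (32768 is
    # typical for a 4K block size) so it can repair the primary one
    fsck.ext3 -y -b 32768 /dev/sda2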
Something over 48 hours later, once the assessment of the new drives and hardware was complete, I finally went to sleep. It is hard: when you have employees but you respect your clients' data above everything, you can't sleep well without making sure it is secure. I should thank all the SoftLayer employees who worked with me and my colleagues to patch up this hardware disaster. We had nearly 4 CSA techs working on the job due to the shift changes, and 4 SBTs (Server Build Technicians) replacing the chassis and the drives each time a rebuild was performed. Hardware issues are not uncommon, but it is rare to have your RAID controller and all of your hard drives failing at once.
This is just another story showing that RAID was never made for backup. So keep your backups offsite to stay safe!