CouchSurfing International is a Non-Profit company.
The Crash
A technical explanation of the crash by Casey Fenton
What Happened
Up to that point, the data recovery effort had taken about 36 long, nail-biting hours. While the techs at the data center were trying to recover the drives, we looked into other options for restoration. At first we assumed that we would be able to recover from a backup about 24 hours old, but we soon discovered that the backup server only held backups about a week old. Since that older backup was all that was available, we decided to use it and start the restoration process. It was only when we instructed our hired system administrators to begin copying the data off the backup server that we discovered that at least 15 of about 100 tables were not in the backup set.

Upon further investigation we discovered that the system administrators we employed had switched backup methods a few weeks prior. Since the database server had been running slowly for many hours a day and members were complaining, we had asked the system administrators to do whatever they could to lessen the backup load on the server. They responded that they had changed the backup method so that there should now be minimal load on the server. We were on the road, traveling to Montreal at the time, and did not have time to double-check the work they had assured us they were doing correctly.

During the crash we figured out that the backup method they had switched to was rsync (file synchronization). They were copying the raw database data files to the remote backup server, which is a highly inferior way of backing up a database. Additionally, they had misconfigured rsync in one important way. The CouchSurfing MySQL database included two types of data tables. One type is called MyISAM and is for larger pieces of data that don't need to be accessed at high rates. MyISAM tables take up less storage space. Most of the 100 tables were of this type and were being backed up by rsync. The other table type is called InnoDB.
This table type takes twice as much storage space as MyISAM, but has the advantage that the data can be accessed by many server processes at once. The two table types were stored in different locations. The system administrators had been rsync-ing the MyISAM tables but not the InnoDB tables! The 15 InnoDB tables stored the most accessed information on the CouchSurfing website, including profiles, friend links, references, etc. The loss of these tables, with no recent backup, signaled the end of the CouchSurfing.com website as we knew it.

There was no uniform recent backup. The most recent backup we had was taken just a few days before the crash, but it was a copy of the backup server's files. These were the incomplete backup files that were a week old. While taking inventory to see what other backup information we had, we discovered a mixture, with most of the important files being a couple of months old or more.

Thirty-six hours after the crash, the data center informed us that they had tried to recover the data but there was nothing that could be done. The data was not recoverable. With no recoverable data and no recent backups for the most important CS database tables, the website seemed to be irreparable. This was when I decided to make the announcement that CouchSurfing.com was finished. I wrote a letter to the community explaining what had happened and posted it on the website, approximately 48 hours after the crash.

What happened next was unbelievable. Within the following 24 hours we received more than 2000 emails of support from members saying that they simply could not accept the demise of CouchSurfing, that they wanted to help bring it back, and that they would have no problem re-entering their profile information. Many users said that they didn't mind if the databases were zeroed out and the community started completely from scratch.
I was reminded that the CS community is not about the data, or about the furniture; it is about the network and the friendships that have already been created. The data was dead, but the community was alive.

On Friday, June 30th, I left the Montreal Collective to remove myself from the intensity and take some time to reflect upon the recent events. A good night's sleep and some of Aldo's coffee revived me, and I began to read many of the 2000+ emails that had come pouring into CouchSurfing. It was clear that CS could not die. The community would do whatever it took to carry on.

At about that same time in the afternoon, the data center contacted us and indicated that they were trying to recover the data again. Apparently they had seen the letter I sent and wanted to do whatever it took to make sure that CouchSurfing.com didn't die. They assured us that they were working with data forensics experts to maximize the chances of recovering the data. As of the time of this writing, they report that they are still attempting to recover the data. We should know in a week if this is possible.

That evening, with the support of the community, I started to develop a plan. We decided that it would be worth continuing to develop CouchSurfing.com if the community would be willing to participate in an even deeper way and take on the majority of the workload. It was apparent that I just couldn't do all of the work myself. The plan was to gather as much data as we could and re-launch the site as soon as possible. The rest is described in this section of the website.
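The two backup problems described above (tables silently missing from the backup set, and a newest backup that was far older than assumed) are the kind of thing a small automated check can flag before a crash does. What follows is only a minimal sketch in Python, not what CouchSurfing or its system administrators actually ran. The function name `audit_backup`, the table names, the dates, and the 24-hour threshold are all hypothetical and chosen for illustration.

```python
from datetime import datetime, timedelta


def audit_backup(expected_tables, backed_up_tables, backup_time, now,
                 max_age=timedelta(hours=24)):
    """Compare a backup set against what the live database should contain.

    All arguments are supplied by the caller; how the table lists and the
    backup timestamp are obtained is outside the scope of this sketch.
    """
    # Tables that exist on the live server but are absent from the backup.
    missing = sorted(set(expected_tables) - set(backed_up_tables))
    # How old the newest backup is, and whether that exceeds the threshold.
    age = now - backup_time
    return {
        "missing": missing,
        "age": age,
        "stale": age > max_age,
        "ok": not missing and age <= max_age,
    }


# Hypothetical example: four live tables, only one present in a
# week-old backup set.
live = ["profiles", "friend_links", "references", "groups"]
in_backup = ["groups"]
report = audit_backup(
    live,
    in_backup,
    backup_time=datetime(2000, 1, 1),
    now=datetime(2000, 1, 8),
)
if not report["ok"]:
    print("Backup problem:", report["missing"], "stale:", report["stale"])
```

Run on a schedule, with an alert whenever `ok` is false, a check along these lines reports a gap when it first appears rather than during a restore.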
So, what exactly did we lose?
We were able to recreate empty "place holders" for those people whose profiles didn't survive the crash. When you log in you will receive a message explaining that your profile was lost, but that your username and password were recovered and your place was held in the database. Unfortunately, some other data was not so lucky. We lost up to several months of references, email, group posts, and friend links. We've done our best to recreate the data.

It was also discovered that the European server had about 36 gigabytes of image cache data. We were able to transfer these cached images back to the North American database server and re-populate the image table, losing only a minimal number of user photos.

We will be recovering more data as time goes on. We've got the support of some of the best minds in forensic data recovery and MySQL administration, including James Day of Wikipedia. As we recover more data, we will either merge it back into the website or hold on to it in case it is needed at some time in the future. We will make every effort to recover every possible piece of data that existed prior to the crash.
How are we ensuring we never lose data again?
THANK YOU!
© 1999-2008 CouchSurfing International Inc. - a Non-Profit Organization. 'CouchSurfing' and 'CouchSurfer' are registered and unregistered service marks of CouchSurfing International. - CS Release: Eagle**