Database Recovery from RAM

Introduction

Recently I had a client who basically lost his SQL Server databases. I decided to share this experience with you, so you can use it if you ever find yourself in so tough situation. The client found out that backup job was failing and it was failing undiscovered for quite some time. The moment they found that out was the moment when their RAID array failed and corrupted the database files. Application started to throw IO errors at users. Storage failed, files unreadable and no backups – time for me to do my magic!

The Details

SQL errorlog was full with:

I/O error 1117(error not found) detected during read at offset 0x00000445a8c000 in file 'E:\some_path\data.mdf'

1	I/O error 1117(error not found) detected during read at offset 0x00000445a8c000 in file 'E:\some_path\data.mdf'

Command “net helpmsg 1117” reveals the meaning of the error 1117 (the method you can use for similar log message codes):

The request could not be performed because of an I/O device error.

1	The request could not be performed because of an I/O device error.

Backups were all wiped-out and overwritten because backup job was deleting old backups even when new backups are not created successfully. Backup was not properly setup, no alerts, but hey – I am not complaining. Thanks to them I earn for a living 🙂

Normally, when storage is broken, data files not readable, no backups, and no new backup can be taken (because of storage failure) – most of DBAs would tell them to “kiss goodbye” with their databases and better prepare recovery strategy next time. But I didn’t gave up so easily. I ended up with saving their databases with almost no data loss (few fields of several rows were blank, but everything else and all the rows are saved, I will explain later), and here is briefly how I did it:

Investigation

First, the DBCC CHECKDB revealed exact places where data is corrupted or unreadable. Like this:

Server: Msg 8966, Level 16, State 1, Line 1
Could not read and latch page (3:2283615) with latch type UP. Latch failed.

Server: Msg 2533, Level 16, State 1, Line 1
Table error: Page (3:2239810) allocated to object ID 1219951868, index ID 0 was not seen.  Page may be invalid or have incorrect object ID information in its header.

Server: Msg 8966, Level 16, State 1, Line 1

Could not read and latch page (3:2283615) with latch type UP. Latch failed.

Server: Msg 2533, Level 16, State 1, Line 1

Table error: Page (3:2239810) allocated to object ID 1219951868, index ID 0 was not seen. Page may be invalid or have incorrect object ID information in its header.

Many places could be repaired without loss, like NC indexes and indexed views, but there was one really important table with corruption in the clustered index. And the minimal DBCC recovery option was “allow data loss”.
Good thing is they haven’t restarted the SQL Service, so many of the pages were cached in the buffer pool memory. I extracted as many data I could from RAM, using various tricks to select only cached rows to avoid error. Many columns could be extracted from healthy NC indexes of that table, but not all columns. Now I had all rows (but not all columns) copied to a freshly created database, but I could not make a backup to create a copy of the database – it was failing.

The right tool to copy damaged files

After exctracting everything possible from RAM, I shut down the SQL Service and I tried to copy corrupted data and log files. But, copy-paste failed because of IO errors. Also tried xcopy and robocopy – both failed to read the files.
I needed a tool that can copy corrupted files and replace unreadable parts with zeroes so the file size would stay the same as original. That is important because if we just skip or cut-off unreadable parts making copied files smaller, references to pages (page id) would not point to the right page anymore because they are positional, and some special system pages (SGAM, PFS, etc) would not be where expected, resulting in totally corrupted database. We need a tool that would copy readable parts, zero-out unreadable parts, and retain the original file size.

I tried first with Roadkil’s Unstoppable Copier. It takes forever, close to 12h of retrying (set to only 1 retry) before I cancelled it. Plus, I am not certain will it skip damaged parts so resulting file would be shorter and thus totally unusable/corrupted. I do not recommend that tool, at least not for SQL recovery.

„Bad Block Copy“ command-line utility saved the day: http://alter.org.ua/soft/win/bb_recover/

bbcopy.exe -f -r 1 -b 512 -Bk 64 -l bbcopy.log "E:\damaged_drive\data.mdf" "G:\healthy_place\data.mdf"

1	bbcopy.exe -f -r 1 -b 512 -Bk 64 -l bbcopy.log "E:\damaged_drive\data.mdf" "G:\healthy_place\data.mdf"

It copied all the files very fast, in about half hour, replacing only 8192 bytes (a single page) with zeroes.
That translated to 24 rows lost in one table.
But, since I partially reconstructed those rows from NC indexes that were previously in the buffer pool, only several columns of those 24 rows were actually lost. There was enough info in saved columns that the user could identify 2 order items described by those rows. They found those orders somewhere, and entered them manually into the system.
Final result – no data is lost, the battle is won!

Now they can set up proper recovery and alerting strategy to avoid this in the future.

Conclusion

If data corruption occurs, DO NOT restart the SQL service, do not detach your database, do not do any action that would flush the buffer pool cache out of memory. This is important because otherwise lost data cannot be extracted from RAM. Try to find backups, even old ones. Search for big files on all servers, PCs, laptops, everywhere. If not found, recovery is a painful process. But if done with expertise, with a few tricks, a total loss might be turned into “no data loss”. SQL Server databases are incredibly good at saving and protecting your data, even when they are not properly protected. In the right hands, data might be saved from seamingly impossible situation.