Database recovery IMPORTANT rule

Recently, I had a case with a customer in a really difficult situation. They accidentally deleted important data and didn’ know how to recover, a sad story. Here is how it went (I’m paraphrasing the dialog):

Customer: We accidentally dropped a column, VARBINARY(MAX) with lot’s of important documents. It happened 3 days ago and our efforts to recover were in vain.

Me: Well, restore from a backup.

Customer: We wanted, but found out that database does not have a backup. The latest one was 2 years ago.

Me: Ok, then do a log backup so we can maybe restore log with STOP just before time of that accidental DROP COLUMN.

Customer: That database is in SIMPLE recovery model. No log backups are possible.

Me: You keep important production database in SIMPLE recovery model!? Ok, there still is a good chance to recover. DROP LOB COLUMN does not really remove the data, it just deallocates, marking the pages as free space inside the data file (mdf). We can use certain tools like ApexSQL Recover, to scan for this free space and dig out the data. Just please, ASAP, put that DB offline and COPY mdf and ldf files to a safe place. Then put DB back online for users to continue. That will be our starting point to return to for recovery attempt.

Customer: Hm. Copy of mdf/ldf? Why don’t we just take a backup, restore it and start from there each time?

Me: Because dropped data is in pages marked as “empty space” inside the data file. Backup takes only used pages from data files, skipping “empty” parts. Which is logical, expected and normal.

Customer: Khm, hm. We took a full backup (after DROP COLUMN) as our recovery starting point. So you are saying that backup is not good enough?

Me: Exactly, database backup does not contain “empty” pages with deleted data that interests us. Moreover, when you do a RESTORE, by default first step is to zero-out (write 0’s) to entire area of future restored data file. Then data is copied from bak into that zeroed mdf. (Unless “Perform Volume Maintenance Tasks” privilege is given, which isn’t). So all “empty” pages are zeroes after restore, no leftover data is there on restored database.

Customer: Khm, hm, caugh. Errr, we tried a thing, that didn’t work, then we decided to revert back to a starting point for another try. We did the revert by restoring the backup over the prod database…

Me: Wait, what? You took the backup AFTER data was deleted, and then you OVERWROTE the production database with it!???

Customer: Yes.

Me: #$%#! Why didn’t you restore aside, under a different name at least??? Do you happen to have any disk-level snapshots/backups or VM-level snapshots/backups at ANY time prior the RESTORE? I mean, VM snapshot BEFORE you do something risky would be a logical move, right?

Customer: Yes, we took one VM shapshot AFTER the restore (aka after data was wiped from empty space). We didn’t do any snapshots before that.

Me: :facepalm: I don’t know how to tell you that, but your data is GONE! You did all you could to wipe it out completely, as some attacker would. Probably was not intentional, but it is a remarkable array of mistakes.

CONCLUSION

When accidental DROP/DELETE occurs

  • Do a VM snapshot immediately. That is your “point to return to” for another recovery attempt.
  • Use backups if you have them, and restore to a DIFFERENT database name. Maybe even different server. Do NOT overwrite that prod database!
    • FULL recovery enables you to restore to ANY point in time, even between two log backups.
  • If there are NO backups, put the DB offline, copy mdf/ndf/ldf aside, put DB back online.
    • This needs to be done FAST, before that empty space is overwritten with new data!
  • Ask expert’s help or use a recovery tool on a CLONE of the database (backup/restore won’t cut it). Short advice is usually free from most experts (from me is). Even if not free, or longer action is needed, it is better to spend some small amount of money than to lose important data. It is an investment to learn how to avoid future fails.

Be prepared! (not surprised)

BASIC part

  • Implement automatic checks “does backup exist?”. For all databases, a recent full/diff backup in the last 24h should exist.
    • I have a ready-made system to do just that.
  • Make sure important databases are in FULL recovery model
    • Unless you are ok to lose last 24h of data. Like some DWH databases are.
  • Make sure you scan for servers that are not on your list. There might be ones important, but out of your radar (as in this case).
    • Scan the network occasionally (automatic?) for SQL Servers eg. by using MAP Toolkit.
    • You would be surprised to find some servers you supposed to take care off and didn’t know they exist.
  • Do a VM snapshot of DB server before you do anything risky. Like:
    • BEFORE upgrading/installing a software.
    • BEFORE trying to recover deleted data.
  • Installations/upgrades/risky things should go to the test servers first.
    • Or at least on less important prod servers before implementing them on the most critical ones
    • Do NOT let ANY software, including Antivirus/security (CrowdStrike, I’m looking at you!) and OS patches install automatically. AV definitions are ok, it is not a code change. Install on your watch, controlled way, and verify after install does everything works fine.

For extra security

  • Implement automatic “restorability checks”.
    • I have a ready-made system that not only verifies restorability by restoring full, diff, and all log backups, but also does a CHECKDB on restored databases, offloading the production of that resource-intensive task. Then it drops the DB and restores the next one. Email is sent if restore fails or checkdb fails.
    • You would be stunned to see how often for various reasons backups you THINK you have are not restorable. For example, someone takes a full backup without copy-only to the laptop and goes to vacation. From that moment, all DIFF backups are not restorable till the next FULL backup, because full backup they are based on is not accessible. It’s on that guy’s laptop who’s SSD maybe just failed.
  • Implement automatic storage SNAPSHOTS for storage with backups.
    • Even if attacker/ransomware encrypts your backups, snapshot holds unencrypted version. Though a bit older, so some data loss is expected.
  • Implement storage HARDENING procedure for your storage model
    • Storage with backups should be hard to penetrate. Guard it with you life (or resume).
    • Eg, if your storage for backups is Synology NAS, google “Synology hardening” and implement that.
  • Storage with backups should be a DIFFERENT device from prod data/log disks
    • That is a minimum. It could also be a different rack, room, data center, tectonic plate or planet. Just don’t put backups on the same storage as the data it guards! (this one should be in the “basic” section now when I am thinking about it)
    • People who do backups to a “local drive”, yes, YOU I am referring to! When that storage fails (and storage DOES fail) you better have a good CV ready for the next job.
    • It should be a different OS too. Eg some form of linux. Don’t protect you Windows VMs with backups to another windows VM!!! I’ve seen some do it. Backup VM was on the SAME host as some other VMs. Host gone – backups gone.
  • Have a second copy of backups.
    • That one is usually further away and more difficult to restore (restore takes longer). But more difficult to reach by an attacker too.
    • It is better to “pull” than to “push” to the 2nd copy. Because if you pull from NAS2, you only have READ permission on NAS1. That is more secure than giving WRITE permission to NAS1 for NAS2, which is a hole an attacker could leverage.

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.