Table and Index Compression

Compression feature was reserved for Enterprise Edition of SQL Server. With SQL 2016 SP1 we get it in all editions, e.g. Standard Edition. But, it is not always easy to say will we benefit from compressing certain table/index/partition or not. Here we will discuss that.

Good things

Compression makes more rows fit on the single page, so tables and indexes take less pages. That means they will take less space in memory, so more data can be cached into RAM (buffer pool). They will take less space in data files and less space in backups (depending on do you compress your backups natively, and I hope you do). The average row size will become smaller.

The biggest benefit is on “coldest” data that is not cached in buffer pool at all and must be read from disk. Compressed partitions are retrieved from disk faster because we need to read less pages from the disk.

The “cold” data that is cached in the buffer pool, but is not accessed as frequently can also be beneficial to compress, so they take less space and more “hot” data can fit into buffer pool. It is similar to buying more memory.

Bad things

It can only compress data placed into in-row allocation unit. That means row overflow (variable-length columns that could not fit into page) and LOB data (“max” types and deprecated text and image data types) will not be compressed. But LOB types, e.g. varchar(max), if allowed, can be placed to in-row page when they fit, so they also will be compressed. But for huge documents that cannot fit to in-row page (>8KB), compression has no effect, it is not even attempted.

Even with compression turned on, only some pages will be compressed, not every page. If the data is such that there is no space gain from compression, the page will not be compressed.

CPU is spent on every access of the compressed data. Cached pages in memory are also compressed, so they need to be unpacked on every read, and repacked on every write. Therefore, read of cached data that is frequently accessed will spend substantial amount of CPU, and actually make operations on very hot data slower than without compression. Hot data is cached in buffer pool anyway, and adding extra step of decompression only slows-down the access to that rows. That is why compressing cached, frequently accessed data is usually not a very good idea for performance.

How it works, in simple words?

Row compression turns fixed data types into variable ones. For example, int normally takes 4 bytes, fixed. But, if you store the number 123 into int and ROW compress, it will take only 1 byte (+ overhead of becoming a variable type).

Page compression first does row compression, and then looks at the entire page with some dictionary and prefix algorithms to find the repeating sequences within that single page. Of course, repeating sequence is then stored only once + overhead of describing repeating info.

The overhead bytes that are added can make compressed data larger than original, uncompressed data. So it is a good thing that the engine can decide not to compress certain pages even if we “compressed” that “entire” object.

To compress or not to compress – that is the question!

Concentrate on big tables. You do not want to mess with thousands of small tables to get negligible space gains. Limit further investigation only on those tables.
Make sure the objects are not heavily fragmented before compression estimation using sys.dm_db_index_physical_stats. E.g. check if defragmentation jobs are running. Because estimation for heavily fragmented objects can give you very weird, unbelievable low or unbelievable high estimation numbers. You get wrong estimation for heavy fragmented objects. Page splits cause not only fragmentation but also a lower page space usage, so we are wasting space not only because data is not compressed, but because of not properly handling fragmentation.
Check what percentage of partition is cached in the buffer pool with sys.dm_os_buffer_descriptors. For tables/indexes/partitions that are cached almost whole, you will not get any performance gain. They will actually be slower when compressed. But, if they are not very frequently accessed, you can still have a gain because they take less space in cache when compressed, and more other “hot” data can fit into cache and read less from the disk. Tables that are cached almost nothing have a big potential for the compression opportunity. Because when needed, they must be read from the disk, and that will be faster if they are compressed.
Check how frequently every partition is accessed with sys.dm_db_index_operational_stats. The more frequently accessed, the more CPU will be spent. If your CPUs are heavily used already, this additional burden might be too much for them. But if you have a lot of CPU headroom, and small portion of that table/index/partition is cached, you might benefit from compression even with data that is accessed more often than rarely.
Check the potential compression gain with sys.sp_estimate_data_compression_savings. If the gain is not significant, do not compress that. You would only spend a CPU time for no actual gain. Row compression gains lower than 15% or page compression saving of less than 30% is probably not worth the hassle.
Consider using Columnstore indexes. Page compression typically give 50% reduction, but columnstore gives much more: around 10x reduction, depending on data. Columnstore is radically different, it is not built for quick lookup of small number of rows, but we also can have b-tree page-compressed indexes on top of clustered columnstore table. But that is something for a new topic.

Conclusion

Since SQL 2016 it is a feature in all editions, so from now on everyone should consider using it! Or consider buying a SQL2016 licence and get the compression any many other enterprise features that are now included in all editions for no extra cost – a great gift from Microsoft!