Abstract

Column store databases are widely accepted as the prevailing technology for analytical applications. As their data usually resides in-memory, they aim at the optimization of the access to the memory hierarchy. One way to achieve this is the employment of compression of the data. Compression leads to a better cache utilization and the reduction of transfer times and hence reduces I/O times. However, compression algorithms introduce a certain computational overhead. It is therefor crucial to implement them as efficient as possible. Numerous compression techniques can be found in the literature. While each of them produces compressed data of exactly one format, different formats might be differently well-suited for certain purposes. This motivates the need to change the compressed format some data is represented in. When done naively, the change of the compressed format requires a slow complete decompression and recompression of the data. Thus, direct transformations between different formats are desirable. Consequently, this thesis deals with two related classes of algorithms: compression techniques and transformation techniques. The efficient implementation of these algorithms is mainly achieved by making use of SIMD instruction set extensions. Several compression and transformation techniques are presented. Thereby, this thesis focuses on three classes of compression techniques: Run Length Encoding, Dictionary Coding and Null Suppression. An extensive evaluation conducted on synthetic test data verifies the high performance of the presented vectorized algorithms compared to purely sequential ones respectively of the presented transformation techniques compared to a complete decompression and recompression.

More