Although this is my final essay for Advanced Database Systems, I thought it was pretty good blog content, so I’m placing it here! Notably, I also wrote this essay without using ChatGPT.

Introduction

Database management systems (“DBMSs”) designed for online analytical processing (“OLAP”) workloads have become a ubiquitous presence in the technological landscape due to the rise of fields such as big data analytics and data science. One main goal of the designers of OLAP systems is to make them as performant as possible, and the techniques applied toward this goal have produced a myriad of implementation methods for accelerating query execution. In this essay, we cover the major milestones and key developments in query execution for OLAP workloads over the past twenty-three years. We conclude by discussing the optimization that developers should consider first when building a new OLAP DBMS.

Key Developments in Query Execution

Early Columnar Storage Database Systems

In the early 2000s, data warehouses such as Sybase IQ and Addamark demonstrated the benefits of a column-store architecture. A few years later, two especially impactful column-store systems were released: MonetDB and C-Store.

MonetDB is a column-store relational database system developed at Centrum Wiskunde & Informatica (CWI) in the early 2000s. The first version of MonetDB used the materialization query execution model. Unfortunately, the materialization model caused the performance of large queries to be bottlenecked by memory bandwidth. In 2005, the MonetDB team rewrote their query engine (calling it “X100”) to use a vectorized query model [2]. This allowed the execution engine to reap the benefits of the Volcano model without sacrificing the important compiler optimizations (e.g. loop pipelining, which overlaps the execution of independent loop iterations on modern CPUs) that the materialization model enables. Overall, the MonetDB/X100 paper showed that with the vectorized query execution model, one can design an execution engine that takes full advantage of the parallelism available in modern processors.
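
To make the contrast concrete, here is a minimal sketch (not MonetDB/X100's actual code) of vector-at-a-time processing: the interpretation overhead is paid once per batch rather than once per tuple, and the inner loop is a tight kernel the compiler can pipeline. The batch size of 1024 is an illustrative choice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kVectorSize = 1024;  // illustrative batch size

// Vectorized primitive: add a constant to a batch of int64 values.
// The simple loop body is easy for the compiler to pipeline and auto-vectorize.
void add_constant(const int64_t* in, int64_t* out, size_t n, int64_t c) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] + c;
    }
}

// Driver: process a column one batch at a time, instead of one tuple at a
// time (Volcano) or the whole column at once (full materialization).
void process_column(const std::vector<int64_t>& column, std::vector<int64_t>& result) {
    result.resize(column.size());
    for (size_t off = 0; off < column.size(); off += kVectorSize) {
        size_t n = std::min(kVectorSize, column.size() - off);
        add_constant(column.data() + off, result.data() + off, n, 42);
    }
}
```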

C-Store [10] is a column-store relational database system developed by researchers at the Massachusetts Institute of Technology, Brandeis University, the University of Massachusetts Boston, and Brown University, and released in 2005. The design of C-Store is optimized for a shared-nothing distributed architecture. C-Store also compresses its data internally, which is discussed in the next section.

Compression in Column Store DBMSs

Data compression is important for accelerating query execution because it improves I/O performance (less data to read means shorter seek and transfer times) and increases the buffer hit rate.

As mentioned in the previous section, C-Store is a column-store relational DBMS that compresses its data internally. In 2006, Daniel Abadi et al. evaluate the compression methods used in C-Store [1], which include null suppression, dictionary encoding, run-length encoding, and bit-vector encoding, as well as several heavyweight compression schemes, including Lempel-Ziv, Huffman, and arithmetic encoding. They also create a decision tree for choosing among compression schemes in OLAP DBMSs. The paper finds that choosing the correct compression scheme depends not only on data properties, but also on the query workload.
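
As an illustration of the lightweight end of that spectrum, here is a minimal run-length encoding sketch (illustrative, not C-Store's implementation). RLE shines on sorted, low-cardinality columns, where long runs of repeated values are common:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Run-length encoding: store each run of equal values as (value, run length).
// e.g. the column {7, 7, 7, 9, 9} encodes as {(7, 3), (9, 2)}.
std::vector<std::pair<int32_t, uint32_t>> rle_encode(const std::vector<int32_t>& col) {
    std::vector<std::pair<int32_t, uint32_t>> runs;
    for (int32_t v : col) {
        if (!runs.empty() && runs.back().first == v) {
            ++runs.back().second;    // extend the current run
        } else {
            runs.push_back({v, 1});  // start a new run
        }
    }
    return runs;
}
```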

Taking Advantage of Parallelism

The increased availability and performance of multi-core hardware in the 2010s led database system developers to take advantage of it as best they could. The results of their efforts include key advancements in query scheduling, query vectorization, query compilation, and parallel join algorithms. Ultimately, these developments led to increased query performance in their respective OLAP systems.

Query Scheduling

One of the main developments in query scheduling is non-uniform memory access (NUMA)-aware query schedulers. Correctly utilizing multi-processors with NUMA can yield significant performance benefits, but it requires coordinating memory accesses and processing tasks across the different cores: because accessing memory over the interconnect is more expensive than accessing local memory, getting this wrong makes query performance significantly worse. In 2014, Viktor Leis et al. present a NUMA-aware, morsel-driven query scheduler [3], implemented in HyPer, the Technical University of Munich's (TUM) academic DBMS. In their implementation, workers perform cooperative scheduling by pulling tasks off a shared task queue. A worker first tries to execute tasks that operate on morsels in its local memory, but if none are available, it simply pulls the next task off the queue. HyPer also implements work stealing, so that idle workers can steal tasks from workers that are taking too long to finish the tasks they have already dequeued. In 2021, TUM published a paper on an improved lock-free, self-tuning, NUMA-aware, morsel-driven stride scheduler [12], implemented in Umbra, HyPer's successor. Although it uses some of the techniques from the 2014 paper, there are key differences. First, the scheduler is thread-local: each thread determines which active tasks to execute by examining its own metadata. Second, the scheduler is self-tuning: it simulates its own execution, then adjusts its own hyperparameters by keeping track of query priorities.
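
A heavily simplified sketch of the morsel-driven idea follows: work is split into fixed-size morsels on a shared queue, and each worker prefers morsels whose data lives on its own NUMA node before falling back to any remaining work. The real implementations (per-socket queues, lock-free structures, priority tracking) are considerably more involved; the types and the node-affinity field here are placeholders.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <optional>

struct Morsel {
    size_t begin, end;  // range of tuples to process
    int numa_node;      // node where this range of the column resides
};

class MorselQueue {
    std::deque<Morsel> queue_;
    std::mutex mu_;

public:
    void push(Morsel m) {
        std::lock_guard<std::mutex> g(mu_);
        queue_.push_back(m);
    }

    // Prefer a NUMA-local morsel; otherwise take any remaining morsel,
    // since processing a remote morsel beats sitting idle.
    std::optional<Morsel> next(int my_node) {
        std::lock_guard<std::mutex> g(mu_);
        for (auto it = queue_.begin(); it != queue_.end(); ++it) {
            if (it->numa_node == my_node) {
                Morsel m = *it;
                queue_.erase(it);
                return m;
            }
        }
        if (queue_.empty()) return std::nullopt;
        Morsel m = queue_.front();
        queue_.pop_front();
        return m;
    }
};
```

Each worker thread then loops on next(my_node), processing morsels until the queue is exhausted; because remote work is taken only when no local work remains, this mimics the "prefer local, then steal" policy in miniature.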

Query Compilation

One important technique for speeding up OLAP queries is query compilation, where the query engine generates code, either via just-in-time compilation or via transpilation. Query code generation decreases the instruction count of the executed code while doing the same amount of work. One example of research showing that query compilation is an effective optimization technique is Thomas Neumann's 2011 VLDB paper, where he presents a new data-centric compilation strategy that compiles a query plan into machine code using LLVM [6]. His compilation strategy focuses on pushing data between operators so that tuples can remain in CPU registers for as long as possible. Overall, this strategy reduces the raw CPU cost of query processing, and the performance of its implementation (in HyPer) rivals hand-optimized code.
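
The intuition is easiest to see in the shape of the code such an engine emits. For a query like SELECT SUM(price) FROM sales WHERE price > 100, a data-centric compiler generates, in effect, one fused loop per pipeline rather than a chain of per-tuple operator calls. A hand-written C++ analogue (not HyPer's actual generated LLVM code) looks like this:

```cpp
#include <cstdint>
#include <vector>

// Hand-written analogue of data-centric generated code for:
//   SELECT SUM(price) FROM sales WHERE price > 100
// Scan, filter, and aggregation are fused into one loop, so the current
// tuple stays in a register and there are no per-tuple virtual calls.
int64_t compiled_pipeline(const std::vector<int64_t>& price) {
    int64_t sum = 0;           // aggregate state lives in a register
    for (int64_t p : price) {  // scan
        if (p > 100) {         // filter
            sum += p;          // aggregate
        }
    }
    return sum;
}
```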

Meta’s Velox, an open source C++ execution engine acceleration library, also uses query compilation [7]. Velox applies it in its vectorized expression evaluation engine, which is used to evaluate filter and projection operators and to implement predicate pushdown for the table scan operator. The engine takes a list of expression trees and applies several compiler optimizations to them, including common subexpression elimination and constant folding. It also reorders the Boolean expressions within a conjunction by order of selectivity, and uses memoization to reuse compiled objects as needed. The expression evaluation engine returns a compiled, executable expression.
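
As a toy illustration of one of those rewrites, the sketch below constant-folds an expression tree, so that e.g. x + (2 * 3) becomes x + 6 before evaluation begins. This is a minimal sketch under assumed types; Velox's real expression engine and its Expr classes are far richer.

```cpp
#include <memory>

// Toy expression tree: a constant, a column reference, or a binary Add/Mul.
struct Expr {
    enum Kind { Const, Column, Add, Mul } kind;
    long value = 0;                  // payload for Const nodes
    std::unique_ptr<Expr> lhs, rhs;  // children for Add/Mul nodes
};

// Recursively collapse subtrees whose operands are all constants
// into a single constant node, so the work is done once at compile time.
void constant_fold(Expr& e) {
    if (e.kind != Expr::Add && e.kind != Expr::Mul) return;
    constant_fold(*e.lhs);
    constant_fold(*e.rhs);
    if (e.lhs->kind == Expr::Const && e.rhs->kind == Expr::Const) {
        e.value = (e.kind == Expr::Add) ? e.lhs->value + e.rhs->value
                                        : e.lhs->value * e.rhs->value;
        e.kind = Expr::Const;
        e.lhs.reset();
        e.rhs.reset();
    }
}
```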

Although both HyPer and Velox use query compilation to speed up query execution, and both use a push-based query processing model, their methods and the extents to which they use query compilation are very different. Since Velox only implements an execution engine library, query compilation is not deeply embedded in the system and is used only in a limited context. HyPer's execution engine, on the other hand, was rewritten around the new data-centric compilation strategy.

Query Vectorization

Essentially every OLAP DBMS uses vectorized query execution with Single Instruction/Multiple Data (SIMD) instructions. SIMD improves the performance of analytical queries because it lets the execution engine process multiple data elements with a single instruction. For example, in 2015, Polychroniou et al. present implementations of selection scans, hash tables, partitioning, sorting, and joins using advanced SIMD operations [8]. For instance, they implement selection scans by evaluating the predicate on many tuples in parallel, storing a bitmask of the qualifying offsets, and then immediately propagating the qualifying tuples to the output vector. They show that their operators accelerate query execution on both mainstream CPUs and a MIC-based Xeon Phi co-processor, compared to both scalar and compiler-vectorized implementations of the same database operators.
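
A minimal AVX2 sketch of such a selection scan follows: eight 32-bit keys are compared per instruction, the comparison result is turned into a bitmask, and the offsets of qualifying tuples are appended to the output. The paper's versions are more sophisticated (e.g. using selective stores and gathers); this sketch assumes an AVX2-capable x86 CPU, a GCC/Clang compiler, and n divisible by 8 (a scalar tail loop, omitted here, handles the remainder).

```cpp
#include <immintrin.h>  // AVX2 intrinsics
#include <cstddef>
#include <cstdint>

// Write the offsets of all elements with col[i] > key into out; return count.
size_t selection_scan_gt(const int32_t* col, size_t n, int32_t key, uint32_t* out) {
    size_t found = 0;
    const __m256i keys = _mm256_set1_epi32(key);
    for (size_t i = 0; i < n; i += 8) {
        __m256i vals = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(col + i));
        __m256i gt   = _mm256_cmpgt_epi32(vals, keys);           // 8 compares at once
        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(gt));  // one bit per lane
        while (mask) {                                           // emit qualifying offsets
            int lane = __builtin_ctz(mask);
            out[found++] = static_cast<uint32_t>(i + lane);
            mask &= mask - 1;                                    // clear lowest set bit
        }
    }
    return found;
}
```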

The Effects of Cloud Platforms

Cloud computing rose to prominence in the 2010s, and cloud platforms now offer essentially “unlimited” storage and compute on demand. Cloud computing has allowed database vendors to offer their products as a service, meaning the database is stored in and accessed entirely through the cloud. Two examples of OLAP databases offered as a service are Amazon Redshift and Snowflake. The prominence of cloud computing has also led to the use of cloud object stores as a data storage backend: the DBMS stores the database contents in large, immutable files in a service such as Amazon S3, Azure Blob Storage, or Google Cloud Storage. The benefits of this approach include not having to write a storage system, nearly unlimited scalable storage, and the ability to separate one's storage and compute resources, the last of which is discussed in the next section.

Distributed Storage and Compute

Prior to the existence of cloud platforms, most distributed OLAP systems had shared-nothing architectures; for instance, both C-Store [10] and Greenplum [4] are shared-nothing systems. To take advantage of the storage and compute resources offered by the cloud, DBMSs now use separate distributed compute and storage layers that scale independently of each other. Two modern OLAP systems that use this technique are Snowflake and Google Dremel.

Snowflake describes its approach to disaggregating storage and compute in its 2020 NSDI paper [11]. Snowflake separates compute, ephemeral storage (intermediate data used during computation), and persistent storage. Compute is represented by the abstraction of a virtual warehouse (VW): a set of AWS EC2 instances that runs a set of queries. The EC2 instances in a VW share a distributed local storage system used for ephemeral storage, and Snowflake uses AWS S3 for persistent storage. By adding the ephemeral storage layer, Snowflake maintains high levels of both scalability and availability without sacrificing performance.

In the same year, Google described Dremel's disaggregated architecture [5]. Although Dremel began as a shared-nothing distributed system, it now uses Borg, Google's cluster management system, for compute, and a distributed, replicated storage system for storage. Dremel also makes extensive use of a distributed in-memory shuffle tier during query processing, which reduces memory access overhead. As a result, Dremel is significantly more scalable, faster, and capable of processing petabyte-sized tables.

The main ideas behind Snowflake's and Dremel's disaggregated storage/compute architectures are similar, but their implementations differ. Both systems have three main layers: a compute layer consisting purely of compute nodes, an intermediate local storage layer, and a persistent storage layer. However, Snowflake uses S3 for its persistent storage layer, while Dremel uses local disks managed by independent servers. Additionally, Snowflake uses its intermediate storage layer for two purposes: to exchange data between compute nodes in the same VW, and to serve as a write-through cache for persistent data. Dremel, meanwhile, uses its intermediate storage layer to store the intermediate results of queries.

Conclusion

The optimization that developers should consider first when building a new OLAP DBMS is a push-based query processing model. In 2011, HyPer [6] showed that a push-based model can keep a tuple in the CPU's registers for as long as possible, making good use of modern CPUs. Additionally, both Velox [7] and DuckDB [9] use a push-based query processing model with strong query performance; for example, Velox reports that its engine is, on average, 6 to 7 times faster than Presto's Java engine. Lastly, a query execution engine based on the push-based model naturally supports parallelism. For example, DuckDB's implementations of the Source and Sink interfaces, which produce input data and collect operator results, respectively, are parallelism-aware.
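
To make "push-based" concrete, here is a minimal sketch of the control flow (hypothetical interfaces, not DuckDB's or Velox's actual APIs): the source drives execution by pushing each batch down the pipeline, rather than the sink pulling tuples upward through Next() calls.

```cpp
#include <cstdint>
#include <vector>

using Batch = std::vector<int64_t>;

// Sink: collects the results of the pipeline.
struct Sink {
    int64_t total = 0;
    void consume(const Batch& b) {
        for (int64_t v : b) total += v;
    }
};

// Intermediate operator: filters a batch, then pushes it downstream.
struct FilterGt {
    int64_t threshold;
    Sink& out;
    void push(const Batch& b) {
        Batch kept;
        for (int64_t v : b)
            if (v > threshold) kept.push_back(v);
        out.consume(kept);
    }
};

// Source: drives the pipeline by pushing batches until the data is exhausted.
// Parallelism falls out naturally: different threads can push disjoint batches
// into thread-local sink states that are merged at the end.
void run(const std::vector<Batch>& morsels, FilterGt& filter) {
    for (const Batch& b : morsels) filter.push(b);
}
```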

References

  • [1] Abadi, D., Madden, S., and Ferreira, M. Integrating compression and execution in column-oriented database systems. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2006), SIGMOD ’06, Association for Computing Machinery, pp. 671–682.
  • [2] Boncz, P. A., Zukowski, M., and Nes, N. MonetDB/X100: Hyper-pipelining query execution. In Conference on Innovative Data Systems Research (2005), vol. 5, pp. 225–237.
  • [3] Leis, V., Boncz, P., Kemper, A., and Neumann, T. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2014), SIGMOD ’14, Association for Computing Machinery, pp. 743–754.
  • [4] Lyu, Z., Zhang, H. H., Xiong, G., Guo, G., Wang, H., Chen, J., Praveen, A., Yang, Y., Gao, X., Wang, A., Lin, W., Agrawal, A., Yang, J., Wu, H., Li, X., Guo, F., Wu, J., Zhang, J., and Raghavan, V. Greenplum: A hybrid database for transactional and analytical workloads. In Proceedings of the 2021 International Conference on Management of Data (New York, NY, USA, 2021), SIGMOD ’21, Association for Computing Machinery, pp. 2530–2542.
  • [5] Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T., Ahmadi, H., Delorey, D., Min, S., Pasumansky, M., and Shute, J. Dremel: A decade of interactive SQL analysis at web scale. Proc. VLDB Endow. 13, 12 (aug 2020), 3461–3472.
  • [6] Neumann, T. Efficiently compiling efficient query plans for modern hardware. Proc. VLDB Endow. 4, 9 (jun 2011), 539–550.
  • [7] Pedreira, P., Erling, O., Basmanova, M., Wilfong, K., Sakka, L., Pai, K., He, W., and Chattopadhyay, B. Velox: Meta’s unified execution engine. Proc. VLDB Endow. 15, 12 (aug 2022), 3372–3384.
  • [8] Polychroniou, O., Raghavan, A., and Ross, K. A. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2015), SIGMOD ’15, Association for Computing Machinery, pp. 1493–1508.
  • [9] Raasveldt, M., and Mühleisen, H. DuckDB: An embeddable analytical database. In Proceedings of the 2019 International Conference on Management of Data (New York, NY, USA, 2019), SIGMOD ’19, Association for Computing Machinery, pp. 1981–1984.
  • [10] Stonebraker, M., Abadi, D. J., Batkin, A., Chen, X., Cherniack, M., Ferreira, M., Lau, E., Lin, A., Madden, S., O’Neil, E., O’Neil, P., Rasin, A., Tran, N., and Zdonik, S. C-Store: A column-oriented DBMS. In Proceedings of the 31st International Conference on Very Large Data Bases (2005), VLDB ’05, VLDB Endowment, pp. 553–564.
  • [11] Vuppalapati, M., Miron, J., Agarwal, R., Truong, D., Motivala, A., and Cruanes, T. Building an elastic query engine on disaggregated storage. In Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation (USA, 2020), NSDI’20, USENIX Association, pp. 449–462.
  • [12] Wagner, B., Kohn, A., and Neumann, T. Self-tuning query scheduling for analytical workloads. In Proceedings of the 2021 International Conference on Management of Data (New York, NY, USA, 2021), SIGMOD ’21, Association for Computing Machinery, pp. 1879–1891.