Optimal Data Organization for Hybrid Transactional/Analytical Processing Data Systems

(2019-) NSF Award #1850202, PI, 175K

Acknowledgment: This material is based upon work supported by the National Science Foundation under Grant No. IIS-1850202.
Disclaimer: Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


Scientific, commercial, and governmental applications increasingly rely on data-driven insights and decision-making that draw on both historical data and real-time updates. New workloads are generated by social feeds, sensor readings (a common use case of Internet-of-Things applications), and electronic micro-payments (an emerging model of e-commerce). These workloads have two things in common: (i) a very high volume of transactions and (ii) a high volume of analysis queries that need both historical and real-time data to provide useful and actionable insights. The primary challenge is that the two kinds of work have conflicting requirements and are typically served by different data system architectures. On the one hand, we want to be able to answer analysis queries like "what was the most discussed topic in each month of the past year?" or "what is the average power consumption per neighborhood of city X?". On the other hand, we want to efficiently store incoming updates and be able to provide real-time insights like "where do we have a power network overload now?" or "what is the probability that a disaster is taking place based on the social feeds of city X?".

Traditionally, data systems were engineered to efficiently support either a transactional workload (that is, quickly storing new items) or an analytical workload. The latter typically involves changing the data layout and organization, and building auxiliary indexing structures to allow for efficient data access. The emergence of complex workloads has created a need for new systems that can support hybrid transactional/analytical processing (HTAP). This research will make it possible to execute such workloads efficiently and to anticipate workload changes in a robust way. Ultimately, the project will make data ingestion and data analysis a smoother process and will enable complex applications to have their data analyzed quickly.

We propose a new optimal data organization framework that efficiently supports simple and complex workloads alike, offering robustly higher performance than the state of the art. To do so, we exploit any lack of disorder already present in the data and we balance the amount of "structure" added. Existing HTAP systems typically vary the vertical data layout, but do not search for an optimal "horizontal" data organization that balances read and update performance. Here, we lay the foundations for developing HTAP data systems that can optimally organize data and exploit "lack of disorder".
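
As a purely illustrative sketch of what "lack of disorder" could mean in practice, the snippet below counts how many keys in an incoming stream arrive after a larger key has already been seen. This is not the project's framework or metric; the function name `out_of_order_fraction` and the sample streams are hypothetical, chosen only to show that nearly sorted input needs far less reorganization than scrambled input.

```python
from typing import Sequence


def out_of_order_fraction(keys: Sequence[int]) -> float:
    """Return the fraction of keys that arrive after a larger key.

    Hypothetical disorder measure for illustration only: 0.0 means the
    stream is already sorted, values near 1.0 mean almost every key
    would have to be moved to restore order.
    """
    if not keys:
        return 0.0
    running_max = keys[0]
    out_of_order = 0
    for key in keys[1:]:
        if key < running_max:
            # This key is smaller than something already ingested,
            # so a sorted layout would have to place it earlier.
            out_of_order += 1
        else:
            running_max = key
    return out_of_order / len(keys)


# Made-up example streams of arriving keys (for instance, timestamps).
sorted_stream = [1, 2, 3, 4, 5, 6, 7, 8]
nearly_sorted = [1, 2, 4, 3, 5, 6, 8, 7]
scrambled = [8, 3, 6, 1, 7, 2, 5, 4]

for name, stream in [
    ("sorted", sorted_stream),
    ("nearly sorted", nearly_sorted),
    ("scrambled", scrambled),
]:
    print(f"{name}: {out_of_order_fraction(stream):.3f}")
```

Under this assumed measure, a system that sees a low fraction could append most incoming items in place and repair only the few stragglers, while a high fraction suggests that more "structure" has to be added up front to keep reads efficient.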