CS 591 A1

Data Systems Architectures

Class at a glance

Class: Tu/Th 12:30-1:45pm, MCS B23
Instructor: Manos Athanassoulis 

Teaching Assistant: Subhadeep Sarkar 
(Office Hours: MCS 283, M/W 2-3pm)

Office: MCS 279
OH: Tu 2-3pm/Th 2:30-3:30pm

Discussion on Piazza


  • Final grades are now uploaded!
  • Register for your project presentation.
  • A list of useful links added in the project page.
  • Our first visitor is coming on March 5th!
  • More details about the projects are available in the project page.
  • Slides for Class 1 are available. We will be posting slides as soon as possible after class.
  • Register for presentations!
  • The schedule is populated with the papers to discuss.
  • Syllabus is now online.
  • Website is up. Stay tuned for more!
  • Class starts on 1/22.

Class Milestones - Important Dates

  • February 4th, last day to add (any) class
  • February 8th, select a project
  • February 22nd, come up with a proposal (which you have discussed in OH)
  • March 8th, submit your mid-way project report
  • April 28th, April 29th 11:59pm, submit your final project report/code
  • April 30th and May 2nd, project presentation (in class)
  • May 7th 11:59pm, final submission of amended report (if needed) with maximum project grade change 10%

Also, before each class submit your paper review!

Class Schedule

Here you can find the tentative schedule of the class (which might change as the semester progresses).

Class : Introduction to Data Systems and CS591

In this class we will discuss the basics of data systems and the goals and structure of the course.


Class : Data Systems Architectures Essentials – Part 1

In this class we discuss the fundamental components that comprise a database system. We will see the commonalities and the differences of the main database system architectures and we will discuss why we have several different ones.


Class : Data Systems Architectures Essentials – Part 2

In this class we continue discussing data systems architectures and the basics for modern systems including relational, graph systems, and key-value stores.


Class : Class Project Overview

In this class the students will be introduced to the class semester project.


Class : Storage Layouts: Row-Stores vs. Column-Stores

Concepts: column-stores, row-stores, vertical partitioning, index-only plans, materialized views, tuple reconstruction, late/early materialization, block iteration, vectorized execution (block iteration), compression (run length encoding), hash joins, index joins, sort-merge joins, invisible joins, star schema


Class : Storage Layouts: Adaptive & Hybrid Layouts

Concepts: on-line transaction processing (OLTP), on-line analytical processing (OLAP), n-ary storage model (NSM), decomposition storage model (DSM), partition attributes across (PAX), flexible storage model (FSM), projectivity, selectivity, concurrency control, multi-version concurrency control (MVCC), two-phase locking (2PL)


Class : New Hardware: Data Systems for Flash

Concepts: hard disk drives (HDD), solid-state drives (SSD), flash-based SSDs, in-place vs. out-of-place updates, external sorting, migration overhead, flash-aware design (avoid random writes, prefer sequential access patterns, minimize physical writes per logical update), read-memory-update tradeoff


Class : New Hardware: Data Systems for Multi-Core

Concepts: multi-core, many-core, multi-socket, load balancing, skew resistance, context switching, non-uniform memory architectures (NUMA), pipeline breaker, elasticity, thread pool, just-in-time (JIT) code compilation, lock-free data structures, hyper-threading, translation lookaside buffer (TLB), open addressing, morsel-driven parallelism, dynamic hashing, outer join, semi-join, anti-join, radix join


Class : Indexing A: Updateable Bitmaps

Concepts: bitmap indexing, bitvectors, fence pointers, out-of-place updates, query-driven merging, bitmap encoding schemes (RLE, BBC)


Class : Indexing B: Access Path Selection

Concepts: access path selection, selectivity, concurrency, break-even point, modeling, scans, shared scans, concurrent index accesses, hybrid layouts, column-groups, memory bandwidth vs. latency


Class : Modern Storage Engines: Log-Structured Merge Trees

Concepts: log-structured merge trees (LSM-trees), fence pointers, Bloom filter, size ratio, merge policy, leveling, tiering, optimize for workload, sorted runs, levels, Lagrange multipliers


Research Talk : Scaling Write-Intensive Key-Value Stores

Visiting lecture from Niv Dayan, Harvard University

Bio: Niv Dayan is a postdoc at the Data Systems Lab at Harvard. He holds a PhD from the IT University of Copenhagen. Niv works at the intersection of systems and theory for designing efficient data storage, and in his current work he identifies and maps the fundamentally best scalability trade-offs for key-value stores.


Class : Modern Storage Engines: HTAP Systems

Concepts: key-value stores, state, point queries, blind updates, read-modify-write, locality, epoch, latch-free design, cache-friendly design, immutable file, mutable file, append-only systems, in-place updates, conflict-free replicated data type (CRDT)


Class : Indexing C: Data Skipping

Concepts: partitioning, horizontal partitioning, vertical partitioning, hybrid partitioning, zonemaps, tuple reconstruction, normalized schema, denormalized schema, clustering, use of clustering and feature extraction for partitioning


Class : Indexing D: Adaptive Indexing

Concepts: adaptive indexing, cracking, stochastic cracking, hybrid cracking, scan, sort and binary search, adaptive adaptive indexing, radix partitioning, TLB, software managed buffers, non-temporal streaming stores, partitioning fanout, skew, adaptive indexing convergence rate, simulated annealing, uniform/normal/zipfian distribution


Class canceled - Replaced by OH

Research Talk : Introduction to Time-Series Data Mining

Visiting lecture from John Paparrizos, University of Chicago

Bio: John Paparrizos is a postdoctoral researcher at the University of Chicago. He holds a PhD in Computer Science from Columbia University. His research revolves around developing methods, tools, and systems to help store, organize, retrieve, analyze, and transform into actionable knowledge massive collections of evolving data (i.e., time series, streams of data, and sensor/IoT data).

Class : In-Situ Data Processing: Efficiently Accessing Raw Data Files

Concepts: in-situ query processing, raw data files, adaptive partitioning, fine-grained indexing, query-based vs. homogenous partitioning, implicit clustering, eviction policy, workload shift, memory consumption


Class : Scientific Databases: Multi-dimensional Data Management

Concepts: array management systems, multi-dimensional arrays, storage manager, tiles, thread-safe, process-safe, atomicity, dense vs. sparse arrays, global cell order, fragments, dense vs. sparse fragments, consolidation


Class : Distributed Databases: Global Distributed Systems

Concepts: global-scale distributed database, concurrency control, Paxos, data sharding, external consistency, TrueTime API


Class : Map/Reduce: Data Management at Scale

Concepts: Map/Reduce, distributed file systems, resource management, positional delta trees, SQL-on-Map/Reduce, massively parallel processing database management systems (MPP DBMS), user-defined functions (UDF), encryption, authentication, user role management, elasticity, data warehouse, fact table, merge-join, partitioning attributes across (PAX) layout, message passing interface (MPI), two-phase commit (2PC), ACID

Systems/Approaches: Hadoop, Spark, YARN, HDFS Hive, Impala, Vectorwise, Actian Vector


Class : Data Systems for ML: Data Processing Primitives for ML

Concepts: machine learning, statistical analysis, data science, data exploration, data exploration systems, data systems for machine learning, Data Canopy, decomposable statistics, work repetition, materialized aggregates, data movement, basic aggregates, chunk, segment trees, offline/online/speculative execution, smart cache, eviction policy


Class : ML for Data Systems: Automatic Data System Design

Concepts: physical design, machine learning, tuning knobs, database administrator (DBA), OtterTune, workload characterization, k-means clustering, knob identification, automatic tuner, feature selection, linear regression model, ordinary least squares, workload mapping (dynamic vs. static), configuration recommendation


Class : Indexing E: Learned Indexes

Concepts: learned indexes, B+-Trees, hash indexes, Bloom filters, cumulative density function (CDF), recursive model index, hybrid learned indexes


Class : Indexing F: The Periodic Table of Access Methods

Concepts: data structure design, cost synthesis, first principles, learned cost models, design primitives, what-if design questions, design space, partitioning, search algorithms, benchmarking, model fitting


Class : Project Presentations A


Class : Project Presentations B