Overview of Storage System: Transactional, Analytical, and Hybrid Database

Introduction to database systems

Back in the 1960s, disk-based databases were used to manage massive amounts of data and real-time transactional systems for numerous users. The foundation of everyday company activities in sectors like banking, retail, and e-commerce is these transactional systems. For OLTP database systems, Relational Database Management Systems (RDBMS) such as Oracle SQL, Microsoft SQL Server, MySQL, and PostgreSQL are commonly used.

But from time to time, the business data grew in volume and complexity. People then began to notice that maintaining historical data in an OLTP database resulted in slower queries and increased I/O consumption from read and write operations. Organizations started utilizing distinct OLAP systems for analytics and row-based OLTP databases for transactions in order to solve this problem.

Understanding the basic of OLTP and OLAP

What is OLTP?

Online Transaction Processing (OLTP) or operational database is a system designed to manage and execute large volumes of real-time transactions from multiple concurrent users. OTLP systems involve a fairly simple set of queries that are supposed to perform basic set of queries such as create and inserts, read, updates, and deletions (CRUD). OLTP system enables many people to execute a large number of database transactions.

What is OLAP?

Online Analytical Processing (OLAP) is a category of database systems designed to support complex queries and multidimensional analysis. OLAP systems are designed to evaluate historical data that has been aggregated from multiple sources, including OLTP systems. Usually, OLAP systems are located on a data warehouse or any other central data repository used by the company.

Key differences between OLTP and OLAP

There are several key differences between OLTP and OLAP systems, including:

Row-Oriented Database Systems

Row-oriented database systems (also know as relational database) store data by writing each row's contents sequentially to disk. All of the values connected to a single entity, such as name, address, phone number, or transaction, are contained in each row. Typical usage of DBMSs that are row-oriented include MySQL, PostgreSQL, Microsoft SQL Server, and SQLite.

They are optimized for for transactional workloads that regularly need reading, writing, or updating complete rows, such as in Online Transaction Processing (OLTP) systems. Relational database databases often use a block or page structure, where pages consist of several rows of data. This makes it possible to get specific records quickly. This makes it possible to get specific records quickly.

A Quick Guide to Entities in DBMS | https://www.boardinfinity.com/blog/a-quick-guide-to-entities-in-dbms/

Advantages using row-oriented database:

Efficient for transactional workloads

Row-oriented databases are ideal for transactional workloads that require fast and effective retrieval, updating, or insertion of whole records. Since all attributes of a record are stored together, accessing and changing a single row needs few disk seek time and rotational latency.

Simpler Implementation and Management

Row-oriented databases storage model aligns with the intuitive knowledge of data management in organization, making it easier to design, implement, and maintain databases. Row-oriented databases are easy to deal with for developers or engineers who are typically experienced with relational databases.

Easier schema changes

Row-oriented databases make it easier to modify the schema, allowing for straightforward updates, inserts, and deletes without significant overhead. This makes them suitable for situations involving frequent write operations or changing data requirements.

Disadvantages using row-oriented database:

Inefficient for analytical queries

For read-intensive analytical queries involving read-heavy analytical queries that involve aggregations and filtering row-oriented databases may execute more slowly. They frequently have to retrieve whole rows, including columns unrelated to the query, which increases I/O and slows down performance.

Less Efficient Compression

In row-oriented databases, rows typically contain diverse data types and properties, making it challenging to achieve high compression. This can lead to higher storage requirements.

Complex and expensive to scale

Effective data distribution over several servers and nodes may be more difficult with row-oriented databases since the data must be stored in rows. As the dataset expands, the solution often involves adding more powerful hardware, such as increasing CPU or memory capacities on current servers, which might become unaffordable.

Column-Oriented Database Systems

Column-oriented database systems (also know as columnar database) store data by columns instead of rows. This database structure stores all of the values associated with a specific attribute. In order to minimize input/output and increase speed, columnar databases only access the columns that are required, rather than scanning every column in each row.

Popular examples of columnar databases or analytical databases like BigQuery, Amazon Redshift, Snowflake, and Clickhouse are built with columnar storage. Online Analytical Processing (OLAP) systems involve complex queries that often scan large portions of data but only need a few columns and perform aggregations (e.g., SUM, AVG, COUNT). Columnar databases maintained long history of timestamp events and aggregating time series data to handle time-series analytics.

Column and Row Based Database Storage | https://bi-insider.com/wp-content/uploads/2014/07/column-oriented-database1.jpg

Advantages using column-oriented database:

High performance for analytical queries

Columnar databases excel in read-heavy analytical workloads and provide quick query processing times. This reduces I/O overhead and maximizes CPU cache utilization.

Efficient storage and compression

Data can be efficiently compressed because it is arranged into columns. Similar data types and properties within columns enable high compression ratios to reduce storage costs and improve query performance.

Scalability for Analytical Workloads

Since data is stored in columns, adding additional nodes or servers can more simple with each node handling a subset of columns. Large datasets may be handled by modern columnar databases, which also scale well across distributed systems.

Disadvantages using column-oriented database:

Slower for Write Operations

Columnar databases may perform more slowly for write-heavy transactional workloads including frequent updates, inserts, and deletes, even while they excel in analytical queries. Inserts and deletes can be more difficult to handle, and updates frequently will be more complex to manage.

Complexity in schema design

In order to maximize query efficiency, design a schema for a columnar database may need to be carefully considered. Complex a schema that involve multiple join tables can sometimes be more resource-intensive in columnar databases.

Less Efficient for Small Queries

For simple queries or transactions involving only a few rows, columnar databases can be less efficient. Columnar storage is less appropriate for these types of small-scale operations.

Modern Solution for Data System

While both row-oriented and columnar-oriented databases have unique advantages, there are some situations in which they are not suitable. These gaps are filled by hybrid architecture, which combine the benefits of both databases designs.

Hybrid Transactional/Analytical Processing (HTAP)

Single System for OLTP and OLAP can use hybrid row-wise and columnar or single data organizations for both ingestion and analytics. This database is built to facilitate high-volume transactional data, handle high-velocity transactional data, while also supporting complex analytical queries. This allows the system to scale horizontally, but it may also make maintaining data consistency more challenging.

Several database software worth to mention that offer HTAP solutions:

SingleStore: combines rowstore and columnstore within a single table to support mixed workloads efficiently.
TiDB: an open-source distributed HTAP system that uses TiKV for row-based storage and TiFlash as a columnar extension for analytics.
Google AlloyDB | F1 Lightning: cloud-native solutions that provide HTAP capabilities as a service.
Snowflake: hybrid tables specifically designed to function as an effective Hybrid Transactional and Analytical Processing (HTAP) system to eliminating the need for data replication and reducing architectural complexity

Simplified architecture of HTAP databases | https://static.pingcap.com/files/2023/04/17024548/image-97.png

Data Lakehouse Approach

A data lakehouse is a unified data storage architecture that combines data management and ACID transaction support with the affordability and adaptability of a data lake. A modern data architecture is designed to overcome these limitations by providing a scalable, integrated, and controlled method of data management, a contemporary data management. This method facilitates seamlessly move data between a central data lake and various purpose-built data services.

Several database software worth to mention that offer Data Lakehouse solutions:

Databricks: lakebase gives advantage to fully managed Postgres database with fast reads, writes, and updates.
AWS cloud data lakehouse: redshift for high-performance structured analytics and machine learning is frequently combined with S3 for storage.
Google cloud data lakehouse: BigQuery, which serves as a serverless data warehouse and can query data stored directly in GCS using BigLake, serves as the central hub of the lakehouse architecture.

Lakebase syncs Unity Catalog tables | https://docs.databricks.com/aws/en/oltp/projects/reverse-etl

Summary

Database systems evolved from initial disk-based transactional OLTP (Online Transaction Processing) systems, which use row-based storage, to separate OLAP (Online Analytical Processing) systems as data volume and complexity increased. Row-oriented databases, such as MySQL and PostgreSQL, store entire records contiguously, making them highly efficient for transactional workloads involving frequent insertion, update, and retrieval of full records. In contrast, column-oriented databases, like Amazon Redshift and Snowflake, store data by attribute, which allows them to excel in read-heavy analytical queries, achieve higher compression rates, and only read necessary columns, significantly reducing I/O. This dichotomy ultimately led to the development of modern solutions like Hybrid Transactional/Analytical Processing (HTAP) and Data Lakehouse architectures, which integrate both row-wise and columnar approaches to handle both high-volume transactions and complex analytics within a single system. I hope you enjoyed reading this.

Overview of Storage System: Transactional, Analytical, and Hybrid Database

Introduction to database systems

Understanding the basic of OLTP and OLAP

What is OLTP?

What is OLAP?

Key differences between OLTP and OLAP

Row-Oriented Database Systems

Column-Oriented Database Systems

Modern Solution for Data System

Hybrid Transactional/Analytical Processing (HTAP)

Data Lakehouse Approach

Summary

Comments

More from this blog

Demystifying Data Warehouses

Introduction to Data Processing and Data Transformation

Data Modeling Fundamentals part-2: Data Modeling Approach and Techniques

Data Modeling Fundamentals part-1: Introduction to Data Model and Data Modeling Types

Command Palette

Introduction to database systems

Understanding the basic of OLTP and OLAP

What is OLTP?

What is OLAP?

Key differences between OLTP and OLAP

Row-Oriented Database Systems

Column-Oriented Database Systems

Modern Solution for Data System

Hybrid Transactional/Analytical Processing (HTAP)

Data Lakehouse Approach

Summary

Comments

More from this blog